Difference between revisions of "Integration and tagset conversion with Giellatekno"
Revision as of 11:05, 11 November 2015
One language pair setup in use nowadays takes its transducers from Giellatekno and its pair-specific data from Apertium. This setup is tricky because a lot of machinery is involved in the tagset conversion.
Assuming you're using giella-xxx, giella-yyy and apertium-xxx-yyy, what are the relevant files?
Giellatekno side
giella-xxx/tools/mt/apertium: This is where all the relabelled transducers live.
giella-xxx/tools/mt/apertium/tagsets: This directory holds the different files used for relabelling the transducers.
giella-xxx/tools/mt/apertium/tagsets/apertium.postproc.relabel: This file is used for 1:1 tag conversions, for example if you want to change <cc> to <cnjcoo>.
giella-xxx/tools/mt/apertium/tagsets/modify-tags.regex: This file is used for 1:1, 1:n and n:1 tag conversions, for example if you want to change <sg3> to <p3><sg>. It is also used if you need context-sensitive replacement for specific lemmas, for example changing ,<clb> to ,<cm>.
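To make the difference between the conversion types concrete, here is a small sketch that mimics the example rewrites above on plain analysis strings with sed. It only illustrates the intended input and output: the real conversions are applied to the transducers by the Giellatekno build using the files listed above, and the function names and the placeholder "lemma" are hypothetical.

```shell
# Illustration only: mimic the tag rewrites on plain analysis strings.
# The real relabelling operates on the transducers, not on text.

# 1:1 conversion (the kind handled by apertium.postproc.relabel):
# <cc> becomes <cnjcoo>.
relabel_one_to_one() {
    sed 's/<cc>/<cnjcoo>/g'
}

# 1:n conversion (the kind handled by modify-tags.regex):
# <sg3> becomes <p3><sg>.
relabel_one_to_many() {
    sed 's/<sg3>/<p3><sg>/g'
}

# Context-sensitive conversion: <clb> becomes <cm>, but only when
# the lemma is a comma.
relabel_comma() {
    sed 's/^,<clb>/,<cm>/'
}
```

For instance, echo 'lemma<sg3>' | relabel_one_to_many turns one tag into two, while relabel_comma leaves .<clb> untouched because the lemma is not a comma.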
Apertium side
apertium-xxx-yyy/gt2apertium.cg3r: This file is used for converting the CG file tags to Apertium tags. You may have to convert tags in more than one place.
Testing and troubleshooting
Getting the tags as they should be often takes a lot of time and patience. Here are some tips for working out which file you need to look in.
Apertium side
Check the trimmed analyser
$ echo , | hfst-lookup xxx-yyy.automorf.hfst
,	,<cm>	0,000000
If this comes out as an unknown word, then the fault is probably in your bilingual dictionary apertium-xxx-yyy.xxx-yyy.dix. You can grep for the analysis using something like:
$ lt-expand apertium-xxx-yyy.xxx-yyy.dix | grep '<cm>'
,<cm>:,<cm>
If the word comes out but has the wrong tag, then the fault is probably in one of the Giellatekno relabel scripts; see the files described above in giella-xxx/tools/mt/apertium/tagsets.
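The decision described here can be sketched as a small shell helper that reads hfst-lookup output and reports which case you are in. This is only a sketch: the function name classify_lookup is hypothetical, and it assumes that hfst-lookup marks unknown words with a +? suffix on the analysis, so check that against the output of your own installation.

```shell
# Classify hfst-lookup output read from stdin.
# $1 is the tag you expect to see, e.g. '<cm>'.
# Assumption: unknown words are printed with a '+?' marker.
classify_lookup() {
    expected_tag=$1
    output=$(cat)
    if printf '%s\n' "$output" | grep -qF -e "$expected_tag"; then
        echo "ok"
    elif printf '%s\n' "$output" | grep -qF -e '+?'; then
        echo "unknown: check the bilingual dictionary"
    else
        echo "wrong tag: check the relabel files"
    fi
}
```

It can be used at the end of the lookup pipeline, for example: echo , | hfst-lookup xxx-yyy.automorf.hfst | classify_lookup '<cm>'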
Check the untrimmed analyser
$ echo , | hfst-lookup .deps/xxx.automorf.hfst
,	,<cm>	0,000000
If the word comes out with the wrong tag here, then again the fault is probably in one of the Giellatekno relabel scripts; see the files described above in giella-xxx/tools/mt/apertium/tagsets.
Giellatekno side
Check the relabelled analyser
$ echo , | hfst-lookup tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol
,	,<cm>	0,000000
If the word comes out with the wrong tag here, then again the fault is probably in one of the Giellatekno relabel scripts; see the files described above in giella-xxx/tools/mt/apertium/tagsets.
Check the unrelabelled analyser
$ echo , | hfst-lookup src/analyser-gt-desc.hfstol
,	,+CLB	0,000000
If the word comes out with the wrong tag here, then the fault is probably in one of the lexc source files, which you can find in giella-xxx/src/morphology/.
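The four checks can also be run in one go. The following is a rough sketch that looks up one token in each analyser in the order used on this page, so you can see at which stage the tag changes. The function name lookup_all and the variables APERTIUM_DIR and GIELLA_DIR are hypothetical; the defaults assume both checkouts sit in the current directory, which may not match your setup.

```shell
# Hypothetical checkout locations; adjust or export before calling.
APERTIUM_DIR=${APERTIUM_DIR:-apertium-xxx-yyy}
GIELLA_DIR=${GIELLA_DIR:-giella-xxx}

# Look up one token in each analyser, from the trimmed Apertium
# analyser down to the unrelabelled Giellatekno analyser.
lookup_all() {
    token=$1
    for fst in \
        "$APERTIUM_DIR/xxx-yyy.automorf.hfst" \
        "$APERTIUM_DIR/.deps/xxx.automorf.hfst" \
        "$GIELLA_DIR/tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol" \
        "$GIELLA_DIR/src/analyser-gt-desc.hfstol"
    do
        echo "== $fst"
        if [ -f "$fst" ]; then
            printf '%s\n' "$token" | hfst-lookup "$fst"
        else
            # A missing file usually means that stage has not been built.
            echo "(not found, skipping)"
        fi
    done
}
```

Calling lookup_all , prints the analyses of the comma at every stage; the first stage where the tag differs from what you expect points to the file to look in.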