Integration and tagset conversion with Giellatekno
Revision as of 10:55, 11 November 2015 by Francis Tyers
One language pair setup in use nowadays combines transducers from Giellatekno with pair-specific data in Apertium. This setup is tricky because a lot of machinery is involved in the tagset conversion.
Let's assume you're using giella-xxx, giella-yyy and apertium-xxx-yyy. What are the relevant files?
Giellatekno side
giella-xxx/tools/mt/apertium
giella-xxx/tools/mt/apertium/tagsets
giella-xxx/tools/mt/apertium/tagsets/apertium.postproc.relabel: This file is used for 1:1 tag conversions, for example if you want to change <cc> to <cnjcoo>.
giella-xxx/tools/mt/apertium/tagsets/modify-tags.regex: This file is used for 1:1, 1:n and n:1 tag conversions, for example if you want to change <sg3> to <p3><sg>.
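To make the difference between the two kinds of conversion concrete, the following sketch imitates their effect on an analysis string with sed. It is only an illustration, assuming tags appear as literal <...> substrings; it does not reproduce the syntax of either file, and the function names and the sample strings (word<cc>, word<v><sg3>) are hypothetical.

```shell
#!/bin/sh
# Illustration only: mimic the *effect* of the two conversion files with sed.
# Neither function reproduces the real file syntax; both names are made up.

# 1:1 conversion, the kind handled by apertium.postproc.relabel:
# one tag is renamed to exactly one other tag.
relabel_one_to_one() {
    printf '%s\n' "$1" | sed 's/<cc>/<cnjcoo>/g'
}

# 1:n conversion, the kind handled by modify-tags.regex:
# one tag is split into a sequence of tags.
modify_one_to_many() {
    printf '%s\n' "$1" | sed 's/<sg3>/<p3><sg>/g'
}

# Hypothetical sample analyses, not taken from any real analyser.
relabel_one_to_one 'word<cc>'       # prints word<cnjcoo>
modify_one_to_many 'word<v><sg3>'   # prints word<v><p3><sg>
```

If a tag needs to become more than one tag, or several tags need to merge into one, the 1:1 file cannot express it, which is the case modify-tags.regex covers.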
Apertium side
apertium-xxx-yyy/gt2apertium.cg3r: This file is used for converting the CG file tags to Apertium tags. You may have to convert tags in more than one place.
Testing and troubleshooting
Getting the tags as they should be often takes a lot of time and patience. Here are some tips for checking which file you need to look in.
Apertium side
- Check the trimmed analyser
$ echo , | hfst-lookup xxx-yyy.automorf.hfst
, ,<cm> 0,000000
- Check the untrimmed analyser
$ echo , | hfst-lookup .deps/xxx.automorf.hfst
, ,<cm> 0,000000
Giellatekno side
- Check the relabelled analyser
$ echo , | hfst-lookup tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol
, ,<cm> 0,000000
- Check the unrelabelled analyser
$ echo , | hfst-lookup src/analyser-gt-desc.hfstol
, ,+CLB 0,000000
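The four checks above can be run in one pass, from the trimmed analyser back to the unrelabelled one, so that the first stage where the tag looks wrong points to the file to edit. The following is a minimal sketch, assuming hfst-lookup is on the PATH and that the two checkouts sit side by side; the check_stage and check_all functions and the LOOKUP, APERTIUM_DIR and GIELLA_DIR variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: look up one token in each analyser stage, in the order listed above.
# LOOKUP, APERTIUM_DIR and GIELLA_DIR are hypothetical variables; adjust them
# to match your own checkouts.
LOOKUP=${LOOKUP:-hfst-lookup}
APERTIUM_DIR=${APERTIUM_DIR:-apertium-xxx-yyy}
GIELLA_DIR=${GIELLA_DIR:-giella-xxx}

# check_stage LABEL FILE TOKEN
# Print the analyses of TOKEN in FILE, or report that FILE does not exist.
check_stage() {
    printf '== %s (%s)\n' "$1" "$2"
    if [ -f "$2" ]; then
        printf '%s\n' "$3" | "$LOOKUP" "$2"
    else
        printf 'missing: %s\n' "$2"
    fi
}

# check_all TOKEN
# Run the four checks with the file paths used in the examples above.
check_all() {
    check_stage 'trimmed analyser' "$APERTIUM_DIR/xxx-yyy.automorf.hfst" "$1"
    check_stage 'untrimmed analyser' "$APERTIUM_DIR/.deps/xxx.automorf.hfst" "$1"
    check_stage 'relabelled analyser' "$GIELLA_DIR/tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol" "$1"
    check_stage 'unrelabelled analyser' "$GIELLA_DIR/src/analyser-gt-desc.hfstol" "$1"
}

# Default to the comma used in the examples above.
check_all "${1:-,}"
```

Reading the output top to bottom, a tag that is correct in the unrelabelled analyser but wrong in the relabelled one suggests a problem in the Giellatekno tagset files, while a tag that only goes wrong in the trimmed analyser suggests looking on the Apertium side.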