Integration and tagset conversion with Giellatekno
Revision as of 10:57, 11 November 2015
One way to set up a language pair nowadays is to use transducers from Giellatekno together with pair-specific data in Apertium. This setup is tricky because a lot of machinery is involved in the tagset conversion.
Let's assume you're using giella-xxx, giella-yyy and apertium-xxx-yyy. What are the relevant files?
Giellatekno side
- giella-xxx/tools/mt/apertium
- giella-xxx/tools/mt/apertium/tagsets
- giella-xxx/tools/mt/apertium/tagsets/apertium.postproc.relabel: This file is used for 1:1 tag conversions, for example if you want to change <cc> to <cnjcoo>.
- giella-xxx/tools/mt/apertium/tagsets/modify-tags.regex: This file is used for 1:1, 1:n and n:1 tag conversions, for example if you want to change <sg3> to <p3><sg>.
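To make the two example conversions concrete, here is a minimal sketch of their effect on an analysis string. It uses sed purely as an illustration and is not the syntax of apertium.postproc.relabel or modify-tags.regex; the function name convert_tags and the sample input are hypothetical.

```shell
# Illustration only: mimics the effect of the two example conversions.
# This is NOT the format used by the Giellatekno relabel files.
convert_tags() {
    # 1:1 conversion: <cc> becomes <cnjcoo>
    # 1:n conversion: <sg3> becomes <p3><sg>
    sed -e 's/<cc>/<cnjcoo>/g' \
        -e 's/<sg3>/<p3><sg>/g'
}

# Hypothetical analysis string, read from standard input
printf 'word<vblex><sg3>\n' | convert_tags
```

Seeing the intended input and output side by side like this can help when deciding whether a change belongs in the 1:1 file or in the regex file.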
Apertium side
- apertium-xxx-yyy/gt2apertium.cg3r: This file is used for converting the CG file tags to Apertium tags. You may have to convert tags in more than one place.
Testing and troubleshooting
It often takes a lot of time and patience to get the tags as they should be. Here are some tips for working out which file you need to look in.
Apertium side
- Check the trimmed analyser
$ echo , | hfst-lookup xxx-yyy.automorf.hfst
, ,<cm> 0,000000
If this comes out as an unknown word, then the fault is probably in your bilingual dictionary apertium-xxx-yyy.xxx-yyy.dix. You can grep for the analysis using something like:
$ lt-expand apertium-xxx-yyy.xxx-yyy.dix | grep '<cm>'
,<cm>:,<cm>
If the word comes out but has the wrong tag, then the fault is probably in one of the Giellatekno relabel scripts; see the files described above in giella-xxx/tools/mt/apertium/tagsets.
- Check the untrimmed analyser
$ echo , | hfst-lookup .deps/xxx.automorf.hfst
, ,<cm> 0,000000
Giellatekno side
- Check the relabelled analyser
$ echo , | hfst-lookup tools/mt/apertium/analyser-mt-apertium-desc.yyy.hfstol
, ,<cm> 0,000000
- Check the unrelabelled analyser
$ echo , | hfst-lookup src/analyser-gt-desc.hfstol
, ,+CLB 0,000000
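When repeating these checks for many tokens, it may help to classify each lookup result automatically. The following is a rough sketch, not part of either toolchain: the helper name classify_lookup is hypothetical, and it assumes that hfst-lookup marks an unknown word by appending +? to the analysis, which you should confirm against your own output.

```shell
# Hypothetical helper: reads hfst-lookup output on standard input and
# prints "unknown", "ok" or "wrong-tag" for the expected tag given as $1.
classify_lookup() {
    expected_tag=$1
    output=$(cat)
    # Assumption: unknown words are reported with a trailing "+?" analysis
    if printf '%s\n' "$output" | grep -qF -- '+?'; then
        echo unknown
    # Fixed-string match so that < > + are not treated as regex syntax
    elif printf '%s\n' "$output" | grep -qF -- "$expected_tag"; then
        echo ok
    else
        echo wrong-tag
    fi
}
```

It could be used like this: echo , | hfst-lookup xxx-yyy.automorf.hfst | classify_lookup '<cm>'. Following the tips above, a result of unknown from the trimmed analyser points towards the bilingual dictionary, while wrong-tag points towards the relabel files.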