Why we trim
Revision as of 13:22, 13 October 2012
In Apertium language pairs, we keep the monolingual and bilingual dictionaries trimmed, so that all entries from the analyser will have some match in the bidix, and all output from transfer will have some entry in the generator.[1]
There are several reasons for doing this:
- If a word has no bilingual dictionary entry, it is output as the analysed lemma with an '@' in front: "children" comes out as "@child" and, worse, a multiword like "be worthwhile" comes out as "@be" (with debug symbols turned off, these become "child" and "be"). When post-editing, the post-editor then has to keep checking the source language text to recover what was actually written, whereas a word unknown to the analyser appears in its original form and can be translated there and then. When gisting, the reader may be misled about the content instead of noticing that a word was unknown.
  - Note: there could be a technical solution that carries over the source word when it is not in the bidix, but this has not been tested so far.
- Transfer rules quite often use target language information from the bidix to fill in tags. If transfer from English to Spanish reads a chunk like "the children", the Spanish determiner needs to get its number and gender from the target language noun. Looking at the output of the source language analyser is not enough: the bidix can change the number for certain nouns, and gender is not even present in the source language. The transfer rule expects to have this information; without it, not only is the noun output as @lemma, but the determiner is not generated correctly either. The effect gets worse with bigger chunks.
  - One might work around this with exceptions in the transfer rules, e.g. guessing number and gender when the bidix gives none, but this leads to an enormous increase in transfer complexity: every tag has to be presumed unknown, and developer time is wasted on bug-hunting and workarounds instead of improving translation quality.
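The two problems above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Apertium's transfer code: the `BIDIX` and `DETERMINERS` tables and the `transfer_det_noun` function are hypothetical names invented for the illustration, and the target noun is returned as a bare lemma that a generator would still have to inflect.

```python
# Toy bidix (hypothetical data): English lemma -> (Spanish lemma, gender).
BIDIX = {
    "child": ("niño", "m"),
    "house": ("casa", "f"),
}

# Spanish definite articles keyed by (gender, number).
DETERMINERS = {
    ("m", "sg"): "el", ("m", "pl"): "los",
    ("f", "sg"): "la", ("f", "pl"): "las",
}


def transfer_det_noun(lemma, number):
    """Translate 'the <noun>' and return (determiner, target noun lemma)."""
    entry = BIDIX.get(lemma)
    if entry is None:
        # No bidix entry: the lemma is passed on with '@' in front, and
        # there is no gender available to choose a determiner with.
        return (None, "@" + lemma)
    target_lemma, gender = entry
    return (DETERMINERS[(gender, number)], target_lemma)


print(transfer_det_noun("child", "pl"))  # noun found in the toy bidix
print(transfer_det_noun("dog", "pl"))    # noun missing from the toy bidix
```

With the noun in the bidix, the determiner can agree with it. Without it, the noun degrades to "@dog" and the determiner cannot be chosen at all, which is the failure a trimmed analyser avoids by never producing that analysis in the first place.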
There are now several ways of automatically trimming a monodix, so it is perfectly possible to keep one main, full monodix shared by several language pairs, and have each language pair compile it into a trimmed monodix for analysis.
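Conceptually, trimming keeps only those monodix entries that the bidix can translate. The sketch below models this with plain Python dictionaries and sets; it is an assumption-laden illustration, not how the actual tools operate on compiled dictionaries, and `trim_monodix` and the sample data are hypothetical.

```python
def trim_monodix(monodix, bidix_source_lemmas):
    """Return only the monodix entries whose lemma has a bidix entry."""
    return {
        lemma: analyses
        for lemma, analyses in monodix.items()
        if lemma in bidix_source_lemmas
    }


# One full monodix (hypothetical data) shared by two language pairs.
full_monodix = {"child": ["n"], "dog": ["n"], "house": ["n"]}

# Each pair has its own bidix, so each gets its own trimmed monodix.
pair_a = trim_monodix(full_monodix, {"child", "house"})
pair_b = trim_monodix(full_monodix, {"dog"})

print(sorted(pair_a))
print(sorted(pair_b))
```

The full monodix is left untouched, so work on it benefits every pair that uses it, while each pair only analyses words it can actually translate.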
Footnotes
- ↑ Typically this goes for both translation directions, although a language pair released for only one direction might be trimmed in that direction only.
See also
- Automatically trimming a monodix
- Testvoc
- Session 7: Data consistency and quality on wiki.apertium.eu