Why we trim

From Apertium
Revision as of 08:17, 12 October 2012 by Unhammer

In Apertium language pairs, we keep the monolingual and bilingual dictionaries trimmed, so that all entries from the analyser will have some match in the bidix, and all output from transfer will have some entry in the generator.[1]

There are several reasons for doing this:

  1. If a word has no bilingual dictionary entry, it is output as the analysed lemma with an '@' in front: "children" is output as "@child", and, worse, a multiword like "be worthwhile" is output as "@be" (or, with debug symbols turned off, as "child" and "be"). This means that when post-editing, the post-editor constantly has to look at the source language text, whereas an unknown word could be translated there and then. And when gisting, the reader might be tricked into misunderstanding the content instead of noticing that there is an unknown word.
    • Note: there could be a technical solution to carrying over the source word if it's not in the bidix, but this has so far not been tested.
  2. Transfer rules quite often use target language information from the bidix to fill in tags etc. If transfer from English to Spanish reads a chunk like "the children", the Spanish determiner needs to get its number and gender information from the target language noun. It is not enough to look at the output of the source language analyser: number can be changed by the bidix for certain nouns, and gender is not even present in the source language. The transfer rule expects to have this information; without it, not only will the noun be output as @lemma, but the determiner will not be generated correctly either. This effect gets even worse with bigger chunks.
    • One might work around this by having exceptions in the transfer rules to e.g. guess number and gender if bidix doesn't give any, but this leads to an enormous increase in transfer complexity – all tags have to be presumed to be unknown, and developer time is wasted on bug-hunting and workarounds instead of improving translation quality.
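
The two problems above can be sketched with a toy model. This is only an illustration: the dictionaries, the function name and the data layout are hypothetical and do not reflect Apertium's real formats or its transfer engine, and for simplicity number is copied from the source side even though a real bidix may change it.

```python
# Toy model only: these dictionaries and functions are hypothetical and
# do not reflect Apertium's real data formats or transfer engine.

# Toy bidix: English lemma -> (Spanish lemma, gender).
# Simplification: number is taken from the source side here.
BIDIX = {
    "child": ("niño", "m"),
    "house": ("casa", "f"),
}

# Toy Spanish definite determiner forms, keyed by (gender, number).
DETERMINERS = {
    ("m", "sg"): "el", ("m", "pl"): "los",
    ("f", "sg"): "la", ("f", "pl"): "las",
}


def transfer_det_noun(lemma, number):
    """Translate 'the <noun>' and return (determiner, target lemma)."""
    entry = BIDIX.get(lemma)
    if entry is None:
        # No bidix entry: the noun is marked with '@', and there is no
        # gender available to choose a determiner form with.
        return (None, "@" + lemma)
    target_lemma, gender = entry
    return (DETERMINERS[(gender, number)], target_lemma)


known = transfer_det_noun("child", "pl")    # bidix supplies the gender
missing = transfer_det_noun("dog", "pl")    # "dog" is not in the toy bidix
```

With the entry present, the determiner can agree with the noun; with the entry missing, the rule has nothing to agree with, so both words in the chunk come out wrong.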


There are now several ways of automatically trimming a monodix (see Automatically_trimming_a_monodix), so it is perfectly possible to keep one main, full monodix shared by several language pairs, which each individual language pair compiles into a trimmed monodix for analysis.
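
Conceptually, trimming keeps only those analyser entries that the bidix can also handle. A minimal sketch of that idea follows, assuming dictionaries reduced to plain lists and sets of lemmas; the real tools work on compiled dictionaries rather than Python collections, and every name below is hypothetical.

```python
# Toy model only: real trimming operates on compiled dictionaries,
# not on Python lists and sets; all names here are hypothetical.

def trim_monodix(monodix_lemmas, bidix_lemmas):
    """Keep only the analyser lemmas that also have a bidix entry."""
    return [lemma for lemma in monodix_lemmas if lemma in bidix_lemmas]


# One full monodix shared by several pairs, and one pair's bidix.
full_monodix = ["child", "house", "worthwhile", "be"]
pair_bidix = {"child", "house", "be"}

# "worthwhile" has no bidix entry, so it is dropped for this pair.
trimmed = trim_monodix(full_monodix, pair_bidix)
```

After trimming, a word like "worthwhile" is treated as unknown by this pair's analyser instead of reaching transfer without a translation.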


Footnotes

  1. Typically this goes for both translation directions, although a language pair only released for one direction might only be trimmed in that direction.