Why we trim
In Apertium language pairs, we keep the monolingual and bilingual dictionaries trimmed, so that all entries from the analyser will have some match in the bidix, and all output from transfer will have some entry in the generator.
There are several reasons for doing this:
- If a word has no bilingual dictionary entry, it is output as the analysed lemma with an '@' in front: "children" is output as "@child" and, worse, a multiword like "be worthwhile" is output as "@be" (with debug symbols turned off, these appear as plain "child" and "be"). When post-editing, this forces the post-editor to keep looking at the source language text, whereas an unknown word could be translated there and then. And when gisting, the reader might be tricked into misunderstanding the content instead of noticing that there is an unknown word.
- Transfer rules quite often use target language information from bidix to fill in tags etc. If transfer from English to Spanish reads a chunk like "the children", the Spanish determiner needs to get the number and gender information from the target language noun. It is not enough to look at the output of the source language analyser: number can be changed by bidix for certain nouns, and gender is not even present in the source language. The transfer rule expects to have this information; without it, not only will the noun be output as @lemma, but the determiner will not be generated correctly either. This effect gets even worse with bigger chunks.
- One might work around this by having exceptions in the transfer rules to e.g. guess number and gender if bidix doesn't give any, but this leads to an enormous increase in transfer complexity – all tags have to be presumed to be unknown, and developer time is wasted on bug-hunting and workarounds instead of improving translation quality.
- Although there could be a technical solution to carrying over the source word if it's not in the bidix (lt-proc -o), this leads to problems with compounds and other multiwords that are split into two lexical units before bidix lookup: what do you do when part of a multiword is unknown? For example, if we have ^writes about/write<vblex>+about<pr>$, this is currently split before bidix lookup into two units, ^write<vblex>$ ^about<pr>$, without surface forms, and if only one is unknown after bidix lookup, the other will still translate: ^write<vblex>/escribir<vblex>$ ^about<pr>/@about<pr>$. If, on the other hand, we were to keep the surface form around, we would also have to keep it as one unit in bidix lookup, such that if part of the multiword were unknown, all of it would be marked unknown, giving something like ^@writes about/write<vblex>+@about<pr>$.
- Can't you just distribute the surface form over the two units, as in ^writes/write<vblex>$ ^about/about<pr>$? While in this constructed example the split was at a space, it could be anywhere, and the surface form gives no general indication of where. We have multiwords that split in the middle of contractions (^au/à<pr>+le<det><def><m><sg>$) or in the middle of compounds (^vasskokaren/vatn<n>+kokar<n>$).
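The multiword behaviour described in the list above can be pictured with a small Python sketch. It is only a toy model and not how the Apertium tools are implemented: the names split_multiword, bidix_lookup and TOY_BIDIX are made up for illustration, and the one-entry dictionary is an assumption chosen to reproduce the "writes about" example.

```python
import re

# Toy bidix (assumption for illustration): (lemma, first tag) -> target lemma.
TOY_BIDIX = {("write", "vblex"): "escribir"}


def split_multiword(lexical_unit):
    """Split '^surface/lemma<tags>+lemma<tags>$' into its analyses.

    The surface form is dropped, since nothing says where it should be cut.
    """
    body = lexical_unit.strip("^$")
    _surface, analysis = body.split("/", 1)
    return analysis.split("+")


def bidix_lookup(analysis):
    """Look up one analysis; mark it with '@' if the toy bidix has no entry."""
    match = re.fullmatch(r"([^<]+)((?:<[^>]+>)+)", analysis)
    lemma, tags = match.group(1), match.group(2)
    first_tag = tags[1:tags.index(">")]
    target = TOY_BIDIX.get((lemma, first_tag))
    if target is None:
        return f"^{analysis}/@{analysis}$"
    return f"^{analysis}/{target}{tags}$"


units = split_multiword("^writes about/write<vblex>+about<pr>$")
print(" ".join(bidix_lookup(unit) for unit in units))
# prints: ^write<vblex>/escribir<vblex>$ ^about<pr>/@about<pr>$
```

Because the split happens on the analysis side only, the same function handles ^au/à<pr>+le<det><def><m><sg>$ without having to decide where "au" should be cut.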
There are now several ways of automatically trimming a monodix, so it is perfectly possible to keep one main, full monodix used by several language pairs, which in each individual language pair is compiled into a trimmed monodix for analysis.
- Typically this goes for both translation directions, although a language pair only released for one direction might only be trimmed in that direction.
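As a rough illustration of what trimming achieves, the following Python sketch keeps only those analyser entries for which every part has a bidix entry. It is a conceptual toy under stated assumptions: real trimming operates on compiled dictionaries, and the function name trim_monodix, the dict-of-lists layout and the sample entries are all hypothetical.

```python
def trim_monodix(monodix_entries, bidix_source_side):
    """Return the entries whose analyses are fully covered by the bidix.

    monodix_entries: surface form -> list of analyses such as 'write<vblex>+about<pr>'
    bidix_source_side: set of analyses the bidix can translate
    (both layouts are assumptions made for this sketch)
    """
    trimmed = {}
    for surface, analyses in monodix_entries.items():
        kept = [
            analysis
            for analysis in analyses
            # A multiword survives only if every '+'-joined part is in bidix.
            if all(part in bidix_source_side for part in analysis.split("+"))
        ]
        if kept:
            trimmed[surface] = kept
    return trimmed


full_monodix = {
    "writes": ["write<vblex>"],
    "writes about": ["write<vblex>+about<pr>"],
    "children": ["child<n>"],
}
bidix = {"write<vblex>", "child<n>"}

print(trim_monodix(full_monodix, bidix))
# prints: {'writes': ['write<vblex>'], 'children': ['child<n>']}
```

In this toy, "writes about" is dropped from the analyser because "about" has no bidix entry, so it would reach the reader as an ordinary unknown word rather than as a half-translated multiword.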