How to figure out what's wrong with a translation
Latest revision as of 13:57, 26 September 2016
To figure out what's going wrong in a translation, you have to understand how apertium (and the specific language pair) works, and then you need to go rooting around to fix a mistake, add a missing rule, etc.
 Missing stems
The easiest kind of mistake. You'll need to make sure that the input stem is in the source-language monolingual dictionary, that the desired output stem is in the target-language monolingual dictionary, and that the two are matched in the bilingual dictionary.
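The check described above can be sketched in Python. This is only an illustration, assuming dix-style XML in which monolingual entries carry an `lm` attribute and bilingual entries hold an `<l>`/`<r>` pair; the dictionary fragments, the English-Spanish data, and the function names are all hypothetical, and pairs whose monolingual data is not in dix format would need a different check.

```python
import xml.etree.ElementTree as ET

# Toy dictionary fragments (hypothetical data, heavily trimmed).
SRC_MONODIX = """<dictionary><section id="main" type="standard">
  <e lm="house"><i>house</i><par n="house__n"/></e>
  <e lm="dog"><i>dog</i><par n="house__n"/></e>
</section></dictionary>"""

TGT_MONODIX = """<dictionary><section id="main" type="standard">
  <e lm="casa"><i>casa</i><par n="casa__n"/></e>
</section></dictionary>"""

BIDIX = """<dictionary><section id="main" type="standard">
  <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
</section></dictionary>"""


def monodix_lemmas(xml_text):
    """Collect the lm attribute of every entry in a monolingual dictionary."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


def bidix_pairs(xml_text):
    """Collect (left, right) stem pairs from a bilingual dictionary."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for p in root.iter("p"):
        left, right = p.find("l"), p.find("r")
        if left is not None and right is not None:
            # .text is the stem text that precedes the first <s/> tag
            pairs.add((left.text, right.text))
    return pairs


def missing_stems(src_stem, tgt_stem):
    """Return the list of places where the stem (or the pairing) is missing."""
    problems = []
    if src_stem not in monodix_lemmas(SRC_MONODIX):
        problems.append("source monolingual dictionary")
    if tgt_stem not in monodix_lemmas(TGT_MONODIX):
        problems.append("target monolingual dictionary")
    if (src_stem, tgt_stem) not in bidix_pairs(BIDIX):
        problems.append("bilingual dictionary")
    return problems


print(missing_stems("house", "casa"))
print(missing_stems("dog", "perro"))
```

An empty list means the stem pair is present everywhere it needs to be; otherwise the list names the files to edit.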
 When the wrong word is translated
Missing or incorrect Disambiguation and Lexical Selection rules often cause the wrong word to be translated. Problems in disambiguation can also cause the wrong form of a word to be translated.
The lowest level of these translation problems is morphological disambiguation, which is needed when one form has multiple interpretations. These need to be sorted out before any words are looked up in the dix. For example, the form (kaz) енді can have the following readings: ен<acc> "width", ен<pl> "they entered", ен<sg> "s/he/it entered", and енді<adv> "now". Without the help of disambiguation, the morphological analyser outputs all of these readings, and one of them is chosen at random (for our purposes) to be translated through the dix. If the wrong reading is chosen, you get weird things like "They entered I know answer" instead of "Now I know the answer".
To disambiguate these readings, you need to make rules based on the grammatical context. E.g., if the next word is a verb, then you can remove the verb reading from the list of possible correct readings; if the word is at the end of the sentence, you can remove the noun from the list of possible correct readings; etc. You can also choose a default reading (in this case, the adverb reading is by far the most common) and select other readings in very specific contexts—e.g., not just in certain grammatical positions, but when they occur with certain other words.
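The logic of such rules can be sketched in Python as a toy model. This is not CG syntax and not how apertium implements disambiguation; the `Reading` class, the part-of-speech labels, and the second word in the example are assumptions made for illustration. The one behaviour borrowed from CG is that a rule never removes the last remaining reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    pos: str    # hypothetical coarse label: "n", "v" or "adv"
    gloss: str


# The four readings of енді listed above.
ENDI = [
    Reading("ен", "n", "width"),
    Reading("ен", "v", "they entered"),
    Reading("ен", "v", "s/he/it entered"),
    Reading("енді", "adv", "now"),
]


def remove(readings, pos):
    """Drop readings with the given pos, but never remove the last reading."""
    kept = [r for r in readings if r.pos != pos]
    return kept or readings


def disambiguate(sentence, default_pos="adv"):
    """sentence is a list of words, each word a list of candidate readings."""
    result = []
    for i, readings in enumerate(sentence):
        remaining = list(readings)
        nxt = sentence[i + 1] if i + 1 < len(sentence) else None
        # Rule 1: if the next word can only be a verb, remove verb readings.
        if nxt is not None and all(r.pos == "v" for r in nxt):
            remaining = remove(remaining, "v")
        # Rule 2: at the end of the sentence, remove noun readings.
        if nxt is None:
            remaining = remove(remaining, "n")
        # Default: if still ambiguous, prefer the most common reading.
        if len(remaining) > 1:
            preferred = [r for r in remaining if r.pos == default_pos]
            if preferred:
                remaining = preferred
        result.append(remaining)
    return result


# Hypothetical two-word input: енді followed by an unambiguous verb.
sentence = [ENDI, [Reading("біл", "v", "know")]]
for word in disambiguate(sentence):
    print([r.gloss for r in word])
```

The ordering matters: the context rules run first and the default only applies to whatever ambiguity they leave behind.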
For apertium, morphological disambiguation is done with CG (Constraint Grammar) rules. For languages with stand-alone vanilla transducers, the file will be something like apertium-xxx/apertium-xxx.rlx; for pairs that don't have stand-alone vanilla transducers, the files (in the pair directory) will be apertium-xxx-yyy.xxx-yyy.rlx and apertium-xxx-yyy.yyy-xxx.rlx, where the first language code in the second group is the language that's being disambiguated. See Apertium and Constraint Grammar and Introduksjon til føringsgrammatikk for a more specific introduction to disambiguation with CG rules.
 Lexical Selection
The next level of problem, once you've chosen the correct morphological form, is lexical selection. The sort of problem this best solves is when one stem in Language A translates to multiple stems in Language B. It could be that the word in Language A has multiple senses, or it could be that there is a specific word in Language B used in specific contexts.
For example, we might have a Kazakh transitive verb (tagged <tv>) that has many different senses in English: solve, untie, decide, untangle, figure [something] out, take off (clothes), etc. The correct translation here can often be chosen based on, for example, the object (or the noun immediately preceding the verb) in Kazakh.
Usually lexical selection rules are written by writing a very simple rule for a default translation and then writing rules based on specific contexts for "more specific"/other translations.
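The default-plus-exceptions pattern can be sketched in Python as follows. This is a toy model, not .lrx syntax; the context nouns are given as English glosses standing in for the Kazakh objects, and the particular noun-to-verb pairings are assumptions chosen for illustration.

```python
# Default translation used when no more specific rule matches.
DEFAULT_TRANSLATION = "solve"

# Context-specific rules: gloss of the preceding noun -> translation.
# These pairings are hypothetical examples.
CONTEXT_RULES = {
    "knot": "untie",
    "shirt": "take off",
    "problem": "solve",
}


def select_translation(preceding_noun=None):
    """Pick a translation based on the noun before the verb, if any."""
    return CONTEXT_RULES.get(preceding_noun, DEFAULT_TRANSLATION)


print(select_translation("knot"))
print(select_translation("weather"))
print(select_translation())
```

Keeping the default separate from the exceptions means a new context rule can be added without touching the behaviour for every other context.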
Lexical selection is done in .lrx files in the language pair directory (e.g., apertium-xxx-yyy.xxx-yyy.lrx). To learn more about lexical selection rules, see How to get started with lexical selection rules.
 When the output grammar is wrong
The highest level of translation problem is when there is no way to get the right translation from a model that translates one-to-one, e.g. X<bar> to Y<bar>; that is, a model where each stem has a corresponding stem and the tags stay the same. Transfer rules start from this one-to-one model and modify its output. This includes a range of possibilities:
- The tags are different.
- Tags are removed or added.
- The order of two things needs to be reversed.
- A word needs to be added or removed.
Files are apertium-xxx-yyy.xxx-yyy.t*x and apertium-xxx-yyy.yyy-xxx.t*x, where the second language in the file's pair name is the one whose output needs to be controlled. You can read more about transfer rules at A long introduction to transfer rules.
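The four kinds of modification listed above can be sketched in Python on a toy representation in which each word is a (stem, tags) pair. This is not transfer-rule syntax; the helper names and the placeholder stems and tags (X, Y, foo, bar, baz) are hypothetical.

```python
def change_tag(token, old, new):
    """Replace one tag with another, keeping the stem."""
    stem, tags = token
    return (stem, tuple(new if t == old else t for t in tags))


def add_tag(token, tag):
    """Append a tag to the end of the tag list."""
    stem, tags = token
    return (stem, tags + (tag,))


def remove_tag(token, tag):
    """Remove every occurrence of a tag."""
    stem, tags = token
    return (stem, tuple(t for t in tags if t != tag))


def swap(tokens, i, j):
    """Return a copy of the sentence with two words exchanged."""
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out


def insert_word(tokens, i, token):
    """Return a copy of the sentence with a word inserted at position i."""
    return tokens[:i] + [token] + tokens[i:]


def delete_word(tokens, i):
    """Return a copy of the sentence without the word at position i."""
    return tokens[:i] + tokens[i + 1:]


# One-to-one output for a two-word input, before any transfer rule applies.
sentence = [("X", ("foo", "bar")), ("Y", ("foo",))]

print(change_tag(sentence[0], "bar", "baz"))
print(add_tag(sentence[1], "bar"))
print(remove_tag(sentence[0], "bar"))
print(swap(sentence, 0, 1))
print(insert_word(sentence, 1, ("Z", ("det",))))
print(delete_word(sentence, 0))
```

Each helper returns a new value rather than editing in place, which makes it easy to compare the one-to-one output with the modified output.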
If the syntax of the languages is really different, or to handle a few pieces of divergence, you'll want to do syntactic chunking. You may use Helsinki Apertium Workshop/Session 6 and Chunking as guides for how to do chunking. Chunking is rather involved, so it's recommended that you avoid it until absolutely necessary. For closely related languages, you can get fairly good results without resorting to chunking.