User:Firespeaker/Steps for writing a language pair

From Apertium

Document Resources

  • If there are any dictionaries or other resources for the language pair, you should make a list of these.

Comparative Grammar

Transducers

Write a morphological transducer for each language. At first they can be pretty basic—i.e., you don't have to get through all the steps for each one.

Start a bidix

Map some tags

Add some nouns

Add some verbs

Other word classes

Solving more complicated translation problems

There are many sorts of translation problems where a simple one-to-one mapping in dix won't work.

Disambiguation

The lowest level of these translation problems is morphological disambiguation. This is where one form has multiple interpretations. These need to be sorted out before any words are looked up in dix. For example, you have the form (kaz)енді, which can have the following readings: ен<n><acc> "width", ен<v><iv><ifi><p3><pl> "they entered", ен<v><iv><ifi><p3><sg> "s/he/it entered", енді<adv> "now". The morphological analyser, without the help of disambiguation, spits out all these readings, and one of them is chosen at random (for our purposes) to translate through dix. If you get the wrong reading, you get weird things like "They entered I know answer" instead of "Now I know the answer".

To disambiguate these readings, you need to make rules based on the grammatical context. E.g., if the next word is a verb, then you can remove the verb reading from the list of possible correct readings; if the word is at the end of the sentence, you can remove the noun from the list of possible correct readings; etc. You can also choose a default reading (in this case, the adverb reading is by far the most common) and select other readings in very specific contexts—e.g., not just in certain grammatical positions, but when they occur with certain other words.
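The idea behind such rules can be sketched in a few lines of Python. This is only a toy model and not Constraint Grammar syntax: the data layout, the function names, and the second word of the example (келді, assumed here to be a past-tense form of the verb кел "come") are illustrative assumptions. It applies the two removal rules described above and then falls back on the adverb reading as the default.

```python
# Toy model of context-based disambiguation. This is NOT Constraint Grammar
# syntax; the data layout and function names are hypothetical.

def has_tag(reading, tag):
    """True if the reading carries the given tag."""
    return tag in reading["tags"]

def remove_unless_empty(readings, tag):
    """Drop readings carrying `tag`, but never delete the last reading."""
    kept = [r for r in readings if not has_tag(r, tag)]
    return kept or readings

def disambiguate(cohorts, default_tag="adv"):
    """Apply two removal rules, then prefer the default reading if it survives."""
    result = []
    for i, cohort in enumerate(cohorts):
        readings = list(cohort["readings"])
        nxt = cohorts[i + 1] if i + 1 < len(cohorts) else None

        # Rule 1: if the next word can only be a verb, drop verb readings here.
        if nxt is not None and all(has_tag(r, "v") for r in nxt["readings"]):
            readings = remove_unless_empty(readings, "v")

        # Rule 2: at the end of the sentence, drop noun readings.
        if nxt is None:
            readings = remove_unless_empty(readings, "n")

        # Default: if a reading with the default tag is still available, select it.
        preferred = [r for r in readings if has_tag(r, default_tag)]
        if preferred:
            readings = preferred

        result.append({"surface": cohort["surface"], "readings": readings})
    return result

# Example input; the second cohort is an assumed analysis for illustration.
sentence = [
    {"surface": "енді", "readings": [
        {"lemma": "ен", "tags": ["n", "acc"]},
        {"lemma": "ен", "tags": ["v", "iv", "ifi", "p3", "pl"]},
        {"lemma": "ен", "tags": ["v", "iv", "ifi", "p3", "sg"]},
        {"lemma": "енді", "tags": ["adv"]},
    ]},
    {"surface": "келді", "readings": [
        {"lemma": "кел", "tags": ["v", "iv", "ifi", "p3", "sg"]},
    ]},
]

disambiguated = disambiguate(sentence)
print(disambiguated[0]["readings"])
```

Real CG rules express the same kind of condition declaratively and can look at arbitrary context, so this sketch only shows the order of reasoning: remove what the context rules out, then settle on a default.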

In Apertium, morphological disambiguation is done with CG (Constraint Grammar) rules. For languages with stand-alone vanilla transducers, the file will be something like apertium-xxx/apertium-xxx.rlx; for pairs that don't have stand-alone vanilla transducers, the files (in the pair directory) will be apertium-xxx-yyy.xxx-yyy.rlx and apertium-xxx-yyy.yyy-xxx.rlx, where the first language code in the second group is the language that's being disambiguated. See Apertium and Constraint Grammar and Introduksjon til føringsgrammatikk for a more specific introduction to disambiguation with CG rules.

Lexical Selection

The next level of problem, once you've chosen the correct morphological form, is lexical selection. The sort of problem this best solves is when you have one stem in Language A that translates to multiple stems in Language B. It could be that the word in Language A has multiple senses, or it could be that there is a specific word in Language B used in specific contexts.

For example, we might have a word like (kaz)шеш<v><tv>, which has many different senses in English: solve, untie, decide, untangle, figure [something] out, take off (clothes), etc. The correct translation here can often be chosen, for example, based on the object (or the noun immediately preceding the verb) in Kazakh.

Usually lexical selection rules are written by writing a very simple rule for a default translation and then writing rules based on specific contexts for "more specific"/other translations.
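The "default plus specific contexts" pattern might look like this as a toy Python model. It is not .lrx syntax; the table names, the function, and the Kazakh noun lemmas with their glosses (түйін "knot", киім "clothes", мәселе "problem") are assumptions chosen for illustration.

```python
# Toy model of lexical selection. This is NOT .lrx syntax; names and
# example entries are hypothetical.

# Default translation for each source stem.
DEFAULT = {"шеш": "solve"}

# More specific translations, keyed on (verb stem, preceding noun stem).
# The noun lemmas and their glosses are illustrative assumptions.
CONTEXT_RULES = {
    ("шеш", "түйін"): "untie",     # assumed gloss: "knot"
    ("шеш", "киім"): "take off",   # assumed gloss: "clothes"
}

def select_translation(verb_lemma, preceding_noun=None):
    """Return the context-specific translation if one matches, else the default."""
    specific = CONTEXT_RULES.get((verb_lemma, preceding_noun))
    if specific is not None:
        return specific
    return DEFAULT[verb_lemma]

print(select_translation("шеш", "түйін"))   # untie
print(select_translation("шеш", "мәселе"))  # solve (no specific rule, so the default)
```

Keeping the default separate from the context table mirrors how the rules are usually written: the default is always available, and each added context only overrides it in a narrow case.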

To learn more about lexical selection rules, see How to get started with lexical selection rules. Lexical selection is done in .lrx files in the language pair directory (e.g., apertium-xxx-yyy.xxx-yyy.lrx).

Transfer

The highest level of translation problem is when there is no way to get the right translation from a model that translates one-to-one, e.g. X<foo><bar> to Y<foo><bar>; that is, a model where each stem has a corresponding stem and the tags are the same. Transfer rules still start from this one-to-one model, and then modify its output. This includes a range of possibilities:

  • The tags are different, e.g. X<foo><bar> → Y<goo><baz>,
  • Tags are removed or added, e.g. X<foo><bar> → Y<foo> or X<foo> → Y<foo><baz>,
  • The order of two things needs to be reversed, e.g. X<foo> A<bar> → B<bar> Y<foo>,
  • A word needs to be added or removed, e.g. X<foo> → B<bar> Y<foo> or X<foo> A<bar> → Y<foo>.
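The four kinds of change listed above can be modelled on the placeholder units X<foo><bar> with a short Python sketch. This is a toy illustration and not the transfer rule format: the BIDIX table and all function names are hypothetical, and a lexical unit is represented simply as a (lemma, tags) pair.

```python
# Toy model of transfer operations. This is NOT .t*x syntax; BIDIX and the
# helper names are hypothetical.

# One-to-one stem mapping, standing in for the bilingual dictionary.
BIDIX = {"X": "Y", "A": "B"}

def translate(unit):
    """One-to-one step: swap the stem, keep the tags unchanged."""
    lemma, tags = unit
    return (BIDIX[lemma], list(tags))

def retag(unit, tag_map):
    """Replace tags according to tag_map; unmapped tags are kept."""
    lemma, tags = unit
    return (lemma, [tag_map.get(t, t) for t in tags])

def drop_tags(unit, unwanted):
    """Remove the given tags from a unit."""
    lemma, tags = unit
    return (lemma, [t for t in tags if t not in unwanted])

def add_tags(unit, extra):
    """Append extra tags to a unit."""
    lemma, tags = unit
    return (lemma, list(tags) + list(extra))

def render(units):
    """Print units in the X<foo><bar> notation used above."""
    return " ".join(lemma + "".join(f"<{t}>" for t in tags) for lemma, tags in units)

x = translate(("X", ["foo", "bar"]))

# Tags are different: X<foo><bar> -> Y<goo><baz>
print(render([retag(x, {"foo": "goo", "bar": "baz"})]))

# A tag is removed: X<foo><bar> -> Y<foo>
print(render([drop_tags(x, {"bar"})]))

# A tag is added: X<foo> -> Y<foo><baz>
print(render([add_tags(translate(("X", ["foo"])), ["baz"])]))

# Order is reversed: X<foo> A<bar> -> B<bar> Y<foo>
pair = [translate(("X", ["foo"])), translate(("A", ["bar"]))]
print(render(list(reversed(pair))))

# A word is added: X<foo> -> B<bar> Y<foo>
print(render([("B", ["bar"]), translate(("X", ["foo"]))]))
```

Each operation starts from the output of the one-to-one translate step, which matches the premise that transfer modifies a one-to-one model rather than replacing it.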

Files are apertium-xxx-yyy.xxx-yyy.t*x and apertium-xxx-yyy.yyy-xxx.t*x, where the second language in the file's pair name is the one whose output needs to be controlled. You can read more about transfer rules at A long introduction to transfer rules.

Evaluate