User:Firespeaker/Steps for writing a language pair

From Apertium

Document Resources

  • If there are any dictionaries or other resources for the language pair, you should make a list of these.

Comparative Grammar

Transducers

Write a morphological transducer for each language. At first they can be pretty basic—i.e., you don't have to get through all the steps for each one.

Start a bidix

Map some tags

Add some nouns

Add some verbs

Other word classes

Solving more complicated translation problems

There are many sorts of translation problems where a simple one-to-one mapping in the bilingual dictionary (dix) won't work.

Disambiguation

The lowest level of these translation problems is morphological disambiguation. This is where one form has multiple interpretations. These need to be sorted out before any words are looked up in dix. For example, the form (kaz)енді can have the following readings: ен<n><acc> "width", ен<v><iv><ifi><p3><pl> "they entered", ен<v><iv><ifi><p3><sg> "s/he/it entered", енді<adv> "now". Without the help of disambiguation, the morphological analyser spits out all of these readings, and one of them is chosen at random (for our purposes) to translate through dix. If you get the wrong reading, you get weird things like "They entered I know answer" instead of "Now I know the answer".
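To make the ambiguity concrete, here is a minimal Python sketch that splits one analysed form into its readings. It assumes the analyser output follows the Apertium stream format (`^surface/reading/reading$`); the function name `parse_lexical_unit` is made up for this illustration, and escaping and multiword units are ignored.

```python
import re


def parse_lexical_unit(unit):
    """Split '^surface/reading1/reading2$' into (surface, [(lemma, tags), ...])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^surface/reading/...$")
    surface, *readings = unit[1:-1].split("/")
    parsed = []
    for reading in readings:
        lemma = reading.split("<", 1)[0]         # text before the first tag
        tags = re.findall(r"<([^>]+)>", reading)  # every <tag> in order
        parsed.append((lemma, tags))
    return surface, parsed


unit = "^енді/ен<n><acc>/ен<v><iv><ifi><p3><pl>/ен<v><iv><ifi><p3><sg>/енді<adv>$"
surface, readings = parse_lexical_unit(unit)
for lemma, tags in readings:
    print(surface, "->", lemma, tags)
```

All four readings come back for the single surface form, which is exactly the situation disambiguation has to resolve.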

To disambiguate these readings, you need to make rules based on the grammatical context. E.g., if the next word is a verb, then you can remove the verb reading from the list of possible correct readings; if the word is at the end of the sentence, you can remove the noun reading from the list of possible correct readings; etc. You can also choose a default reading (in this case, the adverb reading is by far the most common) and select other readings in very specific contexts: not just in certain grammatical positions, but when they occur with certain other words.
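The logic of such rules can be sketched in Python. This is only an illustration of the idea, not Constraint Grammar syntax: the `Reading` class, the `DEFAULT_TAG` table, and the `disambiguate` function are all hypothetical, "the next word is a verb" is simplified to "every reading of the next word is a verb", and the tags on the second word are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    tags: tuple


# Assumed default: for "енді" the adverb reading is by far the most common.
DEFAULT_TAG = {"енді": "adv"}


def disambiguate(sentence):
    """sentence is a list of (surface, [Reading, ...]) pairs, one per word."""
    result = []
    for i, (surface, readings) in enumerate(sentence):
        kept = list(readings)
        next_readings = sentence[i + 1][1] if i + 1 < len(sentence) else None
        # Rule 1: if the next word is unambiguously a verb, drop verb readings.
        if next_readings is not None and all("v" in r.tags for r in next_readings):
            kept = [r for r in kept if "v" not in r.tags] or kept
        # Rule 2: at the end of the sentence, drop noun readings.
        if next_readings is None:
            kept = [r for r in kept if "n" not in r.tags] or kept
        # Default: if still ambiguous, prefer the default reading for this form.
        default = DEFAULT_TAG.get(surface)
        if len(kept) > 1 and default is not None:
            kept = [r for r in kept if default in r.tags] or kept
        result.append((surface, kept))
    return result


endi = [
    Reading("ен", ("n", "acc")),
    Reading("ен", ("v", "iv", "ifi", "p3", "pl")),
    Reading("ен", ("v", "iv", "ifi", "p3", "sg")),
    Reading("енді", ("adv",)),
]
sentence = [("енді", endi), ("білемін", [Reading("біл", ("v", "tv"))])]
result = disambiguate(sentence)
print(result[0])
```

The `or kept` after each filter means a rule never removes the last remaining reading, so every word always keeps at least one analysis.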

See Apertium and Constraint Grammar and Introduksjon til føringsgrammatikk for a more specific introduction to how this is done.

Lexical Selection

The next level of problem, once you've chosen the correct morphological form, is lexical selection. The sort of problem this best solves is when you have one stem in Language A that translates to multiple stems in Language B. It could be that the word in Language A has multiple senses, or it could be that there is a specific word in Language B used in specific contexts.

For example, we might have a word like (kaz)шеш<v><tv>, which has many different senses in English: solve, untie, decide, untangle, figure [something] out, take off (clothes), etc. The correct translation here can often be chosen, for example, based on the object (or the noun immediately preceding the verb) in Kazakh.

Usually you write lexical selection rules by starting with a very simple rule for a default translation and then adding rules based on specific contexts for "more specific"/other translations.
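The "default plus specific contexts" pattern can be sketched in Python as follows. This is not the actual rule format used by Apertium's lexical selection module; the tables and the `select_translation` function are hypothetical, the choice of "solve" as the default is an assumption, and the noun contexts (түйін "knot", киім "clothes") are illustrative examples rather than rules taken from a real language pair.

```python
# Assumed default translation for each source stem.
DEFAULTS = {"шеш": "solve"}

# More specific translations, keyed by (verb stem, stem of the preceding noun).
CONTEXT_RULES = {
    ("шеш", "түйін"): "untie",
    ("шеш", "киім"): "take off",
}


def select_translation(verb_lemma, preceding_noun=None):
    """Pick a context-specific translation if one matches, else the default."""
    specific = CONTEXT_RULES.get((verb_lemma, preceding_noun))
    if specific is not None:
        return specific
    return DEFAULTS[verb_lemma]


print(select_translation("шеш"))           # no usable context: default
print(select_translation("шеш", "түйін"))  # specific context matches
```

Keeping the default separate from the context table means a new context can be added without touching the behaviour for every other sentence.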

To learn more about lexical selection rules, see How to get started with lexical selection rules.

Transfer

Evaluate