User:Firespeaker/Steps for writing a language pair

Document Resources

  • If there are any dictionaries or other resources for the language pair, you should make a list of these.

Comparative Grammar

Transducers

Write a morphological transducer for each language. At first they can be pretty basic—i.e., you don't have to get through all the steps for each one.

Start a bidix

Map some tags

Add some nouns

Add some verbs

Other word classes

Solving more complicated translation problems

There are many sorts of translation problems where a simple one-to-one mapping in dix won't work. This section outlines the various types of problems that arise and how to solve them using Apertium.

Disambiguation

The lowest level of these translation problems is morphological disambiguation, which is needed when one form has multiple interpretations. These readings need to be sorted out before any words are looked up in dix. For example, the form (kaz)енді can have the following readings: ен<n><acc> "width", ен<v><iv><ifi><p3><pl> "they entered", ен<v><iv><ifi><p3><sg> "s/he/it entered", енді<adv> "now". Without the help of disambiguation, the morphological analyser outputs all of these readings, and one of them is chosen at random (for our purposes) to translate through dix. If the wrong reading is chosen, you get weird things like "They entered I know answer" instead of "Now I know the answer".

To disambiguate these readings, you need to make rules based on the grammatical context. E.g., if the next word is a verb, then you can remove the verb reading from the list of possible correct readings; if the word is at the end of the sentence, you can remove the noun from the list of possible correct readings; etc. You can also choose a default reading (in this case, the adverb reading is by far the most common) and select other readings in very specific contexts—e.g., not just in certain grammatical positions, but when they occur with certain other words.
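
The context rules just described can be sketched in a few lines of Python. This is only a toy illustration of the logic, not CG syntax: the function names and the `next_is_verb` / `sentence_final` flags are hypothetical, and the readings are copied from the енді example above.

```python
# Toy illustration only: real disambiguation rules are not written in Python.
# The readings are the ones listed for the form енді.
READINGS = [
    "ен<n><acc>",
    "ен<v><iv><ifi><p3><pl>",
    "ен<v><iv><ifi><p3><sg>",
    "енді<adv>",
]


def remove_readings(readings, tag):
    """Drop readings containing `tag`, but never remove the last reading."""
    kept = [r for r in readings if tag not in r]
    return kept if kept else list(readings)


def disambiguate(readings, next_is_verb=False, sentence_final=False):
    """Apply the two example context rules and return the surviving readings."""
    remaining = list(readings)
    if next_is_verb:
        # Rule 1: the next word is a verb, so remove the verb readings.
        remaining = remove_readings(remaining, "<v>")
    if sentence_final:
        # Rule 2: the word ends the sentence, so remove the noun reading.
        remaining = remove_readings(remaining, "<n>")
    return remaining


print(disambiguate(READINGS, next_is_verb=True))
```

Each rule only removes readings, so the order in which the rules fire does not matter in this sketch, and a word is never left without any reading.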

In Apertium, morphological disambiguation is done with CG (Constraint Grammar) rules. For languages with stand-alone vanilla transducers, the file will be something like apertium-xxx/apertium-xxx.rlx. For pairs that don't have stand-alone vanilla transducers, the files (in the pair directory) will be apertium-xxx-yyy.xxx-yyy.rlx and apertium-xxx-yyy.yyy-xxx.rlx, where the first language code in the second group is the language that's being disambiguated. See Apertium and Constraint Grammar and Introduksjon til føringsgrammatikk for a more specific introduction to disambiguation with CG rules.

Lexical Selection

The next level of problem, once you've chosen the correct morphological form, is lexical selection. Lexical selection is best suited to cases where one stem in Language A translates to multiple stems in Language B. It could be that the word in Language A has multiple senses, or it could be that there is a specific word in Language B used in specific contexts.

For example, we might have a word like (kaz)шеш<v><tv>, which has many different senses in English: solve, untie, decide, untangle, figure [something] out, take off (clothes), etc. The correct translation here can often be chosen based on, for example, the object (or the noun immediately preceding the verb) in Kazakh.

Lexical selection rules are usually written by first giving a very simple rule for a default translation and then adding rules, based on specific contexts, for the other "more specific" translations.
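
The default-plus-context pattern can be sketched in Python as follows. This is a toy illustration, not .lrx syntax: the function name is hypothetical, and the Kazakh nouns and the translations paired with them are assumed examples chosen for the sketch rather than rules taken from an existing pair.

```python
# Toy illustration only: real lexical selection rules are not written in Python.
# Default English translation for шеш<v><tv> when no context rule applies.
DEFAULT_TRANSLATION = "solve"

# Context rules keyed on the noun immediately preceding the verb.
# These noun/translation pairings are assumptions made for this example.
CONTEXT_RULES = {
    "мәселе": "solve",    # "problem"
    "түйін": "untie",     # "knot"
    "киім": "take off",   # "clothes"
}


def select_translation(preceding_noun=None):
    """Pick an English stem for шеш<v><tv> based on the preceding noun."""
    return CONTEXT_RULES.get(preceding_noun, DEFAULT_TRANSLATION)


print(select_translation("түйін"))
print(select_translation())
```

Any noun without a rule of its own falls back to the default, which is why the default should be the most generally acceptable translation.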

Lexical selection is done in .lrx files in the language pair directory (e.g., apertium-xxx-yyy.xxx-yyy.lrx). To learn more about lexical selection rules, see How to get started with lexical selection rules.

Transfer

The highest level of translation problem arises when the right translation cannot be produced by a model that translates one-to-one, e.g. X<foo><bar> to Y<foo><bar>; that is, a model where each stem has a corresponding stem and the tags stay the same. Transfer rules nevertheless start from this one-to-one model and modify its output. The modifications cover a range of possibilities:

  • The tags are different, e.g. X<foo><bar> → Y<goo><baz>,
  • Tags are removed or added, e.g. X<foo><bar> → Y<foo> or X<foo> → Y<foo><baz>,
  • The order of two things needs to be reversed, e.g. X<foo> A<bar> → B<bar> Y<foo>,
  • A word needs to be added or removed, e.g. X<foo> → B<bar> Y<foo> or X<foo> A<bar> → Y<foo>.
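
The operations in this list can be made concrete with a small Python sketch. It is a toy model, not the XML used in transfer files: a lexical unit is represented as a hypothetical (stem, tags) pair that has already been through the one-to-one lookup, and all function names are made up for the example.

```python
# Toy illustration only: real transfer rules are not written in Python.
# A lexical unit is modelled as (stem, [tags]) after the one-to-one lookup.

def render(units):
    """Format units in the Y<foo><bar> notation used in the list above."""
    return " ".join(
        stem + "".join(f"<{tag}>" for tag in tags) for stem, tags in units
    )


def change_tags(unit, mapping):
    """Replace tags according to `mapping`, leaving other tags unchanged."""
    stem, tags = unit
    return (stem, [mapping.get(tag, tag) for tag in tags])


def drop_tag(unit, tag):
    """Remove one tag from a unit."""
    stem, tags = unit
    return (stem, [t for t in tags if t != tag])


def swap(first, second):
    """Reverse the order of two units."""
    return [second, first]


def add_word(unit, new_unit):
    """Insert a new unit in front of an existing one."""
    return [new_unit, unit]


y = ("Y", ["foo", "bar"])
print(render([change_tags(y, {"foo": "goo", "bar": "baz"})]))
print(render([drop_tag(y, "bar")]))
print(render(swap(("Y", ["foo"]), ("B", ["bar"]))))
print(render(add_word(("Y", ["foo"]), ("B", ["bar"]))))
```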

Transfer rules live in the files apertium-xxx-yyy.xxx-yyy.t*x and apertium-xxx-yyy.yyy-xxx.t*x, where the second language in the file's pair name is the one whose output needs to be controlled. You can read more about transfer rules at A long introduction to transfer rules.

Evaluation

There are three main ways to evaluate your language pair, each one telling you a different type of strength or weakness: trimmed coverage, WER, and testvoc. Ideally you should evaluate your pair for all three of these, in both directions (language X → language Y and language Y → language X).

Trimmed Coverage

Trimmed coverage tells you what percentage of a corpus your language pair is able to analyse.

For trimmed coverage, you'll need at least one corpus of text in your source language (ideally one in each language, if your pair will support translation in both directions). The corpus or corpora should ideally represent a range of content, i.e., not just sports coverage, economics, or children's stories. Think about what your user base might want to use your language pair to translate.
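
As a rough sketch of the calculation, the Python below computes the percentage of analysed units in a text. It assumes the corpus has already been run through the pair's analyser and is in Apertium stream format, where each unit looks like ^surface/analysis$ and the analysis of an unknown word starts with *. The function name is hypothetical, and the sketch ignores escaped characters in the stream.

```python
import re

# One lexical unit in stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of lexical units that received an analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is output as ^word/*word$, so its analysis starts with "*".
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


sample = "^енді/ен<n><acc>/енді<adv>$ ^foo/*foo$"
print(coverage(sample))
```

Note that punctuation is also output as lexical units, so it counts towards the total in this sketch.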

To learn more about trimmed coverage, see An introduction to trimmed coverage.

WER

WER (Word Error Rate) tells you what percentage of the words of a text need to be changed before it is publication-ready.

To evaluate WER, you need to choose a text in the source language, run it through your language pair, and make a copy of the output in which you manually fix all the translation mistakes, rendering the text cleanly and accurately in your target language. You then use a script to compare the output of the language pair to your cleaned translation.
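
The comparison itself can be sketched as a word-level edit distance divided by the length of the cleaned translation. The Python below is a minimal illustration under that assumption, with a hypothetical function name; it is not the script referred to above, and it splits on whitespace only.

```python
def word_error_rate(hypothesis, reference):
    """Return the WER (in percent) of `hypothesis` against `reference`."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the hypothesis words seen
    # so far and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a hypothesis word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref)


print(word_error_rate("They entered I know answer", "Now I know the answer"))
```

Because the number of edits is divided by the length of the cleaned translation, the result can exceed 100% when the output is much longer than the reference.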

You can learn more about performing WER evaluation.

testvoc

The basic idea of a testvoc is to check that all possible forms in one language have a clean translation (i.e., no *, @, #, etc. symbols representing various types of translation errors) in the other language. Any gaps that exist will come up, and are likely to be systematic. A testvoc helps you identify these gaps and fix them.
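
The check for a clean translation can be sketched in Python as a scan for words that begin with one of the error symbols named above. This is an illustration with a hypothetical function name, not the testvoc scripts themselves, and it assumes one translated form per line.

```python
import re

# A word that starts with one of the symbols *, @ or #.
ERROR_MARK = re.compile(r"(?<!\S)[*@#]\S+")


def dirty_lines(translated_lines):
    """Return the lines that contain at least one error-marked word."""
    return [line for line in translated_lines if ERROR_MARK.search(line)]


output = ["the houses", "the #house", "*үйлер", "in @үй"]
for line in dirty_lines(output):
    print(line)
```

Grouping the flagged lines by symbol or by tag sequence is a natural next step, since the gaps are likely to be systematic.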

Testvoc can be done in several ways: you can test just one part of speech (e.g., nouns) to make sure that all forms are translated correctly; you can test a pair based on the entire transducer of one of the languages (as a source language); or you can run testvoc on a corpus (for problems that might come up because of transfer rules).

You can learn more about testvoc and how to perform one.