Quick and dirty guide addendum: other important things
Hello! You might have reached this page from the previous guide: The quick and dirty guide to making a new language pair. This page documents further instructions after you have created a basic language pair.
Document Available Resources
If there are any dictionaries or other resources for the language pair, you should make a list of these and (if possible) keep them on hand while you work. Resources in a third language you know are also useful. E.g., if you're making an English-Qaraqalpaq MT system, and you can't find any good English-Qaraqalpaq dictionaries, but you know Russian, a Russian-Qaraqalpaq dictionary would be very useful. Don't be shy about getting creative in your use of resources, but be wary of mistakes in individual resources (even seemingly reputable ones!) and of resources that are low-quality overall.
Document Grammatical Differences (Comparative Grammar)
Make a page on this wiki systematically comparing the grammar of the two languages. This can range from simple differences, such as how plurals are formed or the word order of noun phrases (e.g., Adjective Noun versus Noun Adjective), to high-level differences, such as how passivised relative clauses are formed. Some good examples of such pages include English and Esperanto/Outstanding tests and Welsh to English/Regression tests. As with these pages, you should plan to keep track of which points you've been able to handle (through the implementation of various rules; see below) and which points remain unhandled.
Write a morphological transducer for each language in your pair. The transducers are the pillars which the pair stands on. They produce morphological analyses of your forms and generate forms from morphological analyses. The language pair essentially just maps analyses in one language to analyses in another language, so having good (and similar/compatible) transducers is really important.
At first, your transducers can be pretty basic, i.e., you don't have to get through all the steps to create each one. But you'll want to build them up, testing against corpora, etc. You want to have as many correct analyses as possible, and to generate as many correct forms as possible without overgenerating. For morphologically complex languages this can be quite a lot of work, but it pays off when you're working on other aspects of the pair and things "just work".
The lowest level of these translation problems is morphological disambiguation. This is where one form has multiple interpretations, which need to be sorted out before any words are looked up in dix. For example, the form (kaz) енді can have the following readings: ен<acc> "width", ен<pl> "they entered", ен<sg> "s/he/it entered", and енді<adv> "now". Without the help of disambiguation, the morphological analyser outputs all of these readings, and one is chosen at random (for our purposes) to be translated through dix. If the wrong reading is chosen, you get weird things like "They entered I know answer" instead of "Now I know the answer".
To disambiguate these readings, you need to make rules based on the grammatical context. E.g., if the next word is a verb, then you can remove the verb reading from the list of possible correct readings; if the word is at the end of the sentence, you can remove the noun from the list of possible correct readings; etc. You can also choose a default reading (in this case, the adverb reading is by far the most common) and select other readings in very specific contexts—e.g., not just in certain grammatical positions, but when they occur with certain other words.
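The kind of contextual rules described above can be sketched in Python as a toy model. This is not Constraint Grammar syntax and not how Apertium implements disambiguation; the function name `disambiguate`, the part-of-speech labels `n`, `v`, `adv`, and the second token `W2` are all assumptions made for illustration.

```python
# Toy model of context-based disambiguation (NOT real CG syntax).
# A token is (surface form, list of readings); a reading is (lemma, pos).
# The pos labels "n", "v", "adv" are illustrative, not a real tagset.

def disambiguate(tokens, default_pos="adv"):
    """Return one (surface, reading) pair per token."""
    result = []
    for i, (surface, readings) in enumerate(tokens):
        if not readings:
            result.append((surface, None))  # nothing to choose from
            continue
        remaining = list(readings)
        next_readings = tokens[i + 1][1] if i + 1 < len(tokens) else []

        # Rule 1: if the next word can only be a verb, drop verb readings.
        if next_readings and all(pos == "v" for _, pos in next_readings):
            pruned = [r for r in remaining if r[1] != "v"]
            remaining = pruned or remaining  # never remove the last reading

        # Rule 2: at the end of the sentence, drop noun readings.
        if i == len(tokens) - 1:
            pruned = [r for r in remaining if r[1] != "n"]
            remaining = pruned or remaining

        # Fallback: if still ambiguous, prefer the default reading.
        if len(remaining) > 1:
            preferred = [r for r in remaining if r[1] == default_pos]
            remaining = preferred[:1] or remaining[:1]

        result.append((surface, remaining[0]))
    return result


# "W2" stands in for any following word that is unambiguously a verb.
sentence = [
    ("енді", [("ен", "n"), ("ен", "v"), ("енді", "adv")]),
    ("W2", [("w2", "v")]),
]
chosen = disambiguate(sentence)
```

Each rule only removes readings and never removes the last one, which mirrors the idea of narrowing down a list of candidate readings rather than guessing a single answer outright.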
In Apertium, morphological disambiguation is done with CG (Constraint Grammar) rules. For languages with stand-alone vanilla transducers, the file will be something like apertium-xxx/apertium-xxx.rlx; for pairs that don't have stand-alone vanilla transducers, the files (in the pair directory) will be apertium-xxx-yyy.xxx-yyy.rlx and apertium-xxx-yyy.yyy-xxx.rlx, where the first language code in the second group is the language that's being disambiguated. See Apertium and Constraint Grammar and Introduksjon til føringsgrammatikk for a more specific introduction to disambiguation with CG rules.
The next level of problem, once you've chosen the correct morphological form, is lexical selection. The sort of problem this best solves is when you have one stem in Language A that translates to multiple stems in Language B. It could be that the word in Language A has multiple senses, or it could be that there is a specific word in Language B used in specific contexts.
For example, we might have a transitive verb (tagged <tv>) in Kazakh which has many different senses in English: solve, untie, decide, untangle, figure [something] out, take off (clothes), etc. The correct translation here can often be chosen based on, for example, the object (or the noun immediately preceding the verb) in Kazakh.
Usually lexical selection rules are written by writing a very simple rule for a default translation and then writing rules based on specific contexts for "more specific"/other translations.
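The "default plus specific contexts" pattern can be sketched in Python as follows. This is only an illustration of the logic, not .lrx syntax; the source stem `VERB_X`, the context labels `knot` and `clothes` (English glosses standing in for source-language nouns), and the function name `select_translation` are hypothetical.

```python
# Toy model of lexical selection: one default translation per stem,
# overridden by rules that match a specific context.
# "VERB_X" is a placeholder source stem; the context nouns are written
# as English glosses purely for readability.

DEFAULT_TRANSLATION = {"VERB_X": "solve"}

# (source stem, noun immediately preceding the verb, translation)
CONTEXT_RULES = [
    ("VERB_X", "knot", "untie"),
    ("VERB_X", "clothes", "take off"),
]


def select_translation(stem, preceding_noun=None):
    """Pick a context-specific translation, else fall back to the default."""
    for rule_stem, context_noun, translation in CONTEXT_RULES:
        if rule_stem == stem and context_noun == preceding_noun:
            return translation
    return DEFAULT_TRANSLATION[stem]


with_knot = select_translation("VERB_X", "knot")
with_other = select_translation("VERB_X", "problem")
```

Keeping the default separate from the context rules means a new sense can be handled by adding one rule, without touching the behaviour for every other context.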
Lexical selection is done in .lrx files in the language pair directory (e.g., apertium-xxx-yyy.xxx-yyy.lrx). To learn more about lexical selection rules, see How to get started with lexical selection rules.
The highest level of translation problem is when there is no way to get the right translation from a model that translates one-to-one, e.g., X<bar> to Y<bar>; that is, a model where each stem and each tag has a corresponding stem and the tags are the same. Transfer rules start from such a one-to-one model and modify its output. This includes a range of possibilities:
- The tags are different, e.g. X
- Tags are removed or added, e.g. X
- The order of two things needs to be reversed X
- A word needs to be added or removed, e.g. X
Files are apertium-xxx-yyy.xxx-yyy.t*x and apertium-xxx-yyy.yyy-xxx.t*x, where the second language in the file's pair name is the one whose output needs to be controlled. You can read more about transfer rules at A long introduction to transfer rules.
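As a rough illustration of "start from a one-to-one model, then modify the output", here is a Python sketch of one such modification: reversing the order of an adjective and a noun. It is not the transfer rule format used in .t*x files; the lemmas `x_adj`, `x_noun`, `y_adj`, `y_noun`, the tag names, and both function names are hypothetical.

```python
# Toy model of transfer: lexical units are (lemma, tags) pairs.
# Step 1 translates one-to-one; step 2 modifies the result.
# All lemmas and tag names below are made up for illustration.

BILINGUAL = {"x_adj": "y_adj", "x_noun": "y_noun"}


def translate_one_to_one(units):
    """Replace each lemma with its counterpart and keep the tags unchanged."""
    return [(BILINGUAL[lemma], list(tags)) for lemma, tags in units]


def reorder_adj_noun(units):
    """Swap every adjective that is immediately followed by a noun."""
    out = list(units)
    i = 0
    while i < len(out) - 1:
        if "adj" in out[i][1] and "n" in out[i + 1][1]:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


source = [("x_adj", ["adj"]), ("x_noun", ["n", "pl"])]
target = reorder_adj_noun(translate_one_to_one(source))
```

The other possibilities in the list (changing, adding, or removing tags, and adding or removing words) would be further functions applied to the same one-to-one output.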
If the syntax of the two languages is really different, or to handle a few pieces of divergence, you'll want to do syntactic chunking. You may use Helsinki Apertium Workshop/Session 6 and Chunking as guides for how to do chunking. Chunking is rather involved, so it's recommended that you avoid it until absolutely necessary. For closely related languages, you can get fairly good results without resorting to chunking.
Evaluate the pair
There are three main ways to evaluate your language pair, each one telling you a different type of strength or weakness: trimmed coverage, WER, and testvoc. Ideally you should evaluate your pair for all three of these, in both directions (language X → language Y and language Y → language X).
Trimmed coverage tells you what percentage of a corpus can be analysed by your language pair.
For trimmed coverage, you'll need at least one corpus of text in your source language (ideally one in each language, if your pair will support translation in both directions). The corpus or corpora should ideally represent a range of content, i.e., it shouldn't be just sports coverage, economics, or children's stories. Think about what your user base might want to use your language pair to translate.
Your pair should have ≥80% trimmed coverage to be considered stable, and ≥90% trimmed coverage to be considered production ready.
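A minimal Python sketch of the calculation, assuming the corpus has already been run through the pair's analyser and that the output uses Apertium stream format, where each word appears as `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function names, the sample text, and the simplified tags in it are illustrative; escaped characters in the stream are not handled.

```python
import re

# One lexical unit in (simplified) Apertium stream format: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed_text):
    """Percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        _surface, _, analyses = unit.partition("/")
        # Unknown words are assumed to look like ^word/*word$
        if analyses and not analyses.startswith("*"):
            known += 1
    return 100.0 * known / len(units)


def readiness(percent):
    """Apply the thresholds given above: 80% stable, 90% production ready."""
    if percent >= 90:
        return "production ready"
    if percent >= 80:
        return "stable"
    return "not yet stable"


# Made-up analyser output with one unknown word out of four.
sample = "^the/the<det>$ ^cat/cat<n><sg>$ ^blorp/*blorp$ ^sat/sit<vblex><past>$"
percent = coverage(sample)
```

On a real corpus the same count would be taken over all words in the file rather than a single line.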
To learn more about trimmed coverage, see An introduction to trimmed coverage.
WER (Word Error Rate) tells you what percentage of the words of a text need to be changed before a text is production-ready.
To evaluate WER, you need to choose a text in the source language, run it through your language pair, and make a copy of it where you manually fix all the translation mistakes, rendering the text cleanly and accurately in your target language. You then use a script to compare the output of the language pair to your cleaned translation.
Your pair should consistently have around a 10% WER on random texts of sizable length (e.g., 500 words) to be considered production-ready.
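For reference, WER is commonly computed as the word-level edit distance (substitutions, insertions, and deletions) between the system output and the corrected translation, divided by the length of the corrected translation. The Python sketch below follows that common definition; it is not the evaluation script used by Apertium, and the function name `word_error_rate` is an assumption. The comparison is case-sensitive and splits on whitespace only.

```python
def word_error_rate(hypothesis, reference):
    """Word-level edit distance as a percentage of the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            missing_in_hyp = previous[j] + 1     # reference word absent from output
            extra_in_hyp = current[j - 1] + 1    # output word absent from reference
            current.append(min(substitution, missing_in_hyp, extra_in_hyp))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# Raw output versus the manually corrected translation.
wer = word_error_rate("They entered I know answer", "Now I know the answer")
```

Here three edits are needed against a five-word reference, so this toy sentence scores 60%, far from the roughly 10% target.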
You can learn more about performing WER evaluation.
The basic idea of a testvoc is to check that all possible forms in one language have a clean translation (i.e., no *, @, #, etc. symbols representing various types of translation errors) in the other language. Any gaps that exist will come up, and are likely to be systematic. A testvoc helps you identify these gaps and fix them.
Testvoc can be done in several ways: you can test just one part of speech (e.g., nouns) to make sure that all forms are translated correctly; you can test a pair based on the entire transducer of one of the languages (as a source language); or you can run testvoc on a corpus (for problems that might come up because of transfer rules).
Your pair should have an entirely clean testvoc (i.e., all forms are translated without errors) to be considered production-ready.
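The final check of a testvoc run, scanning translations for error symbols, can be sketched in Python as below. This assumes you already have pairs of source form and translated output; producing those pairs is the job of the actual testvoc scripts, which are not shown. The function name `find_unclean` and the sample data are hypothetical, and only the three symbols named above are checked.

```python
# Symbols that mark translation errors in the output (others may exist).
ERROR_MARKERS = ("*", "@", "#")


def find_unclean(translations):
    """Return (source form, offending words) for every unclean translation."""
    problems = []
    for source, target in translations:
        bad = [word for word in target.split() if word.startswith(ERROR_MARKERS)]
        if bad:
            problems.append((source, bad))
    return problems


# Made-up (source form, translated output) pairs.
pairs = [
    ("form1", "house"),
    ("form2", "@form2"),
    ("form3", "#word big"),
]
problems = find_unclean(pairs)
is_clean = not problems
```

Because gaps tend to be systematic, grouping the reported source forms by part of speech or by tag is usually a quick way to see which rule or dictionary entry is missing.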
You can learn more about testvoc and how to perform one.