Apertium New Language Pair HOWTO
This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.
It does not assume any knowledge of linguistics or machine translation beyond being able to distinguish nouns from verbs (and prepositions, etc.).
Apertium is, as you've probably realised by now, a machine translation system. Well, not quite: it's a machine translation platform. It provides an engine and toolbox which allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).
For a more detailed introduction into how it all works, there are some excellent papers on the project's website apertium.sourceforge.net.
You will need
- libxml utils (xmllint etc.)
- a text editor (or a specialized XML editor if you prefer)
This document will not describe how to install these packages. For more information on that, please see the documentation section of the Apertium website.
What does a language pair consist of?
The Apertium machine translation system is of the shallow-transfer type; this basically means it works on dictionaries and shallow transfer rules. Shallow transfer is distinguished from "deep transfer" in that it doesn't do full syntactic parsing: the rules are typically operations on groups of lexical units, rather than operations on parse trees. On a basic level, there are three main dictionaries:
- The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: apertium-sh-en.sh.dix
- The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: apertium-sh-en.en.dix
- Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: apertium-sh-en.sh-en.dix
In a translation pair, both languages can be either source or target for translation; these are relative terms.
There are also two files for transfer rules. These are the rules which govern how words are re-ordered in sentences, e.g. chat noir -> cat black -> black cat. They also govern agreement of gender, number, etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:
- language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: apertium-sh-en.trules-sh-en.xml
- language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: apertium-sh-en.trules-en-sh.xml
Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.
As the file names suggest, this HOWTO will use the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, as the system works better for more closely related languages; furthermore, it does not currently support the full Serbo-Croatian alphabet, but that shouldn't present a problem for the simple examples we'll have here.
A brief note on terms
There are a number of terms that will need to be understood before we continue.
The first is lemma. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is cat. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of was would be be.
The second is symbol. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:
- <n>, for noun.
- <pl>, for plural.
Other examples of symbols are <sg> (singular), <p1> (first person), <pri> (present indicative), etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver comes from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <s> tags.
The third word is paradigm. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflects. In the morphological dictionary, lemmas (see above) are linked to paradigms, which allows us to describe how a given lemma inflects without having to write out all of the endings.
An example of the utility of this is, if we wanted to store the two adjectives happy and lazy, instead of storing two lots of the same thing:
- happy, happ (y, ier, iest)
- lazy, laz (y, ier, iest)
We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy" etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.
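As a rough preview of the happy/lazy idea, here is a minimal sketch in the dictionary format introduced below. The paradigm name happ/y__adj and the symbols adj, comp and sup are assumptions chosen for illustration; they are not taken from an existing dictionary, and each symbol would need its own <sdef>.

```xml
<!-- Sketch only: the paradigm name and the symbols adj, comp and sup
     are illustrative assumptions. The <pardef> belongs inside <pardefs>. -->
<pardef n="happ/y__adj">
  <e><p><l>y</l><r>y<s n="adj"/></r></p></e>
  <e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
  <e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e>
</pardef>

<!-- These entries belong inside a <section>. Each adjective needs only
     one line: its stem plus a pointer to the shared paradigm. -->
<e lm="happy"><i>happ</i><par n="happ/y__adj"/></e>
<e lm="lazy"><i>laz</i><par n="happ/y__adj"/></e>
```

The endings are written once in the paradigm, so adding shy, naughty or friendly would cost one extra entry line each.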
Let's start by making our first source language dictionary. The dictionary is an XML file. Fire up your text editor and type the following:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
</dictionary>

Save the file as apertium-sh-en.sh.dix with an ISO-8859-1 encoding. A short note on encoding: currently (as of April 2007), Apertium only supports the ISO-8859-1 single byte encoding. There is work ongoing to port it to Unicode (indeed an experimental version of lttoolbox with UTF-8 support is available from the SVN repository on the Apertium project site).
Note: It is important to have your locale set up correctly when writing/reading files. You can find out your current locale setting by running echo $LANG from a shell.
So far, the file only declares that we want to start a dictionary. In order for it to be useful, we need to add some more entries. The first is an alphabet, which defines the set of letters that may be used in the dictionary. Normally, for Serbo-Croatian, it would contain all the letters of the Serbo-Croatian alphabet.
However, in our example, it will contain only those letters that can be written in ISO-8859-1.
The reason for this is that, as mentioned above, lttoolbox requires ISO-8859-1 encoding, and Č, Ć, Dž, Đ, Lj, Nj, Š, and Ž (along with their minuscule forms) are not found in this encoding. Some languages have got round this by choosing other characters from ISO-8859-1 to represent the missing letters, and then transliterating. For example, using the character 'ç' (c with cedilla) to represent 'ć' (c with acute accent), or using 'ð' (eth) to represent 'đ' (d with stroke). We will not be using this method, although an example of its use may be found in the Romanian-Spanish translation pair.
Place the alphabet below the <dictionary> tag.
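As a minimal sketch, the reduced alphabet for this example might be written as follows. It assumes the Latin-script Serbo-Croatian alphabet with Č, Ć, Dž, Đ, Lj, Nj, Š and Ž left out, as discussed above; the exact letter list is an assumption, not a copy of an existing dictionary.

```xml
<dictionary>
  <!-- Assumed reduced alphabet: only the letters available in ISO-8859-1.
       The digraphs Dž, Lj and Nj need no separate listing here, since
       their component letters are already present. -->
  <alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>
</dictionary>
```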
Next we need to define some symbols. Let's start off with the simple stuff: noun (n) in singular (sg) and plural (pl).

<sdefs>
  <sdef n="n"/>
  <sdef n="sg"/>
  <sdef n="pl"/>
</sdefs>

The symbol names do not have to be so short. In fact, they could be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.
Unfortunately, it isn't quite so simple: nouns in Serbo-Croatian inflect for more than just number; they also inflect for gender and case. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).
The next thing is to define a section for the paradigms,

<pardefs>
</pardefs>

and a dictionary section:

<section id="main" type="standard">
</section>

There are two types of sections. The first is a standard section, which contains words, enclitics, etc. The second type is an inconditional section, which typically contains punctuation, etc. We don't have an inconditional section here, although it will be demonstrated later.
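For orientation only, an inconditional section holding a single punctuation mark might be sketched as below. The section id "final" and the symbol "sent" are assumptions made for illustration (the symbol would also need its own <sdef>); the entry elements used inside it are explained further down.

```xml
<!-- Hypothetical sketch: the id "final" and the symbol "sent" are
     illustrative assumptions, not part of the example dictionary. -->
<section id="final" type="inconditional">
  <e>
    <p>
      <l>.</l>
      <r>.<s n="sent"/></r>
    </p>
  </e>
</section>
```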
So, our file should now look something like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
  </section>
</dictionary>

Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').
The first thing we need to do, as we have no prior paradigms, is to define a paradigm.
Remember we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:

<pardef n="gramofon__n">
  <e>
    <p>
      <l/>
      <r><s n="n"/><s n="sg"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>i</l>
      <r><s n="n"/><s n="pl"/></r>
    </p>
  </e>
</pardef>

Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.
This may seem like a rather verbose way of describing it, but there are reasons for it and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,
- e, is for entry.
- p, is for pair.
- l, is for left.
- r, is for right.
Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:
- gramofoni (left to right) gramofon<n><pl> (analysis)
- gramofon<n><pl> (right to left) gramofoni (generation)
Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.
The entry to put in will look like:
<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
A quick run down on the abbreviations:
- lm, is for lemma.
- i, is for identity (the left and the right are the same).
- par, is for paradigm.
This entry states the lemma of the word (gramofon), the root (gramofon) and the paradigm with which it inflects (gramofon__n). The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which endings are added. This will become clearer later when we show an entry where the two are different.
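Putting the pieces from the previous steps together, the complete dictionary at this point might look like the sketch below. The <alphabet> line is an assumption (the reduced, ISO-8859-1-only letter set discussed earlier); everything else is assembled from the fragments shown above.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
  <!-- Assumed reduced alphabet; adjust to the letters you actually need. -->
  <alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <!-- Masculine nominative only: an empty ending for singular, "i" for plural. -->
    <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <!-- The lemma and the root coincide for this noun. -->
    <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
  </section>
</dictionary>
```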
We're now ready to test the dictionary. Save it, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc).
$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
This should produce the output:
main@standard 12 12
As we are compiling it left to right, we're producing an analyser. Let's make a generator too.
$ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
At this stage, the command should produce the same output.
We can now test these. Run lt-proc on the analyser.
$ lt-proc sh-en.automorf.bin
Now try it out: type in gramofoni (gramophones), and see the output:

^gramofoni/gramofon<n><pl>$

Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. If you would rather use the more correct term 'record player', we'll explain how to do that later.
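As a hedged sketch of the parts that differ in the English dictionary, the paradigm and entry might look like this. The paradigm name gramophone__n is an assumption that simply follows the naming pattern used for the Serbo-Croatian noun. The surrounding skeleton (<sdefs>, <pardefs>, <section>) stays the same, and the <alphabet> would list the English letters instead.

```xml
<!-- Goes inside <pardefs>: the English plural ending is "s" rather than "i".
     The paradigm name is an illustrative assumption. -->
<pardef n="gramophone__n">
  <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
  <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>

<!-- Goes inside <section id="main" type="standard">. -->
<e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e>
```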
You should now have two files in the directory:
- apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and
- apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.