Apertium New Language Pair HOWTO

This HOWTO document will describe how to start a new language pair for the Apertium machine translation system from scratch.

It does not assume any knowledge of linguistics or machine translation beyond being able to distinguish nouns from verbs (and prepositions etc.).

Introduction

Apertium is, as you've probably realised by now, a machine translation system. Well, not quite: it's a machine translation platform. It provides an engine and a toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. At a basic level, the data consists of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).

For a more detailed introduction to how it all works, there are some excellent papers on the project's website apertium.sourceforge.net.

You will need

  • lttoolbox
  • libxml utils (xmllint etc.)
  • apertium
  • a text editor (or a specialized XML editor if you prefer)

This document will not describe how to install these packages. For more information on that, please see the documentation section of the Apertium website.

What does a language pair consist of?

The Apertium machine translation system is of the shallow-transfer type. This basically means it works on dictionaries and shallow transfer rules. Shallow transfer is distinguished from "deep transfer" in that it doesn't do full syntactic parsing; the rules are typically operations on groups of lexical units, rather than operations on parse trees. On a basic level, there are three main dictionaries:

  1. The morphological dictionary for language xx: this contains the rules of how words in language xx are inflected. In our example this will be called: apertium-sh-en.sh.dix
  2. The morphological dictionary for language yy: this contains the rules of how words in language yy are inflected. In our example this will be called: apertium-sh-en.en.dix
  3. Bilingual dictionary: contains correspondences between words and symbols in the two languages. In our example this will be called: apertium-sh-en.sh-en.dix

In a translation pair, either language can be the source or the target of translation; these are relative terms.

There are also two files for transfer rules. These are the rules which govern how words are re-ordered in sentences, e.g. chat noir -> cat black -> black cat. They also govern agreement of gender, number etc. The rules can also be used to insert or delete lexical items, as will be described later. These files are:

  • language xx to language yy transfer rules: this file contains rules for how language xx will be changed into language yy. In our example this will be: apertium-sh-en.trules-sh-en.xml
  • language yy to language xx transfer rules: this file contains rules for how language yy will be changed into language xx. In our example this will be: apertium-sh-en.trules-en-sh.xml

Many of the language pairs currently available have other files, but we won't cover them here. These files are the only ones required to generate a functional system.

Language pair

As the file names suggest, this HOWTO will use the example of translating Serbo-Croatian to English to explain how to create a basic system. This is not an ideal pair, as the system works better for more closely related languages, and furthermore the system does not currently support the full Serbo-Croatian alphabet, but that shouldn't present a problem for the simple examples we'll have here.

A brief note on terms

There are a number of terms that will need to be understood before we continue.

The first is lemma. A lemma is the citation form of a word. It is the word stripped of any grammatical information. For example, the lemma of the word cats is cat. In English nouns this will typically be the singular form of the word in question. For verbs, the lemma is the infinitive stripped of to, e.g. the lemma of was would be be.

The second is symbol. In the context of the Apertium system, symbol refers to a grammatical label. The word cats is a plural noun, therefore it will have the noun symbol and the plural symbol. In the input and output of Apertium modules these are typically given between angle brackets, as follows:

  • <n> for noun.
  • <pl> for plural.

Other examples of symbols are <sg> for singular, <p1> for first person, <pri> for present indicative etc. When written in angle brackets, the symbols may also be referred to as tags. It is worth noting that in many of the currently available language pairs the symbol definitions are acronyms or contractions of words in Catalan. For example, vbhaver, from vb (verb) and haver ("to have" in Catalan). Symbols are defined in <sdef> tags and used in <s> tags.

The third word is paradigm. In the context of the Apertium system, paradigm refers to an example of how a particular group of words inflects. In the morphological dictionary, lemmas (see above) are linked to paradigms, which allows us to describe how a given lemma inflects without having to write out all of the endings.

An example of the utility of this is, if we wanted to store the two adjectives happy and lazy, instead of storing two lots of the same thing:

  • happy, happ (y, ier, iest)
  • lazy, laz (y, ier, iest)

We can simply store one, and then say "lazy, inflects like happy", or indeed "shy inflects like happy", "naughty inflects like happy", "friendly inflects like happy" etc. In this example, happy would be the paradigm, the model for how the others inflect. The precise description of how this is defined will be explained shortly. Paradigms are defined in <pardef> tags, and used in <par> tags.
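As a rough preview, the happy/lazy idea could be written in dictionary XML along the following lines. The elements used here are explained step by step in the "Getting started" section. The paradigm name happy__adj and the symbols adj, comp and sup are names assumed for this illustration only, and each symbol would need a matching <sdef> declaration.

```xml
<!-- Sketch only: the paradigm name and the symbols adj, comp, sup are assumptions -->
<pardef n="happy__adj">
   <!-- root + y gives the plain adjective, e.g. happy<adj> -->
   <e><p><l>y</l><r>y<s n="adj"/></r></p></e>
   <!-- root + ier gives the comparative, e.g. happy<adj><comp> -->
   <e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
   <!-- root + iest gives the superlative, e.g. happy<adj><sup> -->
   <e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e>
</pardef>

<!-- Each adjective then needs only one line that points at the paradigm -->
<e lm="happy"><i>happ</i><par n="happy__adj"/></e>
<e lm="lazy"><i>laz</i><par n="happy__adj"/></e>
```

Note that the shared part (happ, laz) is shorter than the lemma here, so the paradigm puts the y back on the right-hand side to restore the full lemma in the analysis.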

Getting started

Monolingual dictionaries

Let's start by making our first source language dictionary. The dictionary is an XML file. Fire up your text editor and type the following:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>

</dictionary>

Save the file as apertium-sh-en.sh.dix with an ISO-8859-1 encoding. A short note on encoding: currently (as of April 2007), Apertium only supports the ISO-8859-1 single byte encoding. There is work ongoing to port it to Unicode (indeed an experimental version of lttoolbox with UTF-8 support is available from the SVN repository on the Apertium project site).

Note: It is important to have your locale set up correctly when writing/reading files. You can find out your current locale setting by running echo $LANG from a shell.

So far, the file just declares that we want to start a dictionary. In order for it to be useful, we need to add some more entries. The first is an alphabet, which defines the set of letters that may be used in the dictionary. For Serbo-Croatian it would normally look something like the following, containing all the letters of the Serbo-Croatian alphabet:

<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet>

However in our example, it will look like this:

<alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>

The reason for this is that, as mentioned above, lttoolbox requires ISO-8859-1 encoding, and Č, Ć, Dž, Đ, Lj, Nj, Š, and Ž (along with their minuscule forms) are not found in this encoding. Some languages have got round this by choosing other characters from ISO-8859-1 to represent the missing letters, and then transliterating. For example, using the character 'ç' (c with cedilla) to represent 'ć' (c with acute accent), or using 'ð' (eth) to represent 'đ' (d with stroke). We will not be using this method, although an example of its use may be found in the Romanian-Spanish translation pair.

Place the alphabet below the <dictionary> tag.

Next we need to define some symbols. Let's start off with the simple stuff, noun (n) in singular (sg) and plural (pl).

<sdefs>
   <sdef n="n"/>
   <sdef n="sg"/>
   <sdef n="pl"/>
</sdefs>

The symbol names do not have to be so short. In fact they could be written out in full, but as you'll be typing them a lot, it makes sense to abbreviate.

Unfortunately, it isn't quite so simple: nouns in Serbo-Croatian inflect for more than just number; they also inflect for gender and case. However, we'll assume for the purposes of this example that the noun is masculine and in the nominative case (a full example may be found at the end of this document).

Next thing is to define a section for the paradigms,

<pardefs>

</pardefs>

and a dictionary section:

<section id="main" type="standard">

</section>

There are two types of sections. The first is a standard section, which contains words, enclitics etc. The second type is an inconditional section, which typically contains punctuation etc. We don't have an inconditional section here, although it will be demonstrated later.

So, our file should now look something like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
   <alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>
   <sdefs>
     <sdef n="n"/>
     <sdef n="sg"/>
     <sdef n="pl"/>
   </sdefs>
   <pardefs>

   </pardefs>
   <section id="main" type="standard">

   </section>
</dictionary>

Now we've got the skeleton in place, we can start by adding a noun. The noun in question will be 'gramofon' (which means 'gramophone' or 'record player').

The first thing we need to do, as we have no prior paradigms, is to define a paradigm.

Remember we're assuming masculine gender and nominative case. The singular form of the noun is 'gramofon', and the plural is 'gramofoni'. So:

<pardef n="gramofon__n">
   <e>
     <p>
       <l/>
       <r><s n="n"/><s n="sg"/></r>
     </p>
   </e>
   <e>
     <p>
       <l>i</l>
       <r><s n="n"/><s n="pl"/></r>
     </p>
   </e>
</pardef>

Note: the '<l/>' (equivalent to <l></l>) denotes that there is no extra material to be added to the stem for the singular.

This may seem like a rather verbose way of describing it, but there are reasons for it and it quickly becomes second nature. You're probably wondering what the <e>, <p>, <l> and <r> stand for. Well,

  • e, is for entry.
  • p, is for pair.
  • l, is for left.
  • r, is for right.

Why left and right? Well, the morphological dictionaries will later be compiled into finite state machines. Compiling them left to right produces analyses from words, and from right to left produces words from analyses. For example:

* gramofoni (left to right) gramofon<n><pl> (analysis)
* gramofon<n><pl> (right to left) gramofoni (generation)

Now we've defined a paradigm, we need to link it to its lemma, gramofon. We put this in the section that we've defined.

The entry to put in will look like:

<e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>

A quick run down on the abbreviations:

  • lm, is for lemma.
  • i, is for identity (the left and the right are the same).
  • par, is for paradigm.

This entry states the lemma of the word, gramofon, the root, gramofon, and the paradigm with which it inflects, gramofon__n. The difference between the lemma and the root is that the lemma is the citation form of the word, while the root is the substring of the lemma to which the endings from the paradigm are added. This will become clearer later when we show an entry where the two are different.
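Putting the pieces so far together, apertium-sh-en.sh.dix might now look roughly like this. It is a sketch assembled from the fragments above, using the restricted ASCII alphabet and with the paradigm entries written on one line each to save space.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
   <!-- Restricted alphabet, as discussed above -->
   <alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>
   <sdefs>
     <sdef n="n"/>
     <sdef n="sg"/>
     <sdef n="pl"/>
   </sdefs>
   <pardefs>
     <pardef n="gramofon__n">
       <!-- singular: nothing is added to the root -->
       <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
       <!-- plural: the ending i is added to the root -->
       <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
     </pardef>
   </pardefs>
   <section id="main" type="standard">
     <!-- lemma gramofon, root gramofon, inflects like gramofon__n -->
     <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
   </section>
</dictionary>
```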

We're now ready to test the dictionary. Save it, and then return to the shell. We first need to compile it (with lt-comp), then we can test it (with lt-proc).

$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin

Should produce the output:

main@standard 12 12

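As we are compiling it left to right, we're producing an analyser. Let's make a generator too. A Serbo-Croatian generator is what the English → Serbo-Croatian direction needs, so we give it the en-sh prefix:

$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin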

At this stage, the command should produce the same output.

We can now test these. Run lt-proc on the analyser.

$ lt-proc sh-en.automorf.bin

Now try it out, type in gramofoni (gramophones), and see the output:

^gramofoni/gramofon<n><pl>$

Now, for the English dictionary, do the same thing, but substitute the English word gramophone for gramofon, and change the plural inflection. What if you want to use the more correct term 'record player'? We'll explain how to do that later.
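For comparison, a minimal apertium-sh-en.en.dix could look something like the following sketch. The paradigm name gramophone__n is simply chosen by analogy with gramofon__n and is an assumption; the only real difference from the Serbo-Croatian file is the plural ending s and the full English alphabet.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
   <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
   <sdefs>
     <sdef n="n"/>
     <sdef n="sg"/>
     <sdef n="pl"/>
   </sdefs>
   <pardefs>
     <!-- Paradigm name is an assumption, modelled on gramofon__n -->
     <pardef n="gramophone__n">
       <!-- singular: nothing is added to the root -->
       <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
       <!-- plural: the ending s is added to the root -->
       <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
     </pardef>
   </pardefs>
   <section id="main" type="standard">
     <e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e>
   </section>
</dictionary>
```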

You should now have two files in the directory:

  • apertium-sh-en.sh.dix which contains a (very) basic Serbo-Croatian morphological dictionary, and
  • apertium-sh-en.en.dix which contains a (very) basic English morphological dictionary.

Bilingual dictionary

So we now have two morphological dictionaries, next thing to make is the bilingual dictionary. This describes mappings between words. All dictionaries use the same format (which is specified in the DTD, dix.dtd).

Create a new file, apertium-sh-en.sh-en.dix and add the basic skeleton:

<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
   <alphabet/>
   <sdefs>
     <sdef n="n"/>
     <sdef n="sg"/>
     <sdef n="pl"/>
   </sdefs>

   <section id="main" type="standard">

   </section>
</dictionary>

Now we need to add an entry to translate between the two words. Something like:

<e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>

Because there are a lot of these entries, they're typically written on one line to make the file easier to read. Why 'l' and 'r' again? Well, we compile it left to right to produce the Serbo-Croatian → English dictionary, and right to left to produce the English → Serbo-Croatian dictionary.
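With the entry placed inside the main section, the whole bilingual dictionary would look something like this (a sketch assembled from the skeleton and the entry above):

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
   <alphabet/>
   <sdefs>
     <sdef n="n"/>
     <sdef n="sg"/>
     <sdef n="pl"/>
   </sdefs>

   <section id="main" type="standard">
     <!-- left: Serbo-Croatian lemma and tag; right: English lemma and tag -->
     <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
   </section>
</dictionary>
```

Only the lemma and the noun symbol are listed in the entry; the number symbol is not mentioned on either side.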

So, once this is done, run the following commands:

$ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
$ lt-comp rl apertium-sh-en.en.dix sh-en.autogen.bin

$ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
$ lt-comp rl apertium-sh-en.sh.dix en-sh.autogen.bin

$ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
$ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin

These commands generate the morphological analysers (automorf), the morphological generators (autogen) and the word lookups (autobil; the bil is for "bilingual"). The sh-en prefix marks the files used when translating from Serbo-Croatian to English, so sh-en.autogen.bin is compiled from the English dictionary: it has to generate English word forms.

Transfer rules

So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rule files have their own DTD (transfer.dtd) which can be found in the Apertium package. If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages. For example the one described below would be useful for any null-subject language.

Start out like all the others with a basic skeleton:

<?xml version="1.0" encoding="ISO-8859-1"?>
<transfer>

</transfer>

At the moment, because we're ignoring case, we just need to make a rule that takes the grammatical symbols input and outputs them again.

We first need to define categories and attributes. Categories and attributes both allow us to group grammatical symbols. Categories allow us to group symbols for the purposes of matching (for example 'n.*' is all nouns). Attributes allow us to group a set of symbols that can be chosen from. For example, 'sg' and 'pl' may be grouped as an attribute 'number'.

Let's add the necessary sections:

<section-def-cats>

</section-def-cats>
<section-def-attrs>

</section-def-attrs>

As we're only inflecting nouns in singular and plural, we need to add a category for nouns and an attribute for number. Something like the following will suffice:

Into section-def-cats add:

<def-cat n="nom">
   <cat-item tags="n.*"/>
</def-cat>

This catches all nouns (lemmas followed by <n> then anything) and refers to them as "nom" (we'll see how that's used later).

Into the section section-def-attrs, add:

<def-attr n="nbr">
   <attr-item tags="sg"/>
   <attr-item tags="pl"/>
</def-attr>

and then

<def-attr n="a_nom">
   <attr-item tags="n"/>
</def-attr>

The first defines the attribute nbr (number), which can be either singular (sg) or plural (pl).

The second defines the attribute a_nom (attribute noun).

Next we need to add a section for global variables:

<section-def-vars>

</section-def-vars>

These variables are used to store or transfer attributes between rules. We need only one for now, which goes inside section-def-vars:

<def-var n="number"/>

Finally, we need to add a rule, to take in the noun and then output it in the correct form. We'll need a rules section...

<section-rules>

</section-rules>

Changing the pace from the previous examples, I'll just paste this rule, then go through it, rather than the other way round.

<rule>
   <pattern>
     <pattern-item n="nom"/>
   </pattern>
   <action>
     <out>
       <lu>
         <clip pos="1" side="tl" part="lem"/>
         <clip pos="1" side="tl" part="a_nom"/>
         <clip pos="1" side="tl" part="nbr"/>
       </lu>
     </out>
   </action>
</rule>

The first tag is obvious: it defines a rule. The second tag, pattern, basically says: "apply this rule if this pattern is found". In this example the pattern consists of a single noun (defined by the category item nom). Note that patterns are matched longest-match first. So if you have three rules, where the first catches "<prn><vblex><n>", the second catches "<prn><vblex>" and the third catches "<n>", the pattern matched and the rule executed will be the first.

For each pattern, there is an associated action, which produces an associated output, out. The output is a lexical unit (lu).

The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item.
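Assembled in order, the sections described above would give an apertium-sh-en.trules-sh-en.xml along these lines. This is a sketch built only from the fragments already shown; the comments are added for orientation.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<transfer>
   <section-def-cats>
     <!-- "nom" matches any lexical unit whose tags start with n -->
     <def-cat n="nom">
       <cat-item tags="n.*"/>
     </def-cat>
   </section-def-cats>
   <section-def-attrs>
     <!-- number: singular or plural -->
     <def-attr n="nbr">
       <attr-item tags="sg"/>
       <attr-item tags="pl"/>
     </def-attr>
     <!-- the noun tag itself -->
     <def-attr n="a_nom">
       <attr-item tags="n"/>
     </def-attr>
   </section-def-attrs>
   <section-def-vars>
     <def-var n="number"/>
   </section-def-vars>
   <section-rules>
     <!-- Take a single noun and output its target-language lemma,
          noun tag and number -->
     <rule>
       <pattern>
         <pattern-item n="nom"/>
       </pattern>
       <action>
         <out>
           <lu>
             <clip pos="1" side="tl" part="lem"/>
             <clip pos="1" side="tl" part="a_nom"/>
             <clip pos="1" side="tl" part="nbr"/>
           </lu>
         </out>
       </action>
     </rule>
   </section-rules>
</transfer>
```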

Let's compile it and test it. Transfer rules are compiled with:

$ apertium-preprocess-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin

Which will generate a trules-sh-en.bin file.

Now we're ready to test our machine translation system. There is one crucial part missing, the part-of-speech (PoS) tagger, but that will be explained shortly. In the meantime we can test it as is:

First, let's analyse a word, gramofoni:

$ echo "gramofoni" | lt-proc sh-en.automorf.bin
^gramofoni/gramofon<n><pl>$

Normally the PoS tagger would now choose the right analysis based on the part of speech, but we don't have a PoS tagger yet, so we can use this little Perl script, which just outputs the first analysis retrieved.

$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print'
^gramofon<n><pl>$

Now let's process that with the transfer rule:

$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
  apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin

It will output:

^gramophone<n><pl>$^@

  • 'gramophone' is the target language (side="tl") lemma (lem) at position 1 (pos="1").
  • '<n>' is the target language a_nom at position 1.
  • '<pl>' is the target language attribute of number (nbr) at position 1.

Try commenting out one of these clip statements, recompiling and seeing what happens.

So, now we have the output from the transfer, the only thing that remains is to generate the target-language inflected forms. For this, we use lt-proc, but in generation mode (-g) rather than analysis mode.

$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
  apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
  lt-proc -g sh-en.autogen.bin

gramophones\@

And c'est ça. You now have a machine translation system that translates a Serbo-Croatian noun into an English noun. Obviously this isn't very useful, but we'll get onto the more complex stuff soon. Oh, and don't worry about the '@' symbol, I'll explain that soon too.

Think of a few other words that inflect the same way as gramofon. How about adding those? We don't need to add any paradigms, just the entries in the main section of the monolingual and bilingual dictionaries.
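As one possible example, telefon (plural telefoni) inflects like gramofon, and its English counterpart telephone takes a plain s in the plural. The three fragments below are a sketch of the entries that would be added, one per dictionary file; the English paradigm name gramophone__n is an assumption and should match whatever name you used in your English dictionary.

```xml
<!-- apertium-sh-en.sh.dix, inside the main section -->
<e lm="telefon"><i>telefon</i><par n="gramofon__n"/></e>

<!-- apertium-sh-en.en.dix, inside the main section
     (assumes the English paradigm is called gramophone__n) -->
<e lm="telephone"><i>telephone</i><par n="gramophone__n"/></e>

<!-- apertium-sh-en.sh-en.dix, inside the main section -->
<e><p><l>telefon<s n="n"/></l><r>telephone<s n="n"/></r></p></e>
```

After adding the entries, recompile the dictionaries with lt-comp as before so that the new words are picked up.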