The quick and dirty guide to making a new language pair

From Apertium
Revision as of 07:01, 23 December 2015 by Darkgaia (talk | contribs) (→‎Bootstrap)

This document will describe how to start a new language pair for the Apertium machine translation system from scratch. You can check the list of language pairs that have already been started.

It does not assume any knowledge of linguistics or machine translation beyond being able to distinguish nouns from verbs (and prepositions, etc.). If you're a complete newbie, though, you should probably familiarise yourself with basic Unix commands first.

Introduction

Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).

For a more detailed introduction into how it all works, there are some excellent papers on the Publications page.

These are the key steps one has to take in order to create a new language pair:

  1. Have Apertium installed.
  2. Decide whether each language in your new language pair is better handled by lttoolbox or HFST.
  3. Bootstrap the pair: Create the templates for a new language pair using the new bootstrap module.
  4. Create the morphological dictionary for the first language in the language pair (henceforth known as language xxx). This contains words in the first language for Apertium to recognise.
  5. Create the morphological dictionary for the second language in the language pair (henceforth known as language yyy). This contains words in the second language for Apertium to recognise.
  6. Create the bilingual dictionary for the desired language pair xxx-yyy. This contains correspondences between words and symbols in the two languages.
  7. Create the transfer rules for the language pair. These give Apertium rules to follow during translation. When a pattern that corresponds to one of these rules is detected in the source-language text, Apertium follows the rule to create a grammatical translation in the target language.
  8. Follow Contributing to an existing pair for instructions on expansion and maintenance of the new language pair.


To make this HOWTO guide as straightforward and simple as possible, our example will create a fictional language pair xxx-yyy. We aim to guide you through the steps to create a basic working language pair, before you proceed to Contributing to an existing pair for the more advanced guide. We write this for the complete newbie to Apertium, but assume that you have basic knowledge of operating a Unix terminal.

Note that "xxx" and "yyy" stand for the standard three-letter ISO codes of the languages used. In your language pair, replace these with the appropriate three-letter ISO code from this page: ISO 639-3.

Installation

See also: Installation and Minimal installation from SVN

Apertium is a rule-based machine translation platform that runs on Unix. If you're using Windows, you'll need to download a Virtual Machine. We have a ready-to-use Apertium VirtualBox for Windows users to quickly get started.

Some installation instructions (from Installation) are reproduced here for convenience. Once Apertium is installed, proceed to Minimal installation from SVN to obtain the languages that you need for creating your language pair.

Linux

wget http://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash

Debian-based/Ubuntu/Mint

sudo apt-get -f install apertium-all-dev

RHEL/CentOS/Fedora:

sudo yum install apertium-all-devel

OpenSUSE:

sudo zypper install apertium-all-devel

Mac OS

# Mac OS X:
sudo port install autoconf automake expat flex \
gettext gperf help2man libiconv libtool \
libxml2 libxslt m4 ncurses p5-locale-gettext \
pcre perl5 pkgconfig zlib gawk subversion

Language Properties

See also: lexc, twol, and lttoolbox

Before we start, we need to determine whether the languages you will be using are more suitable for lexc/twol (HFST) or lttoolbox-based processing. "What on earth are those?", you ask. Bear with me a bit.

  • HFST is best for languages with agglutinative morphology and/or complicated morphophonology.
  • lttoolbox is best for languages with fixed paradigms.

Languages with fairly limited morphology (e.g., English) can use either. lttoolbox is the default and more common choice, but if a language has agglutinative morphology, or any amount of productive morphophonology (especially vowel harmony), then it should use HFST.
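
The rule of thumb above can be written down as a tiny decision function. This is only a sketch of the guideline in this section; the function name and its two yes/no parameters are made up for illustration, and real languages may need an expert's judgement.

```python
def choose_analyser(agglutinative, productive_morphophonology):
    """Return the analyser type suggested by the rule of thumb above.

    Both parameters are hypothetical yes/no answers about your language.
    """
    # Agglutinative morphology or productive morphophonology
    # (vowel harmony, for instance) points towards HFST.
    if agglutinative or productive_morphophonology:
        return "hfst"
    # Otherwise lttoolbox, the more common choice, is enough.
    return "lttoolbox"


print(choose_analyser(agglutinative=False, productive_morphophonology=False))
print(choose_analyser(agglutinative=True, productive_morphophonology=True))
```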

If in doubt, please go to IRC and ask an expert.

For our example, we will imagine that xxx is an lttoolbox language, while yyy is an HFST language, so that I can demonstrate the addition of dictionaries for both types of languages.

Bootstrap

Apertium has recently created a new bootstrap script to automatically create templates for new language pair creation. The older method of language-pair creation was, suffice it to say, more tedious.

  1. Get apertium-init from https://raw.githubusercontent.com/goavki/bootstrap/master/apertium-init.py
  2. Check the languages and incubator directories to find out whether the languages that you will be working on already exist in Apertium. In our case, we will be starting from scratch, because our fictional "xxx" and "yyy" languages are not found in Apertium. If your languages can be found in Apertium, check the bootstrap wiki page for specific instructions. Also, remember to create the appropriate template depending on whether your language uses lttoolbox or lexc/twol (HFST).
  3. Create the templates for language xxx (or check out from SVN if it exists).
  4. Create the template for language yyy (or check out from SVN if it exists).
  5. Create the template for language pair xxx-yyy.

To perform step 3 ("xxx" is an lttoolbox language):

python3 apertium-init.py xxx
cd apertium-xxx
./autogen.sh
make -j
cd ..

To perform step 4 ("yyy" is an HFST language):

python3 apertium-init.py yyy --analyser=hfst
cd apertium-yyy
./autogen.sh
make -j
cd ..

To perform step 5:

python3 apertium-init.py xxx-yyy --analyser2=hfst # Only yyy (second language) uses HFST
cd apertium-xxx-yyy
./autogen.sh --with-lang1=../apertium-xxx --with-lang2=../apertium-yyy
make -j
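
If you want to adapt steps 3 to 5 to your own language codes, a small helper can spell out the command sequence for you. The sketch below only builds the command strings and does not run anything; the function name is hypothetical, it assumes the first language uses lttoolbox (as xxx does here), and it uses only the options that appear in the commands above.

```python
def bootstrap_commands(lang1, lang2, lang2_hfst=False):
    """List the shell commands for steps 3 to 5, in order.

    lang1 is assumed to be an lttoolbox language, as xxx is in this guide.
    """
    build = ["./autogen.sh", "make -j", "cd .."]
    mono_flag = " --analyser=hfst" if lang2_hfst else ""
    pair_flag = " --analyser2=hfst" if lang2_hfst else ""
    pair = f"{lang1}-{lang2}"
    return (
        # Step 3: monolingual module for the first language.
        [f"python3 apertium-init.py {lang1}", f"cd apertium-{lang1}"] + build
        # Step 4: monolingual module for the second language.
        + [f"python3 apertium-init.py {lang2}{mono_flag}", f"cd apertium-{lang2}"] + build
        # Step 5: the pair itself, pointed at the two modules built above.
        + [
            f"python3 apertium-init.py {pair}{pair_flag}",
            f"cd apertium-{pair}",
            f"./autogen.sh --with-lang1=../apertium-{lang1} --with-lang2=../apertium-{lang2}",
            "make -j",
        ]
    )


for command in bootstrap_commands("xxx", "yyy", lang2_hfst=True):
    print(command)
```

Note that the directory names are always lower case and joined with hyphens (apertium-xxx, apertium-yyy, apertium-xxx-yyy).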

Monolingual dictionaries

See also: Morphological dictionary, List of dictionaries, and Incubator

Apertium uses dictionaries to identify words in a language. Only then can a word be processed by Apertium's tools.

Remember that you only need to read one of the two sections below, depending on the languages used. If one language in the pair needs lttoolbox and the other needs HFST, though, you'll have to read both.

Lttoolbox modules ("dix" files)

See also: Dix

We shall begin by adding words to the dictionary of the first language "xxx". Open the dix file in the folder apertium-xxx (it should be named "apertium-xxx.xxx.dix"). You will see a few lines:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

</dictionary>

Between the <dictionary> tags, the first thing you will find is the alphabet:

<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet>

Edit the alphabet so that it is appropriate for your language.

Next, you will see some symbol definitions:

<sdefs>
   <sdef n="n"/>
   <sdef n="sg"/>
   <sdef n="pl"/>
</sdefs>

The template is fairly intuitive: "n" means noun, "sg" means singular, and "pl" means plural. Do, however, familiarise yourself with Apertium's standard List of symbols, and be prepared to add more symbols if your language demands it. For example, nouns in Serbo-Croatian inflect for more than just number: they are also inflected for case, and they have a gender. Appropriate symbols will have to be added to the language dictionary to reflect these features.

The next thing you will see is the section for paradigms. Paradigms are 'structures', so to speak, that define the part-of-speech traits (such as noun, verb, transitive, etc.) and the inflected forms of a certain word. Any other word in the dictionary that behaves like that word can then be assigned the same paradigm, to mark that it has the same traits and can be treated in the same way. It's a faster way to describe words without typing the same thing over and over again. You'll be adding more paradigms when you expand the language (and hence the language pair), as many as there are distinct inflection patterns in the language itself.

<pardefs>

</pardefs>

Add the most basic of singular nouns for now. In English, this can be "house".

<pardef n="house__n">
      <e>       <p><l></l>         <r><s n="n"/><s n="sg"/></r></p></e>
    </pardef>

Add the <e>, <p> and <l> tags, then add <r> tags and put the appropriate symbols inside them. The <l> tag holds the ending that a particular inflected form adds, while the base form of the word stays the same. So the basic word "house" can actually have four paradigm entries. Note that the "gen" (genitive) symbol used below has to be declared in <sdefs> as well, like every other symbol.

<pardef n="house__n">
      <e>       <p><l></l>         <r><s n="n"/><s n="sg"/></r></p></e>
      <e>       <p><l>'s</l>       <r><s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
      <e>       <p><l>s</l>        <r><s n="n"/><s n="pl"/></r></p></e>
      <e>       <p><l>s'</l>       <r><s n="n"/><s n="pl"/><s n="gen"/></r></p></e>
    </pardef>

You do not have to concern yourself with the specifics of what the tags <e> and <p> do for now.


Let's now add words to the dictionary. Nouns are the easiest to begin with for most languages. It may also be wise to look at the dix files of other languages to personally observe the format for yourself.

<e lm="abbot"><i>abbot</i><par n="house__n"/></e>

The format is: <e lm="wordhere"><i>wordhere</i><par n="the most appropriate paradigm here"/></e>
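
To see what a word entry and its paradigm add up to, the following sketch reads a miniature dictionary and lists every surface form together with its analysis. It is a rough illustration and not how lttoolbox works internally: the <sdefs> block is left out for brevity, the <section> wrapper and the function name are assumptions for this example, and only the element names shown above are handled.

```python
import xml.etree.ElementTree as ET

# A miniature dictionary: one paradigm and one word that uses it.
DIX = """
<dictionary>
  <pardefs>
    <pardef n="house__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>'s</l><r><s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
      <e><p><l>s'</l><r><s n="n"/><s n="pl"/><s n="gen"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="abbot"><i>abbot</i><par n="house__n"/></e>
  </section>
</dictionary>
"""


def expand(dix_xml):
    """Map every surface form to its analysis (stem plus tags)."""
    root = ET.fromstring(dix_xml)
    # Paradigm name -> list of (surface ending, analysis ending) pairs.
    paradigms = {}
    for pardef in root.iter("pardef"):
        pairs = []
        for p in pardef.iter("p"):
            left, right = p.find("l"), p.find("r")
            tags = "".join(f"<{s.get('n')}>" for s in right.iter("s"))
            pairs.append((left.text or "", (right.text or "") + tags))
        paradigms[pardef.get("n")] = pairs
    forms = {}
    for section in root.iter("section"):
        for entry in section.findall("e"):
            stem = entry.find("i").text
            for ending, analysis in paradigms[entry.find("par").get("n")]:
                forms[stem + ending] = stem + analysis
    return forms


for surface, analysis in expand(DIX).items():
    print(surface, "->", analysis)
```

One entry plus one paradigm therefore covers four word forms, which is why reusing paradigms saves so much typing.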

Once you're done, save the file and compile the module again with make -j

If your second language also uses a dix file, do the same thing, substituting the noun you added to xxx's dictionary with the appropriate translation in yyy.

HFST modules (lexc/twol files)

See also: Starting a new language with HFST

HFST-type languages are more difficult to begin, because of the nature of their more complex morphology.

We shall begin by adding words to the dictionary of the second language "yyy". Open the lexc file in apertium-yyy (it should be named "apertium-yyy.yyy.lexc"). You will see a few lines:

Multichar_Symbols

%<n%>   ! Noun
%<nom%> ! Nominative
%<pl%>  ! Plural

The symbols < and > are reserved in lexc, so we need to escape them with %

Take a look at the Root lexicon definition, which is going to point to a list of stems in the lexicon NounStems. This should already be in the file:


LEXICON Root

NounStems ;

Now let's add our word, using the yyy translation of the noun you added to xxx's dictionary:

LEXICON NounStems

yyy N1 ; ! "xxx"

First we put the stem, then we put the paradigm (or continuation class) that it belongs to, in this case N1, and finally, in a comment (the comment symbol is !) we put the translation.

And define the most basic of inflection, that is, tagging the bare stem with <n> to indicate a noun:

LEXICON N1

%<n%>: # ;

This LEXICON should go before the NounStems lexicon. The # symbol is the end-of-word boundary. It is very important to have this, as it tells the transducer where to stop.
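
As a mental model, you can think of lexc as following continuation classes from Root until it reaches the end-of-word boundary, gluing the pieces together along the way. The sketch below is a simplified model of that idea and not how HFST actually compiles the file; the data structure and function name are invented for illustration.

```python
# Each lexicon holds (analysis side, surface side, continuation) triples,
# mirroring the three lexicons defined above.
LEXICONS = {
    "Root": [("", "", "NounStems")],
    "NounStems": [("yyy", "yyy", "N1")],
    "N1": [("<n>", "", "#")],  # %<n%>: adds a tag and no surface material
}


def paths(lexicon="Root", analysis="", surface=""):
    """Yield (analysis, surface) pairs by following continuation classes."""
    if lexicon == "#":  # end-of-word boundary: the path is complete
        yield analysis, surface
        return
    for upper, lower, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, analysis + upper, surface + lower)


print(list(paths()))
```

Without the # at the end of N1, no path would ever be completed, which is why the boundary matters.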

Now that we're done with our lexc file, save it and make -j. We move on to the twol file now.

The idea of twol is to take the intermediate forms produced by lexc and apply rules to them to turn them into real surface forms. In a language with vowel harmony, for example, this is where a suffix written as -l{A}r in lexc is changed into -lar or -ler.

What we basically want to say is "if the stem contains front vowels, then we want the front vowel alternation, if it contains back vowels then we want the back vowel alternation". And at the same time, remove the morpheme boundary. So let's give it a shot.
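
To make the idea concrete, here is the same logic written as a few lines of Python. It is only an illustration of what the twol rules have to achieve, not twol syntax. The vowel sets and the two example stems are borrowed from Turkish and are not part of this guide, and the use of ">" as the morpheme boundary is an assumption.

```python
# Illustrative vowel classes (Turkish); adjust them for your own language.
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def realise(form):
    """Turn a lexc-style form such as 'kitap>l{A}r' into a surface form."""
    # '>' is assumed to mark the morpheme boundary; partition() drops it.
    stem, _, suffix = form.partition(">")
    # The last vowel of the stem decides the harmony of the suffix.
    vowels = FRONT_VOWELS | BACK_VOWELS
    last_vowel = next((c for c in reversed(stem) if c in vowels), None)
    vowel = "e" if last_vowel in FRONT_VOWELS else "a"
    return stem + suffix.replace("{A}", vowel)


print(realise("kitap>l{A}r"))
print(realise("köpek>l{A}r"))
```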

Enter the twol file apertium-yyy.yyy.twol.

Edit the alphabet so that it is appropriate for your language.

Alphabet
 A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z
 a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z
 %{A%}:a ;

The pre-built template should already have everything set up for you, so make -j and let's move on. Remember to check out Starting a new language with HFST for additional detail.

Completion

You should now have two files:

  • apertium-xxx.xxx.dix which contains a (very) basic xxx morphological dictionary, and
  • apertium-yyy.yyy.lexc which contains a (very) basic yyy morphological dictionary.

Bilingual dictionary

See also: Bilingual dictionary

We now have two morphological dictionaries; the next thing to make is the bilingual dictionary. It describes mappings between words: during translation, each word is translated to the corresponding word specified in this dictionary.

Add an entry to translate between the two words, as specified in the monolingual dictionaries. Something like:

<e><p><l>xxx<s n="n"/></l><r>yyy<s n="n"/></r></p></e>

There are going to be a lot of these entries, so it would be wise to write every entry on one line to facilitate easier reading of the file.
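
Because each entry has a left side and a right side, the same dictionary serves both translation directions. The sketch below shows this with an exact-match lookup table; it is a simplified picture and not Apertium's own lookup, and the <section> wrapper and function names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

BIDIX_SECTION = """
<section id="main" type="standard">
  <e><p><l>xxx<s n="n"/></l><r>yyy<s n="n"/></r></p></e>
</section>
"""


def side_to_string(element):
    """Render an <l> or <r> element as the word followed by its tags."""
    tags = "".join(f"<{s.get('n')}>" for s in element.iter("s"))
    return (element.text or "") + tags


def load(section_xml):
    """Build left-to-right and right-to-left lookup tables."""
    forward, backward = {}, {}
    for p in ET.fromstring(section_xml).iter("p"):
        left = side_to_string(p.find("l"))
        right = side_to_string(p.find("r"))
        forward[left] = right
        backward[right] = left
    return forward, backward


forward, backward = load(BIDIX_SECTION)
print(forward["xxx<n>"])   # xxx to yyy direction
print(backward["yyy<n>"])  # yyy to xxx direction
```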

So, once this is done, run make -j to compile the bilingual dictionary.

You're done with the addition of dictionaries!


Testing of dictionaries

This is a good checkpoint to test if your language pair works, and that you have set up everything correctly up till this point.

Make sure that you are in the folder with the language pair, and type an echo command:

echo xxx | apertium -d . xxx-yyy
echo yyy | apertium -d . yyy-xxx

If this works, great! We move on to the next section: The addition of transfer rules.

Transfer rules

See also: Transfer

So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rules are necessary in sentence-based translation to translate longer chunks of text grammatically. They also help assign the correct tags for the part-of-speech tagger that is necessary in the development of a language pair.

If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages.

Add a rule to take in the nouns and then output it in the correct form.

I will use the example of a language pair English-Malay (eng-zlm) to demonstrate a simple transfer rule. In English, the adjective comes before the noun, such as "happy boy". However, in Malay, the noun comes before the adjective, such as "lelaki gembira" (boy happy). Remember to modify the rule as is appropriate for your language.

Note: You need to add the words that you want to test to the dictionaries first, before you can test them!

<rule comment="RULE: adj nom">  <!--happy boy-->
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <chunk name="j_n" case="caseFirstWord">
            <lu>
              <clip pos="2" side="tl" part="whole"/> <!--pos 2 refers to the second word matched by the pattern ("boy"). It is now placed at the front.-->
            </lu>
            <b/> <!--this outputs the space that separates the two words-->
            <lu>
              <clip pos="1" side="tl" part="whole"/> <!--pos 1 refers to the first word matched by the pattern ("happy"). It is now placed at the back.-->
            </lu>
          </chunk>
        </out>
      </action>
    </rule>

For each pattern, there is an associated action, which produces an associated output, out. Here the output is a chunk that contains two lexical units (lu).

The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item. This can get very complicated, so let's move on. Further elaboration can be found in A long introduction to transfer rules.
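
Stripped of the XML, the rule above says: whenever an adjective is immediately followed by a noun, output the noun first and the adjective second. The sketch below expresses that in Python as a rough picture of the reordering only. The tuples stand in for lexical units that have already been looked up in the bilingual dictionary, and the function name and tag names are assumptions for this example.

```python
def apply_adj_noun_rule(units):
    """Swap every adjective + noun pair; leave everything else in place.

    Each unit is a (word, tag) tuple standing in for a lexical unit.
    """
    out, i = [], 0
    while i < len(units):
        # Pattern: an adjective immediately followed by a noun.
        is_match = (
            i + 1 < len(units)
            and units[i][1] == "adj"
            and units[i + 1][1] == "n"
        )
        if is_match:
            # Action: output position 2 first, then position 1.
            out.extend([units[i + 1], units[i]])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


print(apply_adj_noun_rule([("gembira", "adj"), ("lelaki", "n")]))
```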

Let's compile it with make -j and test it.

Now we're ready to test our machine translation system.

$ echo "happy boy" | apertium -d . eng-zlm
^lelaki/lelaki<n><sg>$ ^gembira/gembira<adj>$


And that's it. You now have a machine translation system that translates words between your two chosen languages. You have also learned how to write a basic transfer rule that translates longer chunks of text grammatically into the target language. Obviously this isn't very useful yet, but you'll get on to the more complex stuff soon.

Think of a few other words that inflect the same as the nouns in your dictionaries, in our case, "xxx" and "yyy". How about adding those? You can link their entries with the paradigm you had written earlier.

Conclusion

See also: Contributing to an existing pair, Quick and dirty guide addendum: other important things

Congratulations on creating your new infant language pair! Your journey is (obviously) not over yet, though! There is a long way to go before the language pair becomes stable. Steps include: adding more words to the dictionaries, adding more transfer rules to accommodate them, training the part-of-speech tagger, bug and error testing, and more. See the next guide as well as Contributing to an existing pair for more advanced instructions on how to expand your language pair.

As usual, feel free to browse the wiki for more resources, or proceed to IRC for help.

We wish you the best of luck in the creation of your new language pair!

See also