The quick and dirty guide to making a new language pair

From Apertium
[[Créer une nouvelle paire de langues|En français]]
 
 
 
{{TOCD}}
'''Apertium New Language Pair HOWTO'''
 
   
This document will describe how to start a new language pair for the Apertium machine translation system from scratch. You can check the [[list of language pairs]] that have already been started.
It does not assume any knowledge of linguistics, or of machine translation beyond being able to distinguish nouns from verbs (and prepositions, etc.). You should probably familiarise yourself with basic Unix commands, though, if you're a complete newbie.

For a more detailed introduction to how it all works, there are some excellent papers on the [[Publications]] page.
'''These are the key steps one has to take in order to create a new language pair:'''

# Have Apertium '''[[#Installation|installed]]'''.
# Decide whether each language in your pair is best implemented with '''[[#Language Properties|lttoolbox or HFST]]'''.
# '''[[#Bootstrap|Bootstrap the pair]]''': create the templates for a new language pair using [http://wiki.apertium.org/wiki/How_to_bootstrap_a_new_pair the new bootstrap module].
# Create the '''[[#Monolingual dictionaries|morphological dictionary]]''' for the first language in the language pair (henceforth known as language xxx). This contains words in the first language for Apertium to recognise.
# Create the '''[[#Monolingual dictionaries|morphological dictionary]]''' for the second language in the language pair (henceforth known as language yyy). This contains words in the second language for Apertium to recognise.
# Create the '''[[#Bilingual dictionary|bilingual dictionary]]''' for the desired language pair xxx-yyy. This contains correspondences between words and symbols in the two languages.
# Create the '''[[#Transfer rules|transfer]]''' rules for the language pair. These give Apertium rules to follow during translation. When a pattern that corresponds to one of these rules is detected in the source-language text, Apertium will apply the rule to produce a grammatical translation in the target language.
# Follow '''[[#Conclusion|Contributing to an existing pair]]''' for instructions on expanding and maintaining the new language pair.
To make this HOWTO guide as straightforward and simple as possible, our example will create a fictional language pair xxx-yyy. We aim to guide you through the steps to create a basic working language pair, before you proceed to [[Contributing to an existing pair]] for the more advanced guide. We write this for the complete newbie to Apertium, but assume that you have basic knowledge of operating a Unix terminal.

Do take note that "xxx" and "yyy" represent the standard three-letter ISO codes for the languages used. In your language pair, replace these with the appropriate three-letter code from this page: [[ISO 639-3]].
==Installation==
{{see-also|Installation|Minimal installation from SVN}}

Apertium is a rule-based machine translation platform that runs on Unix. If you're using Windows, you'll need to download a virtual machine. We have a ready-to-use [[Apertium VirtualBox]] for Windows users to quickly get started.

For full instructions on every platform, see [[Installation]].

Then, proceed to [[How to bootstrap a new pair]] to obtain the languages that you need for the creation of your language pair.
===Linux===
<pre>
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
</pre>

'''Debian-based/Ubuntu/Mint'''
<pre>
sudo apt-get -f install apertium-all-dev
</pre>

'''RHEL/CentOS/Fedora:'''
<pre>
sudo yum install apertium-all-devel
</pre>

'''OpenSUSE:'''
<pre>
sudo zypper install apertium-all-devel
</pre>

===Mac OS===
Follow all of the instructions in [[Installation]]. The documentation for quick Mac installation is no longer functional.
==Language Properties==
{{see-also|lexc|twol|lttoolbox}}

Before we start, we need to determine whether the languages you will be using are more suitable for [[lexc]]/[[twol]] (HFST) or [[lttoolbox]] based processing. "What on earth are those?", you ask. Bear with me a bit.

* HFST is best for languages with agglutinative morphology and/or complicated morphophonology.
* lttoolbox is best for languages with fixed paradigms.

Languages with fairly limited morphology (e.g., English) can use either. lttoolbox is the default and more common choice, but if a language has agglutinative morphology, or any amount of productive morphophonology (especially vowel harmony), then it should use HFST.

If in doubt, please go to [[IRC]] and ask an expert.

For our example, we will imagine that xxx is an lttoolbox language, while yyy is an HFST language, so that I can demonstrate the addition of dictionaries for both types of languages.
 
==Bootstrap==

Apertium has recently created a [http://wiki.apertium.org/wiki/How_to_bootstrap_a_new_pair new bootstrap script] to automatically create templates for new language pair creation. The older method of language-pair creation was, suffice it to say, more tedious.

# Get [[apertium-init]] from https://apertium.projectjj.com/apt/nightly/apertium-init
# Check the [https://github.com/apertium/apertium-languages languages] and [https://github.com/apertium/apertium-incubator incubator] directories to find out if the languages that you will be working on are supported in Apertium. For our case, we will be starting from scratch, because our fictional "xxx" and "yyy" languages are not found in Apertium. Check the [http://wiki.apertium.org/wiki/How_to_bootstrap_a_new_pair bootstrap wiki page] for specific instructions if your languages can be found in Apertium. Also, remember to create the appropriate template depending on whether your language uses [[lttoolbox]] or [[lexc]]/[[twol]] (HFST).
# Create the template for language xxx (or check it out from [[SVN]] if it exists).
# Create the template for language yyy (or check it out from [[SVN]] if it exists).
# Create the template for the language pair xxx-yyy.

'''After performing these steps, you should have three new folders that hold the relevant language template files: apertium-xxx, apertium-yyy, and apertium-xxx-yyy.'''

To perform step 3 ("xxx" is an lttoolbox language):
<pre>
python3 apertium-init.py xxx
cd apertium-xxx
./autogen.sh
make -j
</pre>
   
To perform step 4 ("yyy" is an HFST language):
<pre>
python3 apertium-init.py yyy --analyser=hfst
cd apertium-yyy
./autogen.sh
make -j
</pre>
 
To perform step 5:
<pre>
python3 apertium-init.py xxx-yyy --analyser2=hfst # Only yyy (second language) uses HFST
cd apertium-xxx-yyy
./autogen.sh --with-lang1=../apertium-xxx --with-lang2=../apertium-yyy
make -j
</pre>
 
==Monolingual dictionaries==

Apertium uses dictionaries to identify words in a language. Only then can a word be processed by Apertium's tools.

''Remember that you only need to look at one of the two sections below, depending on the languages used. Unless, of course, one language in the pair needs lttoolbox and the other needs HFST, in which case you'll have to read both.''

===Lttoolbox modules ("monodix" files)===
{{see-also|Monodix basics}}

We shall begin by adding words to the dictionary of the first language "xxx". Open the dix file in the folder apertium-xxx (it should be named "apertium-xxx.xxx.dix"). You will see a few lines:
 
<pre>
<?xml version="1.0" encoding="UTF-8"?>
...
</pre>
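For reference, a freshly bootstrapped monodix typically has a skeleton along these lines (a sketch; the exact contents generated by apertium-init may differ slightly):

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
  </section>
</dictionary>
</pre>

The <code>&lt;sdefs&gt;</code> declare the grammatical symbols, the <code>&lt;pardefs&gt;</code> hold inflection paradigms, and the main <code>&lt;section&gt;</code> holds the word entries themselves.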
 
The format is: <code><nowiki><e lm="wordhere"><i>wordhere</i><the most appropriate paradigm here></e></nowiki></code>

Once you're done, save it and compile the folder again with <code>make -j</code>.

You can test that the new word works with the lt-proc subprogram:

<pre>
$ echo "wordhere" | lt-proc xxx.automorf.bin
</pre>
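To make the entry format concrete, here is a sketch of a complete noun entry, using a hypothetical s-plural noun "house" (the paradigm name <code>house__n</code> is illustrative, and the tags <code>n</code>, <code>sg</code> and <code>pl</code> must be declared as <code>&lt;sdef&gt;</code>s near the top of the file):

<pre>
<!-- in the <pardefs> section: -->
<pardef n="house__n">
  <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
  <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>

<!-- in the main <section>: -->
<e lm="house"><i>house</i><par n="house__n"/></e>
</pre>

The empty <code>&lt;l/&gt;</code> means the singular adds nothing to the stem, while <code>&lt;l&gt;s&lt;/l&gt;</code> adds "s" for the plural, so the compiled analyser recognises both "house" and "houses".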
Or using the apertium interface to analysis:

<pre>
$ echo "wordhere" | apertium -d . xxx-morph
</pre>

If your second language also uses a dix file, do the same thing, substituting the noun you added to xxx's dictionary with the appropriate translation in yyy.
===HFST modules (lexc/twol files)===
{{see-also|Starting a new language with HFST}}

HFST-type languages are more difficult to begin with, because of the nature of their more complex morphology.

We shall begin by adding words to the dictionary of the second language "yyy". Open the lexc file in apertium-yyy (it should be named "apertium-yyy.yyy.lexc"). You will see a few lines:
<pre>
Multichar_Symbols

%<n%> ! Noun
%<nom%> ! Nominative
%<pl%> ! Plural
</pre>
The symbols <code>&lt;</code> and <code>&gt;</code> are reserved in <code>lexc</code>, so we need to escape them with <code>%</code>.

Take a look at the <code>Root</code> lexicon definition, which is going to point to a list of stems in the lexicon <code>NounStems</code>. This should already be in the file:

<pre>
LEXICON Root

NounStems ;
</pre>

Now let's add our word, substituting the noun you added to xxx's dictionary with the appropriate translation in yyy:
<pre>
LEXICON NounStems

yyy N1 ; ! "xxx"
</pre>

First we put the stem, then we put the ''paradigm'' (or ''continuation class'') that it belongs to, in this case <code>N1</code>, and finally, in a comment (the comment symbol is <code>!</code>), we put the translation.
 
And define the most basic of inflection, that is, tagging the bare stem with <code>&lt;n&gt;</code> to indicate a noun:

<pre>
LEXICON N1

%<n%>: # ;
</pre>
This <code>LEXICON</code> should go ''before'' the <code>NounStems</code> lexicon. The <code>#</code> symbol is the end-of-word boundary. It is very important to have this, as it tells the transducer where to stop.

Now that we're done with our lexc file, save it and <code>make -j</code>. We move on to the twol file now.
The idea of <code>twol</code> is to take the surface forms produced by lexc and apply rules to them to change them into real surface forms. So, this is where we change ''-l{A}r'' into ''-lar'' or ''-ler''.

What we basically want to say is "if the stem contains front vowels, then we want the front vowel alternation; if it contains back vowels, then we want the back vowel alternation", and at the same time remove the morpheme boundary. So let's give it a shot.

Enter the twol file <code>apertium-yyy.yyy.twol</code>.

Edit the alphabet to be appropriate for your language:
<pre>
Alphabet

A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z
a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z

%{A%}:a ;
</pre>

The pre-built template should already have everything set up for you, so <code>make -j</code> and let's move on.
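For a flavour of what a harmony rule looks like, here is a sketch of a back-harmony rule for the archiphoneme <code>%{A%}</code> declared above (the set name and vowel inventory are illustrative; see [[Starting a new language with HFST]] for worked examples):

<pre>
Sets

BackVow = a o u ;

Rules

"Archiphoneme {A} surfaces as a after back vowels"
%{A%}:a <=> :BackVow :* _ ;
</pre>

Read this as: the lexical symbol <code>%{A%}</code> is realised as surface <code>a</code> if and only if a surface back vowel appears somewhere earlier in the word.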
You can test that your newly added word works with hfst-proc:

<pre>
$ echo "word" | hfst-proc yyy.automorf.hfst.ol
</pre>
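Assuming the stem "yyy" added in the lexc file above, a successful analysis prints Apertium stream format, something like:

<pre>
$ echo "yyy" | hfst-proc yyy.automorf.hfst.ol
^yyy/yyy<n>$
</pre>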
Or using the apertium interface to analysis:

<pre>
$ echo "wordhere" | apertium -d . yyy-morph
</pre>

Remember to check out [[Starting a new language with HFST]] for additional detail.

===Summary===
 
You should now have a module (directory) for each language. Each module should include one of the following user-edited files:

* apertium-xxx.xxx.dix, which contains a (very) basic xxx morphological dictionary, ''or''
* apertium-xxx.xxx.lexc, which contains a (very) basic xxx morphological dictionary.
<pre>
 
$ lt-comp lr apertium-hbs-eng.hbs-eng.dix hbs-eng.autobil.bin
 
main@standard 18 18
 
$ lt-comp rl apertium-hbs-eng.hbs-eng.dix eng-hbs.autobil.bin
 
main@standard 18 18
 
</pre>
 
Now to test:
 
<pre>
 
$ echo "vidim" | lt-proc hbs-eng.automorf.bin | \
 
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
 
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin
 
   
  +
==Bilingual dictionary==
^see<vblex><pri><p1><sg>$^@
 
</pre>
 
We get the analysis passed through correctly, but when we try to generate a surface form from this, we get a '#', as below:
 
<pre>
 
$ echo "vidim" | lt-proc hbs-eng.automorf.bin | \
 
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
 
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin | \
 
lt-proc -g hbs-eng.autogen.bin
 
#see\@
 
</pre>
 
This '#' means that the generator cannot produce the required surface form because its dictionary does not contain it. Why is this?
 
 
Basically, the analyses don't match: the 'see' in the dictionary is see<vblex><pri>, but the 'see' delivered by the transfer is see<vblex><pri><p1><sg>. The Serbo-Croatian side carries more information than the English side requires. You can test this by adding the missing symbols to the English dictionary, then recompiling and testing again.
 
   
However, a more paradigmatic way of taking care of this is by writing a rule. So, we open up the rules file (<code>apertium-hbs-eng.hbs-eng.t1x</code>, in case you forgot).
 
   
We need to add a new category for 'verb'.
 
 
<pre>
<def-cat n="vrb">
<cat-item tags="vblex.*"/>
</def-cat>
</pre>
We also need to add attributes for tense and for person. We'll make it really simple for now, you can add p2 and p3, but I won't in order to save space.
 
<pre>
 
<def-attr n="temps">
 
<attr-item tags="pri"/>
 
</def-attr>
 
   
<def-attr n="pers">
 
<attr-item tags="p1"/>
 
</def-attr>
 
</pre>
 
We should also add an attribute for verbs.
 
<pre>
 
<def-attr n="a_verb">
 
<attr-item tags="vblex"/>
 
</def-attr>
 
</pre>
 
Now onto the rule:
 
<pre>
 
<rule>
  <pattern>
    <pattern-item n="vrb"/>
  </pattern>
  <action>
    <out>
      <lu>
        <clip pos="1" side="tl" part="lem"/>
        <clip pos="1" side="tl" part="a_verb"/>
        <clip pos="1" side="tl" part="temps"/>
      </lu>
    </out>
  </action>
</rule>
 
</pre>
 
Remember how, when you commented out the 'clip' tags in the previous rule example, they disappeared from the transfer output? Well, that's pretty much what we're doing here. We take in a verb with a full analysis, but only output a partial analysis (lemma + verb tag + tense tag).
 
   
So now, if we recompile that, we get:
 
<pre>
 
$ echo "vidim" | lt-proc hbs-eng.automorf.bin | \
 
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
 
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin
 
^see<vblex><pri>$^@
 
</pre>
 
and:
 
<pre>
 
$ echo "vidim" | lt-proc hbs-eng.automorf.bin | \
 
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
 
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin | \
 
lt-proc -g hbs-eng.autogen.bin
 
see\@
 
</pre>
 
Try it with 'vidimo' (we see) to see if you get the correct output.
 
   
Now try it with "vidim gramofone":
 
<pre>
 
$ echo "vidim gramofone" | lt-proc hbs-eng.automorf.bin | \
 
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
 
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin | \
 
lt-proc -g hbs-eng.autogen.bin
 
see gramophones\@
 
</pre>
 
   
==But what about personal pronouns?==
 
   
Well, that's great, but we're still missing the personal pronoun that is necessary in English. In order to add it in, we first need to edit the English morphological dictionary.
 
   
As before, the first thing to do is add the necessary symbols:
 
<pre>
 
<sdef n="prn"/>
 
<sdef n="subj"/>
 
</pre>
 
Of the two symbols, prn is pronoun, and subj is subject (as in the subject of a sentence).
 
   
Because there is no root, or 'lemma', for personal subject pronouns, we just add the pardef as follows:
 
<pre>
 
<pardef n="prsubj__prn">
 
<e><p><l>I</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r></p></e>
 
</pardef>
 
</pre>
 
With 'prsubj' being 'personal subject'. The rest of them (You, We etc.) are left as an exercise to the reader.
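To give one of them away (a sketch that just follows the same pattern; this entry is not in the guide's original dictionary), 'we' would look like:

<pre>
<e><p><l>we</l><r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r></p></e>
</pre>

The only differences from the 'I' entry are the surface form on the left and the number symbol on the right.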
 
   
We can add an entry to the main section as follows:
 
 
<pre>
<e lm="personal subject pronouns"><i/><par n="prsubj__prn"/></e>
</pre>
 
So, save, recompile and test, and we should get something like:
 
<pre>
$ echo "I" | lt-proc eng-hbs.automorf.bin
^I/PRPERS<prn><subj><p1><sg>$
</pre>
   
(Note: it's in capitals because 'I' is in capitals).
 
   
Now we need to amend the 'verb' rule to output the subject personal pronoun along with the correct verb form.
 
   
First, add a category (this must be getting pretty pedestrian by now):
 
<pre>
 
<def-cat n="prpers">
 
<cat-item lemma="prpers" tags="prn.*"/>
 
</def-cat>
 
</pre>
 
Now add the types of pronoun as attributes. We might as well add the 'obj' type while we're at it, although we won't need it for now:
 
<pre>
 
<def-attr n="tipus_prn">
 
<attr-item tags="prn.subj"/>
 
<attr-item tags="prn.obj"/>
 
</def-attr>
 
</pre>
 
And now to input the rule:
 
<pre>
 
<rule>
  <pattern>
    <pattern-item n="vrb"/>
  </pattern>
  <action>
    <out>
      <lu>
        <lit v="prpers"/>
        <lit-tag v="prn"/>
        <lit-tag v="subj"/>
        <clip pos="1" side="tl" part="pers"/>
        <clip pos="1" side="tl" part="nbr"/>
      </lu>
      <b/>
      <lu>
        <clip pos="1" side="tl" part="lem"/>
        <clip pos="1" side="tl" part="a_verb"/>
        <clip pos="1" side="tl" part="temps"/>
      </lu>
    </out>
  </action>
</rule>
 
</pre>
 
This is pretty much the same rule as before, with only a couple of small changes.
 
   
We needed to output:
 
<pre>
 
^prpers<prn><subj><p1><sg>$ ^see<vblex><pri>$
 
</pre>
 
so that the generator could choose the right pronoun and the right form of the verb.
 
   
So, a quick rundown:
 
   
* <code><lit></code>, prints a literal string, in this case "prpers"
 
* <code><lit-tag></code>, prints a literal tag; because we can't get these tags from the verb, we add them ourselves: "prn" for pronoun, and "subj" for subject.
 
* <code><b/></code>, prints a blank, a space.
 
   
Note that we retrieved the information for person, number and tense directly from the verb.
 
 
So, now if we recompile and test that again:
 
 
<pre>
$ echo "vidim gramofone" | lt-proc hbs-eng.automorf.bin | \
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin | \
lt-proc -g hbs-eng.autogen.bin
I see gramophones
</pre>
Which, while it isn't exactly prize-winning prose (much like this HOWTO), is a fairly accurate translation.
 
   
==So tell me about the record player (Multiwords)==
 
   
While gramophone is an English word, it isn't the best translation. 'Gramophone' is typically used for the very old kind, you know, with a needle instead of a stylus, and no powered amplification. A better translation would be 'record player'. Although this is more than one word, we can treat it as one word by using multiword (multipalabra) constructions.
 
   
We don't need to touch the Serbo-Croatian dictionary, just the English one and the bilingual one, so open them up.
 
   
The plural of 'record player' is 'record players', so it takes the same paradigm as gramophone (gramophone__n) — in that we just add 's'. All we need to do is add a new element to the main section.
 
<pre>
 
<e lm="record player"><i>record<b/>player</i><par n="gramophone__n"/></e>
 
</pre>
 
The only thing different about this is the use of the <b/> tag, although this isn't entirely new as we saw it in use in the rules file.
 
   
So, recompile and test in the orthodox fashion:
 
 
<pre>
$ echo "vidim gramofone" | lt-proc hbs-eng.automorf.bin | \
gawk 'BEGIN{RS="$"; FS="/";}{nf=split($1,COMPONENTS,"^"); for(i = 1; i<nf; i++) printf COMPONENTS[i]; if($2 != "") printf("^%s$",$2);}' | \
apertium-transfer apertium-hbs-eng.hbs-eng.t1x hbs-eng.t1x.bin hbs-eng.autobil.bin | \
lt-proc -g hbs-eng.autogen.bin
I see record players
</pre>
Perfect. A big benefit of using multiwords is that you can translate idiomatic expressions verbatim, without having to do word-by-word translation. For example, the English phrase "at the moment" would be translated into Serbo-Croatian as "trenutno" (trenutak = ''moment'', trenutno being its adverb) &mdash; it would not be possible to translate this English phrase word-by-word into Serbo-Croatian.
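For instance, such an entry in the bilingual dictionary might look like the following. This is a hypothetical sketch: the adverb tag and the pairing are illustrative, not taken from the actual hbs-eng data (and the multiword would also need a matching entry in the English monolingual dictionary):

<pre>
<e><p><l>trenutno<s n="adv"/></l><r>at<b/>the<b/>moment<s n="adv"/></r></p></e>
</pre>

The <b/> tags stand for the spaces inside the multiword, just as in the 'record player' entry above.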
 
   
==Dealing with minor variation==
 
   
Serbo-Croatian is an umbrella term for several standard languages, so there are differences in pronunciation and orthography. It has a neat phonetic writing system, so you write as you speak. A notable example is the reflex of the Proto-Slavic vowel ''yat'': the word for dictionary, for instance, can be either "rječnik" (called Ijekavian) or "rečnik" (called Ekavian).
 
   
===Analysis===
 
   
There is a fairly easy way of dealing with this, using paradigms again. Paradigms aren't only used for adding grammatical symbols; they can also be used to replace any character/symbol with another. For example, here is a paradigm for accepting both "e" and "je" in analysis. As with the others, the paradigm should go into the monolingual dictionary for Serbo-Croatian.
 
 
<pre>
 
<pardef n="e_je__yat">
  <e><p><l>e</l><r>e</r></p></e>
  <e><p><l>je</l><r>e</r></p></e>
</pardef>
 
</pre>
 
 
Then in the "main section":
 
 
<pre>
 
<e lm="rečnik"><i>r</i><par n="e_je__yat"/><i>čni</i><par n="rečni/k__n"/></e>
 
</pre>
 
   
This only allows us to ''analyse'' both forms, however; more work is necessary if we want to generate both forms.
 
   
===Generation===
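lttoolbox lets you restrict an entry to one direction of translation with the r attribute: r="LR" means the entry is only used left-to-right (analysis), and r="RL" only right-to-left (generation). Here is a sketch (the restriction mechanism is standard lttoolbox, but this revised paradigm is illustrative, not from the original guide) that analyses both forms while generating only the Ekavian one:

<pre>
<pardef n="e_je__yat">
  <e>        <p><l>e</l><r>e</r></p></e>
  <e r="LR"> <p><l>je</l><r>e</r></p></e>
</pardef>
</pre>

When compiled with lt-comp lr (the analyser), both entries are included; when compiled with lt-comp rl (the generator), the r="LR" entry is skipped, so only "rečnik" is ever generated.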
 
   
==Notes==
 
<references/>
 
   
 
==See also==

*[[A long introduction to transfer rules]]
*[[Apertium VirtualBox]]
*[[Bilingual dictionary]]
*[[Building dictionaries]]
*[[Chunking]]
*[[Contributing to an existing pair]]
*[[Cookbook]]
*[[Finding errors in dictionaries]]
*[[How to bootstrap a new pair]]
*[[Installation]]
*[[lexc]]
*[[List of language pairs]]
*[[lttoolbox]]
*[[Morphological dictionary]]
*[[Quick and dirty guide addendum: other important things]]
*[[Starting a new language with HFST]]
*[[Starting a new language with lttoolbox]]
*[[twol]]

[[Category:Documentation in English]]

Latest revision as of 20:58, 2 April 2021

This document will describe how to start a new language pair for the Apertium machine translation system from scratch. You can check the list of language pairs that have already been started.

It does not assume any knowledge of linguistics, or machine translation above the level of being able to distinguish nouns from verbs (and prepositions etc.) You should probably familiarise yourself with basic Unix commands, though, if you're a complete newbie.

Introduction

Apertium is, as you've probably realised by now, a machine translation system. Well, not quite, it's a machine translation platform. It provides an engine and toolbox that allow you to build your own machine translation systems. The only thing you need to do is write the data. The data consists, on a basic level, of three dictionaries and a few rules (to deal with word re-ordering and other grammatical stuff).

For a more detailed introduction into how it all works, there are some excellent papers on the Publications page.

These are the key steps one has to take in order to create a new language pair:

  1. Have Apertium installed.
  2. Decide whether each language in your pair is best implemented with lttoolbox or HFST.
  3. Bootstrap the pair: Create the templates for a new language pair using the new bootstrap module.
  4. Create the morphological dictionary for the first language in the language pair (henceforth known as language xxx). This contains words in the first language for Apertium to recognise.
  5. Create the morphological dictionary for the second language in the language pair (henceforth known as language yyy). This contains words in the second language for Apertium to recognise.
  6. Create the bilingual dictionary for the desired language pair xxx-yyy. This contains correspondences between words and symbols in the two languages.
  7. Create the transfer rules for the language pair. This gives Apertium rules to follow during translation of the language pair. When a pattern corresponding to one of these rules is detected in the source language text, Apertium will follow that transfer rule to produce a grammatical translation in the target language.
  8. Follow Contributing to an existing pair for instructions on expansion and maintenance of the new language pair.


To make this HOWTO guide as straightforward and simple as possible, our example will create a fictional language pair xxx-yyy. We aim to guide you through the steps to create a basic working language pair, before you proceed to Contributing to an existing pair for the more advanced guide. We write this for the complete newbie to Apertium, but assume that you have basic knowledge of operating a Unix terminal.

Do take note that "xxx" and "yyy" represent the standard three-letter ISO codes for the languages used. In your language pair, replace these with the appropriate three-letter code from this page: ISO 639-3.

Installation

Apertium is a rule-based machine translation platform that runs on Unix. If you're using Windows, you'll need to run it in a virtual machine. We have a ready-to-use Apertium VirtualBox image for Windows users to get started quickly.

For full instructions, on every platform, see Installation.

Then, proceed to How to bootstrap a new pair to obtain the languages that you need for creation of your language pair.

Linux

curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash

Debian-based/Ubuntu/Mint:

sudo apt-get -f install apertium-all-dev

RHEL/CentOS/Fedora:

sudo yum install apertium-all-devel

OpenSUSE:

sudo zypper install apertium-all-devel

Mac OS

Follow all of the instructions in Installation. The documentation for quick Mac installation is no longer functional.

Language Properties

See also: lexc, twol, and lttoolbox

Before we start, we need to determine whether the languages you will be using are more suitable for lexc/twol (HFST) or lttoolbox-based processing. "What on earth are those?", you ask. Bear with me a bit.

  • HFST is best for languages with agglutinative morphology and/or complicated morphophonology.
  • lttoolbox is best for languages with fixed paradigms.

Languages with fairly limited morphology (e.g., English) can use either. lttoolbox is the default and more common choice, but if a language has agglutinative morphology, or any amount of productive morphophonology (especially vowel harmony), then it should use HFST.

If in doubt, please go to IRC and ask an expert.

For our example, we will imagine that xxx is an lttoolbox language, while yyy is an HFST language, so that I can demonstrate the addition of dictionaries for both types of languages.

Bootstrap

Apertium has recently created a new bootstrap script to automatically create templates for new language pair creation. The older method of language-pair creation was, suffice it to say, more tedious.

  1. Get apertium-init from https://apertium.projectjj.com/apt/nightly/apertium-init
  2. Check the languages and incubator directories to find out if the languages that you will be working on are supported in Apertium. For our case, we will be starting from scratch, because our fictional "xxx" and "yyy" languages are not found in Apertium. Check the bootstrap wiki page for specific instructions if your languages can be found in Apertium. Also, remember to create the appropriate template depending if your language is lttoolbox or lexc/twol (HFST).
  3. Create the templates for language xxx (or check out from SVN if it exists).
  4. Create the template for language yyy (or check out from SVN if it exists).
  5. Create the template for language pair xxx-yyy.

After performing these steps, you should have three new folders that hold the relevant language template files: apertium-xxx, apertium-yyy, and apertium-xxx-yyy.

To perform step 3 ("xxx" is an lttoolbox language):

python3 apertium-init.py xxx
cd apertium-xxx
./autogen.sh
make -j
cd ..

To perform step 4 ("yyy" is a HFST language):

python3 apertium-init.py yyy --analyser=hfst
cd apertium-yyy
./autogen.sh
make -j
cd ..

To perform step 5:

python3 apertium-init.py xxx-yyy --analyser2=hfst # Only yyy (second language) uses HFST
cd apertium-xxx-yyy
./autogen.sh --with-lang1=../apertium-xxx --with-lang2=../apertium-yyy
make -j

Monolingual dictionaries

See also: Morphological dictionary, List of dictionaries, and Incubator

Apertium uses dictionaries to identify words in a language. Only then can a word be processed by Apertium's tools.

Remember that you only need to read one of the two sections below, depending on the languages used. Unless, of course, one language in the pair needs lttoolbox and the other needs HFST, in which case you'll have to read both of them.

Lttoolbox modules ("monodix" files)

See also: Monodix basics

We shall begin by adding words to the dictionary of the first language "xxx". Open the dix file in the folder apertium-xxx (it should be named "apertium-xxx.xxx.dix"). You will see a few lines:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>

</dictionary>

<alphabet>ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž</alphabet>

Edit the alphabet as appropriate for your language.

Next, you will see some symbol definitions:

<sdefs>
   <sdef n="n"/>
   <sdef n="sg"/>
   <sdef n="pl"/>
</sdefs>

The template is very intuitive: "n" means noun, "sg" means singular, and "pl" means plural. Do, however, familiarise yourself with Apertium's standard List of symbols, and be prepared to add more symbols if your language demands them. For example, nouns in Serbo-Croatian inflect for more than just number: they are also inflected for case, and have a gender. Appropriate symbols would have to be added to the language's dictionary to reflect these features.
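For example, here is a sketch of how the symbol inventory could grow for a language with gender and case. These particular symbols come from the standard list, but this block is an illustration, not part of the template:

<sdefs>
   <sdef n="n"/>   <!-- noun -->
   <sdef n="sg"/>  <!-- singular -->
   <sdef n="pl"/>  <!-- plural -->
   <sdef n="m"/>   <!-- masculine -->
   <sdef n="f"/>   <!-- feminine -->
   <sdef n="nom"/> <!-- nominative case -->
   <sdef n="acc"/> <!-- accusative case -->
</sdefs>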

The next thing you will see is the section for paradigms. Paradigms are 'structures', so to say, that define the part-of-speech and inflection traits (noun, verb, transitive, etc.) of a certain word. Any further word in the dictionary that behaves like a word already defined can then be assigned the same paradigm, to mark that it has the same traits and can be treated in the same way. This is a fast way to assign part-of-speech traits to words without typing the same material over and over again. You'll add more paradigms as you expand the language (and hence the language pair): as many as there are distinct inflection patterns in the language itself.

<pardefs>

</pardefs>

Add the most basic of singular nouns for now. In English, this can be "house".

<pardef n="house__n">
      <e>       <p><l></l>         <r><s n="n"/><s n="sg"/></r></p></e>
    </pardef>

Add the e, p and l tags, then add r tags and fit the appropriate defining symbols inside them. The <l> tags hold the endings for the different inflections of the word (the base form of the word essentially stays the same). So the basic word "house" can actually have four paradigm entries.

<pardef n="house__n">
      <e>       <p><l></l>         <r><s n="n"/><s n="sg"/></r></p></e>
      <e>       <p><l>'s</l>       <r><s n="n"/><s n="sg"/><s n="gen"/></r></p></e>
      <e>       <p><l>s</l>        <r><s n="n"/><s n="pl"/></r></p></e>
      <e>       <p><l>s'</l>       <r><s n="n"/><s n="pl"/><s n="gen"/></r></p></e>
    </pardef>

You do not have to concern yourself with the specifics of what the tags <e> and <p> do for now.


Let's now add words to the dictionary. Nouns are the easiest to begin with for most languages. It may also be wise to look at the dix files of other languages to personally observe the format for yourself.

<e lm="abbot"><i>abbot</i><par n="house__n"/></e>

The format is: <e lm="wordhere"><i>wordhere</i><par n="most_appropriate_paradigm"/></e>

Once you're done, save it and compile the folder again with make -j

You can test that the new word works with the lt-proc subprogram.

$ echo "wordhere" | lt-proc xxx.automorf.bin

Or using the apertium interface to analysis:

$ echo "wordhere" | apertium -d . xxx-morph

If your second language also uses a dix file, do the same thing, substituting the noun you added to xxx's dictionary with the appropriate translation in yyy.

HFST modules (lexc/twol files)

See also: Starting a new language with HFST

HFST-type languages are more difficult to get started with, because of their more complex morphology.

We shall begin by adding words to the dictionary of the second language "yyy". Open the lexc file in apertium-yyy (it should be named "apertium-yyy.yyy.lexc"). You will see a few lines:

Multichar_Symbols

%<n%>   ! Noun
%<nom%> ! Nominative
%<pl%>  ! Plural

The symbols < and > are reserved in lexc, so we need to escape them with %.

Take a look at the Root lexicon definition, which is going to point to a list of stems in the lexicon NounStems. This should already be in the file:


LEXICON Root

NounStems ;

Now let's add our word, substituting the noun you added to xxx's dictionary with the appropriate translation in yyy:

LEXICON NounStems

yyy N1 ; ! "xxx"

First we put the stem, then we put the paradigm (or continuation class) that it belongs to, in this case N1, and finally, in a comment (the comment symbol is !) we put the translation.

And define the most basic of inflection, that is, tagging the bare stem with <n> to indicate a noun:

LEXICON N1

%<n%>: # ;

This LEXICON should go before the NounStems lexicon. The # symbol is the end-of-word boundary. It is very important to have this, as it tells the transducer where to stop.

Now that we're done with our lexc file, save it and make -j. We move on to the twol file now.

The idea of twol is to take the morphotactic forms produced by lexc and apply rules to them to turn them into the real surface forms. So, this is where we change -l{A}r into -lar or -ler.

What we basically want to say is "if the stem contains front vowels, then we want the front vowel alternation, if it contains back vowels then we want the back vowel alternation". And at the same time, remove the morpheme boundary. So let's give it a shot.

Enter the twol file apertium-yyy.yyy.twol.

Edit the alphabet as appropriate for your language.

Alphabet
 A B Ç D E Ä F G H I J Ž K L M N Ň O Ö P R S Ş T U Ü W Y Ý Z
 a b ç d e ä f g h i j ž k l m n ň o ö p r s ş t u ü w y ý z
 %{A%}:a ;

The pre-built template should already have everything set up for you, so make -j and let's move on.
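For reference, a vowel-harmony rule in twolc syntax looks roughly like the following. This is a hedged sketch, not the template's actual contents: it assumes the archiphoneme {A} (with default realisation a, as declared in the Alphabet above), a morpheme-boundary symbol %>, and illustrative Sets that you would adjust to your own alphabet:

Sets

FrontVow = ä e i ö ü ;
Cns = b ç d f g h j ž k l m n ň p r s ş t w z ;

Rules

"Front harmony for archiphoneme {A}"
%{A%}:e <=> :FrontVow :Cns* %>: _ ;

Read roughly as: {A} surfaces as e if and only if it is preceded by a front vowel, any number of consonants, and the morpheme boundary; everywhere else the Alphabet's default {A}:a applies.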

You can test that your newly added word works with hfst-proc.

$ echo "word" | hfst-proc yyy.automorf.hfst.ol

Or using the apertium interface to analysis:

$ echo "wordhere" | apertium -d . yyy-morph

Remember to check out Starting a new language with HFST for additional detail.

Summary

You should now have a module (directory) for each language. The module should include one of the following user-edited files:

  • apertium-xxx.xxx.dix which contains a (very) basic xxx morphological dictionary, or
  • apertium-yyy.yyy.lexc which contains a (very) basic yyy morphological dictionary.

Bilingual dictionary

See also: Bilingual dictionary

So we now have two morphological dictionaries, next thing to make is the bilingual dictionary. This describes mappings between words. During translation, one word will be translated to its corresponding word as specified in this dictionary.

Add an entry to translate between the two words, as specified in the monolingual dictionaries. Something like:

<e><p><l>xxx<s n="n"/></l><r>yyy<s n="n"/></r></p></e>

There are going to be a lot of these entries, so it would be wise to write every entry on one line to facilitate easier reading of the file.

So, once this is done, run make -j to compile the bilingual dictionary.

You're done with the addition of dictionaries!


Testing of dictionaries

This is a good checkpoint to test if your language pair works, and that you have set up everything correctly up till this point.

Make sure that you are in the folder with the language pair, and type an echo command:

echo '^house<n><sg>$' | lt-proc -b xxx-yyy.autobil.bin

If this works, great! We move on to the next section: The addition of transfer rules.

Transfer rules

See also: Transfer and A long introduction to transfer rules

So, now we have two morphological dictionaries, and a bilingual dictionary. All that we need now is a transfer rule for nouns. Transfer rules are necessary in sentence-based translation to translate longer chunks of text grammatically. They also help assign the correct tags for the part-of-speech tagger that is necessary in the development of a language pair.

If you need to implement a rule it is often a good idea to look in the rule files of other language pairs first. Many rules can be recycled/reused between languages.

Add a rule to take in the nouns and then output it in the correct form.

I will use the example of a language pair English-Malay (eng-zlm) to demonstrate a simple transfer rule. In English, the adjective comes before the noun, such as "happy boy". However, in Malay, the noun comes before the adjective, such as "lelaki gembira" (boy happy). Remember to modify the rule as is appropriate for your language.

Note: You need to add the words that you want to test to the dictionaries first, before you can test them!

<rule comment="REGLA: adj nom">  <!--happy boy-->
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <chunk name="j_n" case="caseFirstWord">
            <lu>
              <clip pos="2" side="tl" part="whole"/> <!--pos 2 refers to the second word in the rule ("boy"). It is now placed at the front.-->
            </lu>
            <b/> <!-- this is the space to indicate the word should be separated-->
            <lu>
              <clip pos="1" side="tl" part="whole"/> <!--pos 1 refers to the first word in the rule ("happy"). It is now placed at the back.-->  
            </lu>
          </chunk>
        </out>
      </action>
    </rule>

For each pattern, there is an associated action, which produces an associated output (out). The output is a lexical unit (lu).

The clip tag allows a user to select and manipulate attributes and parts of the source language (side="sl"), or target language (side="tl") lexical item. This can get very complicated, so let's move on. Further elaboration can be found in A long introduction to transfer rules.

Let's compile it and test it. make -j.

Now we're ready to test our machine translation system.

$ echo "happy boy" | apertium -d . eng-zlm
lelaki gembira


And c'est ça. You now have a machine translation system that translates words between your chosen two languages. You have also learned how to write a basic transfer rule that grammatically rearranges longer chunks of words in the target language. Obviously this isn't very useful yet, but you'll get onto the more complex stuff soon.

Think of a few other words that inflect the same way as the nouns in your dictionaries (in our case, "xxx" and "yyy"). How about adding those? You can link their entries to the paradigm you wrote earlier.

Conclusion

See also: Contributing to an existing pair, Quick and dirty guide addendum: other important things

Congratulations on creating your new infant language pair! Your journey is (obviously) not over yet, though! There is a long way to go before the language pair becomes stable. Steps include: adding more words to the dictionaries, adding more transfer rules to accommodate them, training the part-of-speech tagger, bug and error testing, and more. See the next guide as well as Contributing to an existing pair for more advanced instructions on how to expand your language pair.

As usual, feel free to browse the wiki for more resources, or proceed to IRC for help.

We wish you the best of luck in the creation of your new language pair!

See also