Matxin 1.0 New Language Pair HOWTO

This page intends to give a step-by-step walk-through of how to create a new translator in the Matxin platform.


Main article: Matxin

This page does not give instructions on installing Matxin, but presumes that the following packages are correctly installed.

  • lttoolbox (from SVN)
  • Freeling (from SVN)
  • Matxin (from SVN)
  • a text editor (or a specialised XML editor if you prefer)
  • The fl-* tools for Freeling


As mentioned in the lead, this page intends to give a step-by-step guide to creating a new language pair with Matxin from scratch. No programming knowledge is required; all that needs to be defined are some dictionaries and grammars. The Matxin platform is described in detail in Documentation of Matxin and on the Matxin homepage. This page focuses only on the creation of a new language pair and avoids theoretical and methodological issues.

The language pair for the tutorial will be Breton to English. This pair was chosen because the two languages have fairly divergent word order (Breton is fairly free, allowing VSO, OVS and SVO, whereas English is fairly uniformly SVO), which can show some of the advantages that Matxin has over Apertium.

Getting started


The analysis process in Matxin is done by Freeling, a free/open-source suite of language analysers. The analysis is done in four stages, requiring four (or more) separate files. The first is the morphological dictionary, which is basically a full-form list (e.g. in Speling format) compiled into BerkeleyDB format. There are then files for word-category disambiguation and for specifying chunking and dependency rules.


In order to create your morphological analyser in Freeling you basically need to make a full-form list. If there is already an Apertium dictionary for the language, you can use the scripts in Apertium SVN (module apertium-tools/freeling) to generate a dictionary. If not, either build the list from scratch, or build a dictionary in lttoolbox and then generate the list from it.

For the purposes of this exercise, you can just key in a small dictionary manually. The dictionary will contain:

ul un DI0CN0 
un un DI0CN0 
ur un DI0CN0 
yezhoù yezh NCFPV0 
yezh yezh NCFSV0 yezh AQ0CN0
indezeuropek indezeuropek AQ0CN0 
eo bezañ VMIP3S0 
al an DA0CN0 
ar an DA0CN0 
an an DA0CN0
brezhoneg brezhoneg NCMSV0 
prezhoneg brezhoneg NCMSV0 
vrezhoneg brezhoneg NCMSV0 
. . Fp

The file is space-separated, with three or more columns. The first column is the surface form of the word; the remaining columns are a list of lemmas, each followed by its PAROLE-style analysis.
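The column layout described above can be checked with a few lines of Python. This is only an illustrative sketch, not part of Freeling or Matxin: the function name parse_entry is hypothetical, and it assumes that every column after the surface form belongs to a lemma/tag pair, as in the sample entries.

```python
def parse_entry(line):
    """Split one dictionary line into (surface_form, [(lemma, tag), ...]).

    Assumption: after the surface form, columns alternate lemma, tag.
    """
    fields = line.split()
    # A valid line has the surface form plus one or more lemma/tag pairs,
    # so the total number of columns is odd and at least three.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"expected a form plus lemma/tag pairs: {line!r}")
    surface, rest = fields[0], fields[1:]
    # Pair up the remaining columns: lemma, tag, lemma, tag, ...
    analyses = list(zip(rest[0::2], rest[1::2]))
    return surface, analyses


print(parse_entry("yezh yezh NCFSV0 yezh AQ0CN0"))
print(parse_entry(". . Fp"))
```

Running a check like this over the whole file before compiling it is a cheap way to catch a missing tag or a stray column.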

After we've keyed this in, we can compile it to BerkeleyDB format using the indexdict tool from the Freeling utilities. Note that Freeling currently supports only the latin1 character encoding, so if you're working in UTF-8, convert the dictionary to latin1 first.

$ cat | iconv -f utf-8 -t latin1 | indexdict
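Because the conversion to latin1 fails on any character outside that encoding, it can be useful to look for such characters beforehand. The following Python sketch is an optional helper and not part of the Freeling tools; the function name non_latin1_chars is hypothetical.

```python
def non_latin1_chars(text):
    """Return the sorted list of characters in text that latin1 cannot encode."""
    bad = set()
    for ch in text:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


# The accented letters in the sample dictionary (ù, ñ) exist in latin1.
sample = "yezhoù yezh NCFPV0\neo bezañ VMIP3S0\n"
print(non_latin1_chars(sample))

# A typographic apostrophe (U+02BC) would not survive the conversion.
print(non_latin1_chars("c\u02bchoant"))
```

Any character this reports would need to be replaced in the dictionary source before it is converted.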

Now you should have two files: the dictionary source and the dictionary in BerkeleyDB format. However, we cannot use this analyser without specifying a tokeniser and a splitter. These files define how words and sentences will be tokenised. For now we'll use a minimal configuration file for the splitter, so put the following in the file matxin-br-en.spt.dat

. 0

And now for the word tokeniser, which we'll put in matxin-br-en.tok.dat

ALPHANUM   [^\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\s\-]
OTHERS     [\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\-]
WORD             0  {ALPHANUM}+
OTHERS_C         0  {OTHERS}+

The macros define regular expressions which are used to tokenise the input into words and punctuation. The regular expression WORD is defined as a sequence of one or more ALPHANUM, which in turn is defined as anything except a punctuation or whitespace character. OTHERS_C likewise matches a sequence of one or more punctuation characters.
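To get a feel for what these rules do, the ALPHANUM and OTHERS macros can be approximated with Python's re module. This is a rough sketch for experimentation only: it is not how Freeling implements its tokeniser, the names PUNCT, TOKEN and tokenise are hypothetical, and the rule priorities of the real tokeniser are not modelled.

```python
import re

# The punctuation characters listed in the OTHERS macro.
PUNCT = "]<>[(.,\";:?!'`)^@~|}{_/\\+=&$#*%-"

# OTHERS: any one punctuation character.
OTHERS = "[" + re.escape(PUNCT) + "]"
# ALPHANUM: anything that is neither punctuation nor whitespace.
ALPHANUM = "[^" + re.escape(PUNCT) + r"\s]"

# WORD is one or more ALPHANUM; OTHERS_C is one or more OTHERS.
TOKEN = re.compile(f"{ALPHANUM}+|{OTHERS}+")


def tokenise(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN.findall(text)


print(tokenise("Ur yezh eo ar brezhoneg."))
```

Note that, under these rules, an apostrophe counts as punctuation, so a form written with one is split into several tokens.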

So now if we want to morphologically analyse a sentence, we just do:

$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat  | iconv -f latin1
Ur un DI0CN0 -1   -1
yezh yezh NCFSV0 -1 yezh AQ0CN0 -1
eo bezañ VMIP3S0 -1   -1
ar an DA0CN0 -1   -1
brezhoneg brezhoneg NCMSV0 -1   -1
. . Fp -1
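In the output above, yezh receives two analyses (noun and adjective), which is the kind of ambiguity that category disambiguation has to resolve. The Python sketch below lists such ambiguous forms. It assumes, based only on the sample output, that each line is the surface form followed by lemma, tag and probability triples, and it ignores any leftover trailing field; the function names are hypothetical.

```python
def parse_analysis_line(line):
    """Return (form, [(lemma, tag, probability), ...]) for one output line.

    Assumption: analyses come as lemma/tag/probability triples after the
    form; an incomplete trailing group is ignored.
    """
    fields = line.split()
    form, rest = fields[0], fields[1:]
    analyses = [
        (rest[i], rest[i + 1], float(rest[i + 2]))
        for i in range(0, len(rest) - 2, 3)
    ]
    return form, analyses


def ambiguous_forms(lines):
    """Return the forms that have more than one analysis."""
    result = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        form, analyses = parse_analysis_line(line)
        if len(analyses) > 1:
            result.append(form)
    return result


output = [
    "Ur un DI0CN0 -1   -1",
    "yezh yezh NCFSV0 -1 yezh AQ0CN0 -1",
    "eo bezañ VMIP3S0 -1   -1",
    ". . Fp -1",
]
print(ambiguous_forms(output))
```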

Category disambiguation


Dependency parsing