Difference between revisions of "Matxin 1.0 New Language Pair HOWTO"
Revision as of 14:02, 13 June 2009
This page intends to give a step-by-step walk-through of how to create a new translator in the Matxin platform.
Prerequisites
- Main article: Matxin
This page does not give instructions on installing Matxin, but presumes that the following packages are correctly installed.
- lttoolbox (from SVN)
- Freeling (from SVN)
- Matxin (from SVN)
- a text editor (or a specialised XML editor if you prefer)
- The fl-* tools for Freeling (for the moment these can be found in apertium-tools/freeling in apertium SVN)
Overview
As mentioned in the lead, this page intends to give a step-by-step guide to creating a new language pair with Matxin from scratch. No programming knowledge is required; all that needs to be defined are some dictionaries and grammars. The Matxin platform is described in detail in Documentation of Matxin and on the Matxin homepage. This page will only focus on the creation of a new language pair, and will avoid theoretical and methodological issues.
The language pair for the tutorial will be Breton to English. This pair has been chosen because the two languages have fairly divergent word order (Breton is fairly free, allowing VSO, OVS and SVO, whereas English is fairly uniformly SVO), which can show some of the advantages that Matxin has over Apertium.
Getting started
Analysis
The analysis process in Matxin is done by Freeling, a free/open-source suite of language analysers. The analysis is done in four stages, requiring four (or more) sets of separate files. The first is the morphological dictionary, which is basically a full-form list (e.g. Speling format) compiled into BerkeleyDB format. There are then files for word-category disambiguation and for specifying chunking and dependency rules. There are two more stages that come before morphological analysis, tokenisation and sentence splitting, but for the purposes of this tutorial they will be considered along with morphological analysis.
Normally a single program is used to do all the different stages of analysis, taking as input plain or deformatted text and outputting a dependency analysis. The behaviour of this program is controlled by a file called config.cfg. In Matxin this program is called Analyzer; however, in the following stages we'll be using separate tools, and will leave creating the config file until the last minute as it can get overly complicated.
As companion reading to this section the Freeling documentation is highly recommended. This tutorial skips over features of Freeling which are not necessary for making a basic MT system with Matxin.
Morphological
In order to create your morphological analyser in Freeling you basically need to make a full-form list. If there is already an Apertium dictionary for the language, you can use the scripts in apertium SVN (module apertium-tools/freeling) to generate the list from it. If not, then either build the list from scratch, or build a dictionary in lttoolbox and then generate the list from that.
For the purposes of this exercise, you can just key in a small dictionary manually. We'll call the dictionary matxin-br-en.br.dicc, and it will contain:
ul un DI0CN0
un un DI0CN0
ur un DI0CN0
yezhoù yezh NCFPV0
yezh yezh NCFSV0 yezh AQ0CN0
indezeuropek indezeuropek AQ0CN0
eo bezañ VMIP3S0
al an DA0CN0
ar an DA0CN0
an an DA0CN0
brezhoneg brezhoneg NCMSV0
prezhoneg brezhoneg NCMSV0
vrezhoneg brezhoneg NCMSV0
. . Fp
The file is space-separated, with three or more columns. The first column holds the surface form of the word; the remaining columns hold one or more pairs of lemma and Parole-style analysis. An ambiguous form such as yezh simply lists several pairs on the same line.
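To make the column layout concrete, here is a minimal Python sketch that splits one dictionary line into its surface form and its lemma/tag pairs. The function name parse_dicc_line is hypothetical and is not part of Freeling; the sketch only assumes the layout described above.

```python
def parse_dicc_line(line):
    """Split a full-form dictionary line into (surface, [(lemma, tag), ...]).

    Hypothetical helper for illustration only; it assumes the layout
    'surface lemma tag [lemma tag ...]' separated by whitespace.
    """
    fields = line.split()
    # A valid line has one surface form plus at least one lemma/tag pair,
    # so the number of fields must be odd and at least three.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError(f"expected a surface form plus lemma/tag pairs: {line!r}")
    surface, rest = fields[0], fields[1:]
    return surface, list(zip(rest[0::2], rest[1::2]))


if __name__ == "__main__":
    print(parse_dicc_line("yezh yezh NCFSV0 yezh AQ0CN0"))
    print(parse_dicc_line("eo bezañ VMIP3S0"))
```

Running a check like this over the whole file before compiling it can catch lines with a missing lemma or tag.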
After we've keyed this in, we can compile it to BerkeleyDB format using the tool indexdict from the Freeling utilities. It is worth noting that Freeling currently only supports the latin1 character encoding, so if you're working in UTF-8, convert the dictionary to latin1 first.
$ cat matxin-br-en.br.dicc | iconv -f utf-8 -t latin1 | indexdict br-en.br.db
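Because the conversion to latin1 fails on any character outside that encoding, it can help to find the offending lines before running iconv. The following Python sketch does that; the function name non_latin1_lines and the third sample entry (which uses a typographic apostrophe) are made up for illustration.

```python
def non_latin1_lines(text):
    """Return (line_number, line) pairs that cannot be encoded as latin1."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            line.encode("latin-1")
        except UnicodeEncodeError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # The first two entries come from the tutorial dictionary; the third is a
    # hypothetical entry containing U+2019, which has no latin1 equivalent.
    sample = (
        "yezhoù yezh NCFPV0\n"
        "eo bezañ VMIP3S0\n"
        "c\u2019hoar c\u2019hoar NCFSV0\n"
    )
    for number, line in non_latin1_lines(sample):
        print(f"line {number} is not latin1-safe: {line}")
```

Letters such as ù and ñ are part of latin1, so ordinary Breton text converts cleanly; characters pasted in from word processors are the more likely source of trouble.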
Now you should have two files: matxin-br-en.br.dicc, which is the dictionary source, and br-en.br.db, which is the dictionary in BerkeleyDB format. However, we cannot use this analyser without specifying a tokeniser and a splitter. These files define how the input will be split into words and sentences. For now we'll use a minimal configuration file for the splitter, so put the following in the file matxin-br-en.spt.dat:
<SentenceEnd>
. 0
</SentenceEnd>
Of course, other end-of-sentence punctuation such as '?' and '!' could also be put in there. And now for the word tokeniser, which we'll put in matxin-br-en.tok.dat:
<Macros>
ALPHANUM [^\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\s\-]
OTHERS [\]<>[(\.,";:?!'`)^@~|}{_/\\+=&$#*+%\-]
</Macros>
<RegExps>
WORD 0 {ALPHANUM}+
OTHERS_C 0 {OTHERS}+
</RegExps>
The macros define regular expressions which are used to tokenise the input into words and punctuation. The regular expression WORD is defined as a sequence of one or more ALPHANUM, which in turn is defined as anything except a punctuation character or whitespace.
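The effect of the two tokeniser rules can be approximated in Python with the re module. This is only an emulation under simplifying assumptions, not Freeling's tokeniser: the names PUNCT, TOKEN and tokenise are invented here, and the punctuation set is copied from the OTHERS macro.

```python
import re

# Punctuation characters copied from the OTHERS macro in matxin-br-en.tok.dat.
PUNCT = "]<>[(.,\";:?!'`)^@~|}{_/\\+=&$#*%-"

# OTHERS: any one of the punctuation characters.
OTHERS = "[" + re.escape(PUNCT) + "]"
# ALPHANUM: anything that is neither punctuation nor whitespace.
ALPHANUM = "[^" + re.escape(PUNCT) + r"\s]"

# WORD is one or more ALPHANUM; OTHERS_C is one or more OTHERS.
TOKEN = re.compile(f"{ALPHANUM}+|{OTHERS}+")


def tokenise(text):
    """Return the list of word and punctuation tokens found in text."""
    return TOKEN.findall(text)


if __name__ == "__main__":
    print(tokenise("Ur yezh eo ar brezhoneg."))
```

Whitespace matches neither pattern, so it is skipped and acts purely as a separator between tokens.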
So now if we want to morphologically analyse a sentence, we just do:
$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | iconv -f latin1
Ur un DI0CN0 -1 -1
yezh yezh NCFSV0 -1 yezh AQ0CN0 -1
eo bezañ VMIP3S0 -1 -1
ar an DA0CN0 -1 -1
brezhoneg brezhoneg NCMSV0 -1 -1
. . Fp -1
Category disambiguation
After we have working morphological analysis, the next stage is to create a part-of-speech tagger. Freeling offers various ways to do this: both HMM-based and Relax Constraint Grammar (RelaxCG) based taggers are supported. We're going to demonstrate how to create a RelaxCG tagger, as it is easier and does not require tagger training.
Our tagger will be very simple, as we only have one ambiguity: yezh 'language' can be a noun or an adjective. As adjectives come after the noun in Breton, we'll weight adjectives after determiners very low:
SETS

CONSTRAINTS

%% after a determiner down-weight adjective
-8.0 AQ* (-1 D*);
The file (which we will call matxin-br-en.br-en.rcg) is made up of two sections. The first, SETS, defines any sets of tags or lemmas, much like LIST and SET in VISL Constraint Grammar taggers. The second section defines a series of weighted constraints, in the format of 'weight', followed by a space, followed by the tag, followed by another space and then the context. The context is defined as a series of positions relative to the tag in question. In the constraint above, (-1 D*) refers to the word immediately to the left carrying a determiner tag.
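The following Python sketch illustrates how a single weighted constraint can tip the choice between two candidate tags. It is a deliberately simplified left-to-right scoring scheme, not the relaxation-labelling procedure Freeling actually runs; the names CONSTRAINTS, score and disambiguate are hypothetical.

```python
from fnmatch import fnmatchcase

# Each constraint: (weight, target tag pattern, [(relative position, tag pattern)]).
# This mirrors the single rule "-8.0 AQ* (-1 D*);" from the tutorial.
CONSTRAINTS = [(-8.0, "AQ*", [(-1, "D*")])]


def score(tag, index, chosen_tags):
    """Sum the weights of all constraints whose target and context match."""
    total = 0.0
    for weight, target, context in CONSTRAINTS:
        if not fnmatchcase(tag, target):
            continue
        matches = all(
            0 <= index + offset < len(chosen_tags)
            and fnmatchcase(chosen_tags[index + offset], pattern)
            for offset, pattern in context
        )
        if matches:
            total += weight
    return total


def disambiguate(sentence):
    """Pick one tag per word; sentence is a list of candidate-tag lists."""
    chosen = []
    for index, candidates in enumerate(sentence):
        # Simplification: only tags already chosen to the left are visible.
        best = max(candidates, key=lambda tag: score(tag, index, chosen))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Candidate tags for "Ur yezh eo ar brezhoneg ."
    words = [["DI0CN0"], ["NCFSV0", "AQ0CN0"], ["VMIP3S0"],
             ["DA0CN0"], ["NCMSV0"], ["Fp"]]
    print(disambiguate(words))
```

For yezh, the adjective reading receives -8.0 because the previous tag matches D*, so the noun reading with a score of 0 is kept.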
So, using this file we should be able to get disambiguated output:
$ echo "Ur yezh eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
    fl-tagger matxin-br-en.br-en.rcg | iconv -f latin1
Ur un DI0CN0 -1
yezh yezh NCFSV0 -1
eo bezañ VMIP3S0 -1
ar an DA0CN0 -1
brezhoneg brezhoneg NCMSV0 -1
. . Fp -1
Chunking
So, after tagging, the next stage is chunk parsing. This is somewhat like the chunking available in Apertium; however, no transfer takes place, it just groups words into chunks. The grammar format is quite familiar: the left side shows the non-terminal, and the right side can be either terminal (in the case of a tag, e.g. NCM*) or non-terminal (in the case of n-m). This extremely simple grammar will chunk the tagged input into constituents (noun phrases, sn, and verbs) for later use by the dependency parser. It should be fairly straightforward: | separates alternatives, and + marks the governor, or head, of the chunk.
n-m ==> NCM* .
n-f ==> NCF* .
adj ==> AQ* .
def ==> DA0CN0 .
indef ==> DI0CN0 .
verb ==> V* .
punt ==> Fp .
sn ==> def, +n-f, adj | def, +n-f | +n-f, adj | +n-f .
sn ==> def, +n-m, adj | def, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-m, adj | indef, +n-m | +n-m, adj | +n-m .
sn ==> indef, +n-f, adj | indef, +n-f | +n-f, adj | +n-f .
@START S.
The @START directive states that the start node of the sentence should be labelled S. So, the output of this grammar will be:
$ echo "Ur yezh indezeuropek eo ar brezhoneg." | fl-morph matxin-br-en.tok.dat matxin-br-en.spt.dat br-en.br.db | \
    fl-tagger matxin-br-en.br-en.rcg | fl-chunker matxin-br-en.br-en.gram | iconv -f latin1
S_[
  sn_[
    indef_[
      +(Ur un DI0CN0)
    ]
    +n-f_[
      +(yezh yezh NCFSV0)
    ]
    adj_[
      +(indezeuropek indezeuropek AQ0CN0)
    ]
  ]
  verb_[
    +(eo bezañ VMIP3S0)
  ]
  sn_[
    def_[
      +(ar an DA0CN0)
    ]
    +n-m_[
      +(brezhoneg brezhoneg NCMSV0)
    ]
  ]
  punt_[
    +(. . Fp)
  ]
]
Note the sentence is chunked into sn verb sn. It might be worth playing around a bit with the grammar to get a better feel for it.
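One way to build intuition for the chunk grammar is to emulate it on a list of tags. The Python sketch below maps tags to the preterminals of the grammar and then greedily matches the longest sn alternative. It is a toy under stated assumptions, not Freeling's chart parser; the names PRETERMINALS, SN_RULES, label and chunk are hypothetical.

```python
from fnmatch import fnmatchcase

# Preterminal rules from the grammar: name ==> tag pattern.
PRETERMINALS = [
    ("n-m", "NCM*"), ("n-f", "NCF*"), ("adj", "AQ*"),
    ("def", "DA0CN0"), ("indef", "DI0CN0"), ("verb", "V*"), ("punt", "Fp"),
]

# All sn alternatives from the grammar, longest first; "+" marks the head.
SN_RULES = sorted(
    (
        alternative
        for det in ("def", "indef")
        for noun in ("n-f", "n-m")
        for alternative in (
            [det, "+" + noun, "adj"], [det, "+" + noun],
            ["+" + noun, "adj"], ["+" + noun],
        )
    ),
    key=len,
    reverse=True,
)


def label(tag):
    """Return the preterminal name for a tag, or None if nothing matches."""
    for name, pattern in PRETERMINALS:
        if fnmatchcase(tag, pattern):
            return name
    return None


def chunk(tags):
    """Greedily group tags into (chunk label, member labels, head label)."""
    labels = [label(tag) for tag in tags]
    chunks, i = [], 0
    while i < len(labels):
        for rule in SN_RULES:
            names = [item.lstrip("+") for item in rule]
            if labels[i:i + len(names)] == names:
                head = next(item[1:] for item in rule if item.startswith("+"))
                chunks.append(("sn", names, head))
                i += len(names)
                break
        else:
            # No sn rule applies here, so the word forms a chunk on its own.
            chunks.append((labels[i], [labels[i]], labels[i]))
            i += 1
    return chunks


if __name__ == "__main__":
    tags = ["DI0CN0", "NCFSV0", "AQ0CN0", "VMIP3S0", "DA0CN0", "NCMSV0", "Fp"]
    for chunk_label, members, head in chunk(tags):
        print(chunk_label, members, "head:", head)
```

Applied to the tags of the example sentence, the sketch yields the same sn verb sn grouping (plus the final punctuation) as the chunker output shown above, with the noun as head of each sn.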