Matxin New Language Pair HOWTO
This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.
Preliminaries
Make a directory called matxin-tur-eng. Then make two more directories, matxin-tur and matxin-eng.
Note that if you are following this HOWTO for your own language pair, then tur should be the ISO 639-3 code of the source language and eng should be the ISO 639-3 code of the target language.
Analysis
There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to use Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is outside the scope of this HOWTO, but for more information, check out the following pages:
So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Save this output into a file, perhaps called input.txt. We'll need it later.
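If you want to inspect that output programmatically, the following is a minimal sketch (not part of Matxin or Apertium) that splits a disambiguated line into surface form, lemma and tags. It assumes exactly one reading per token and no escaped characters; the names parse_stream, LEXICAL_UNIT and TAG are made up for this illustration.

```python
import re

# One lexical unit in the stream format looks like ^surface/analysis$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_stream(line):
    """Split a disambiguated line into (surface, lemma, tags) triples.

    Assumes one reading per token, as in the example sentence.
    """
    tokens = []
    for surface, analysis in LEXICAL_UNIT.findall(line):
        lemma = analysis.split("<", 1)[0]   # everything before the first tag
        tags = TAG.findall(analysis)        # tag names without angle brackets
        tokens.append((surface, lemma, tags))
    return tokens


line = ("^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ "
        "^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ "
        "^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$")

tokens = parse_stream(line)
for surface, lemma, tags in tokens:
    print(surface, lemma, tags)
```

Each printed row corresponds to one word of the sentence, which makes it easy to see which tags the rules in the next steps can refer to.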
Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:
DELIMITERS = "." ;
Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.
LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST N = n ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms
Etc. It might seem a bit odd to just be repeating the tag in title case, but the point is that it lets us distinguish the set names used in the CG rules from the tags that come from the morphological analyser.
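One detail in those lists is worth spelling out: (prn pers) in parentheses is a single composite tag that only matches when both tags are present, while fut aor past are three alternatives, any one of which is enough. The toy model below is only an illustration of that matching behaviour, not how CG is implemented; LISTS and matches are invented names.

```python
# Each list is a sequence of alternatives. Each alternative is a set of tags
# that must ALL be present on the reading (a composite tag).
LISTS = {
    "Adv":  [{"adv"}],
    "Pers": [{"prn", "pers"}],               # LIST Pers = (prn pers) ;
    "Fin":  [{"fut"}, {"aor"}, {"past"}],    # LIST Fin = fut aor past ;
}


def matches(list_name, reading_tags):
    """True if any alternative of the list is fully contained in the reading."""
    tags = set(reading_tags)
    return any(alternative <= tags for alternative in LISTS[list_name])


benim = ["prn", "pers", "p1", "sg", "gen"]
icecegim = ["v", "tv", "fut", "p1", "sg"]

print(matches("Pers", benim))            # both prn and pers are present
print(matches("Pers", ["prn", "dem"]))   # prn alone is not enough
print(matches("Fin", icecegim))          # fut is one of the alternatives
```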
So, now we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.
After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:
SECTION
In constraint grammar, all rules come in sections.
So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:
MAP @advmod TARGET Adv ;
The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.
So now let's save the file and try it out! First though we need to compile the rules:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
And now try it out:
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Great, it works... so now we can write basically the same rule for the verbal adjective aldığın, which should get an @acl tag, the postposition, which should get a @case tag, the accusative, which should get a @dobj tag, and the sentence-final punctuation, which should get a @punct tag.
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
Save it and try it again:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
Great, now for a couple of harder relations: the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:
MAP @nmod TARGET Pers IF (1 Post) ;
And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
So try those two rules out and we should have a fully labelled input sentence:
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$
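To make the two context conditions concrete, here is a rough Python imitation of just those two rules. It is a sketch of the logic only and does not reproduce how CG evaluates rules; the helper names has, is_fin and map_relations are invented for this example.

```python
def has(token, *tags):
    """True if the token carries all of the given tags."""
    return all(t in token["tags"] for t in tags)


def is_fin(token):
    """Imitates LIST Fin = fut aor past ;"""
    return any(t in token["tags"] for t in ("fut", "aor", "past"))


def map_relations(tokens):
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # MAP @nmod TARGET Pers IF (1 Post) ;
        if has(tok, "prn", "pers") and nxt is not None and has(nxt, "post"):
            tok["rel"] = "@nmod"
        # MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
        elif (is_fin(tok)
              and not any(is_fin(t) for t in tokens[:i])   # NOT -1* Fin
              and nxt is not None and has(nxt, "sent")):   # 1 Sent
            tok["rel"] = "@root"
    return tokens


sentence = [
    {"form": "Dün", "tags": ["adv"]},
    {"form": "benim", "tags": ["prn", "pers", "p1", "sg", "gen"]},
    {"form": "için", "tags": ["post"]},
    {"form": "aldığın", "tags": ["v", "tv", "gpr_past", "px2sg"]},
    {"form": "birayı", "tags": ["n", "acc"]},
    {"form": "içeceğim", "tags": ["v", "tv", "fut", "p1", "sg"]},
    {"form": ".", "tags": ["sent"]},
]

map_relations(sentence)
labelled = {t["form"]: t["rel"] for t in sentence if "rel" in t}
print(labelled)
```

Note that aldığın is not treated as finite here: its tag is gpr_past, which is a different tag from past, just as in the Fin list.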
That covers our sentence, but we want to add one more rule "just in case" ... and that is a default rule that adds the relation @dep to any token that isn't covered by the other rules:
MAP (@dep) TARGET (*) ;
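The default rule works as a fallback because, as far as the CG-3 documentation describes it, MAP does not touch readings that have already been mapped, so it has to come after the more specific rules. The snippet below is a toy imitation of that behaviour under that assumption; apply_map is an invented helper, and the token ve with the tag cnjcoo is a hypothetical word that none of our rules cover.

```python
def apply_map(tokens, relation, target):
    """Toy MAP: add `relation` to every unmapped token accepted by `target`."""
    for tok in tokens:
        already_mapped = any(t.startswith("@") for t in tok["tags"])
        if not already_mapped and target(tok):
            tok["tags"].append(relation)


tokens = [
    {"form": "Dün", "tags": ["adv"]},
    {"form": "ve", "tags": ["cnjcoo"]},   # hypothetical token no rule covers
]

# MAP @advmod TARGET Adv ;
apply_map(tokens, "@advmod", lambda t: "adv" in t["tags"])
# MAP (@dep) TARGET (*) ;   must come last, it accepts everything
apply_map(tokens, "@dep", lambda t: True)

for tok in tokens:
    print(tok["form"], tok["tags"])
```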
Tree building
Now that we have a fully labelled sentence, we can start building the tree. First make another section:
SECTION
The first thing we want to do is attach the root node:
SETPARENT @root TO (@0 (*)) ;
This basically says, set the parent of the root node to node 0, which is the invisible root that CG uses. Next we want to attach all of the rest of the nodes to this root node:
SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;
Here we have an extra condition (NEGATE p (*)) which means that the matched node should not already have a parent.
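The effect of these two rules on our sentence can be pictured as filling in a table of parents. The following is only a sketch of that idea in Python, not of CG's internals; relations and parent are names made up for the illustration.

```python
# Relations for tokens 1..7 of the example sentence, as mapped in the first section.
relations = {1: "@advmod", 2: "@nmod", 3: "@case", 4: "@acl",
             5: "@dobj", 6: "@root", 7: "@punct"}

parent = {}  # token position -> parent position (0 is the invisible root)

# SETPARENT @root TO (@0 (*)) ;
for pos, rel in relations.items():
    if rel == "@root":
        parent[pos] = 0

# SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;
root = next(pos for pos, rel in relations.items() if rel == "@root")
for pos in relations:
    if pos not in parent:      # NEGATE p (*): the node has no parent yet
        parent[pos] = root

for pos in sorted(parent):
    print(pos, "->", parent[pos])
```

Every word except the root ends up attached directly to token 6, which is the flat tree we should expect to see at this stage.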
Let's test the rules, so save the file, and go to the terminal:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="X" alloc="Y">
<NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
  <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
  <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod"/>
  <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
  <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
  <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj"/>
  <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
</NODE>
</SENTENCE>
</corpus>
Note how here we have passed the -f 2 parameter to the cg-proc program, so the output is now in Matxin XML format.
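Since this format is plain XML, it can be read with any XML library. As a small sketch, the Python below uses the standard xml.etree.ElementTree module to list each node with its relation and the position of its head. The XML is a shortened copy of the output above, and walk is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the cg-proc -f 2 output, with only three of the nodes.
xml_text = """<corpus>
<SENTENCE ord="X" alloc="Y">
<NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
  <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
  <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
</NODE>
</SENTENCE>
</corpus>"""


def walk(node, head="0"):
    """Yield (ord, form, relation, head ord) for a NODE and its descendants."""
    yield node.get("ord"), node.get("form"), node.get("si"), head
    for child in node.findall("NODE"):
        yield from walk(child, node.get("ord"))


sentence = ET.fromstring(xml_text).find("SENTENCE")
rows = [row for top in sentence.findall("NODE") for row in walk(top)]
for row in rows:
    print(*row, sep="\t")
```

The nesting of NODE elements is the dependency tree: a node's parent element is its head, and the top-level NODE hangs off the invisible root 0.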
This is a pretty sad looking tree :( ... But fortunately using rules we can make it a happy tree! One that correctly implements the annotation guidelines.
So, let's think of some rules:
- The direct object should depend on the finite verb
- A postposition should depend on its head
- A relative clause should modify (depend on) a noun
- An adverb should modify a verb
Let's start with the first one:
SETPARENT @dobj TO (1* Fin) ;
This rule says that the parent of the direct object should be the first finite verb anywhere to the right. Next up:
SETPARENT @case TO (-1 Pers) ;
Set the parent of the word with the @case label to be the previous personal pronoun. And then:
SETPARENT @acl TO (1 N) ;
Set the parent of the word with the @acl label to be the following noun.
Let's save the file and apply the rules:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="X" alloc="Y">
<NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
  <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
  <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
    <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
  </NODE>
  <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
    <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
  </NODE>
  <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
</NODE>
</SENTENCE>
</corpus>
This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase benim için and the adverb Dün to the appropriate verb, which in this case is the head of the relative clause aldığın.
So we could specify the rule something like: Set the parent of a nominal modifier or an adverb to be the first verb to the right. This rule happens to work in this case, but is not very robust.
SETPARENT @advmod TO (1* V BARRIER V) ;
SETPARENT @nmod TO (1* V BARRIER V) ;
We can test these rules and see the output:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="X" alloc="Y">
<NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
  <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
    <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl">
      <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
        <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
      </NODE>
    </NODE>
  </NODE>
  <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
</NODE>
</SENTENCE>
</corpus>
Yesss! Now we have a nice tree ready to be translated!
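For readers who find the scanning positions hard to picture, here is a toy Python version of a rightward scan with a barrier. It only imitates the idea behind (1* V BARRIER V) and (1* Fin); it assumes the target is checked before the barrier on each token, and scan_right and the reduced tag sets are invented for this sketch.

```python
# The example sentence with only the tags that matter for the scan.
tokens = [
    ("Dün", {"adv"}), ("benim", {"prn", "pers"}), ("için", {"post"}),
    ("aldığın", {"v", "gpr_past"}), ("birayı", {"n", "acc"}),
    ("içeceğim", {"v", "fut"}), (".", {"sent"}),
]


def scan_right(start, wanted, barrier=None):
    """Return the 1-based position of the first token to the right of `start`
    that carries `wanted`. Return None if a token carrying `barrier` is
    reached first, or if the sentence ends without a match."""
    for pos in range(start + 1, len(tokens) + 1):
        tags = tokens[pos - 1][1]
        if wanted in tags:
            return pos
        if barrier is not None and barrier in tags:
            return None
    return None


print(scan_right(1, "v", "v"))     # Dün: first verb to the right is aldığın
print(scan_right(2, "v", "v"))     # benim: also aldığın
print(scan_right(5, "fut"))        # birayı: the finite verb içeceğim
print(scan_right(1, "fut", "v"))   # blocked: aldığın is a verb but not finite
```

The last call shows what the barrier is for: the scan gives up at the first verb instead of skipping over it to reach a later one.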
Lexical transfer
The first stage of transfer is lexical transfer. This is where we take the tree that we have just constructed, and we translate the words, and convert the morphological information into attributes in the tree. This is done using an lttoolbox dictionary in a form that will be familiar to those who have used Apertium before. There are however a number of differences which will be explained below.
In any case, first we need to change directory to matxin-tur-eng and make a new file, matxin-tur-eng.tur-eng.dix. In this file we start by defining our attributes:
<dictionary>
  <sdefs>
    <sdef n="mi" c="Morphological information"/>
    <sdef n="pos" c="Part of speech"/>
    <sdef n="nbr" c="Number"/>
    <sdef n="prs" c="Person"/>
    <sdef n="tns" c="Tense"/>
  </sdefs>
</dictionary>
The attributes are defined by <sdef> tags, which stand for symbol (in this case an attribute) definition. After this we can start by adding a new section for our first entry:
<section id="main" type="standard">
</section>
And within that section, our entry:
<e><p><l>dün<s n="mi"/>adv</l><r>yesterday<s n="pos"/>adv</r></p></e>
So now we can save this file and compile it:
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 14 13
Note that the lr means to compile the dictionary (which is in fact a finite-state transducer) from left to right, translating from Turkish to English. Unlike in Apertium, in Matxin the bilingual dictionaries are unidirectional.
You can try it out using the matxin-xfer-lex command:
$ cat input.txt | cg-proc -f2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
<SENTENCE ref="1" alloc="0">
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
  <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="bira" mi="n|acc" unknown="transfer">
    <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
      <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
      <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
        <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="için" mi="post" unknown="transfer"/>
      </NODE>
    </NODE>
  </NODE>
  <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." mi="sent" unknown="transfer"/>
</NODE>
</SENTENCE>
</corpus>
You'll note that the XML output has substantially changed. Some of the attributes have been renamed, and some new attributes have been created. Probably the most obvious thing is that for all of the words except dün "yesterday", there is a new attribute unknown with the value transfer. This means that the word could not be found in the bilingual dictionary. This is not surprising as we only added one word. How about the other attributes?
- mi → smi: Morphological information is now "source morphological information"
- lem → slem: Lemma is now "source lemma"
- UpCase: To do with getting casing right
- mi: This is copied from the smi
- lem: With unknown words this is copied from the source, otherwise the translation is inserted.
- pos: This is the attribute that we defined in our bilingual dictionary.
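The renaming can be summarised in a few lines of Python. This is a toy imitation of what we can see in the output above, not the actual behaviour of matxin-xfer-lex in general: it leaves out alloc and UpCase, and BIDIX and xfer_lex are names made up for this sketch.

```python
# Toy bilingual dictionary: (lemma, source mi) -> (target lemma, target attributes).
BIDIX = {("dün", "adv"): ("yesterday", {"pos": "adv"})}


def xfer_lex(node):
    """Imitate the attribute renaming described above for a single NODE."""
    out = {"ref": node["ord"], "slem": node["lemma"], "smi": node["mi"],
           "si": node["si"]}
    entry = BIDIX.get((node["lemma"], node["mi"]))
    if entry is None:
        # Unknown word: copy the source lemma and mi, and flag the node.
        out.update(lem=node["lemma"], mi=node["mi"], unknown="transfer")
    else:
        lem, attrs = entry
        out.update(lem=lem, **attrs)
    return out


known = xfer_lex({"ord": "1", "lemma": "dün", "mi": "adv", "si": "advmod"})
unknown = xfer_lex({"ord": "3", "lemma": "için", "mi": "post", "si": "case"})
print(known)
print(unknown)
```

Once an entry for için is added to the dictionary, the second node would take the same path as the first and lose its unknown attribute.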
So with that in mind we can start translating the other words:
<e><p><l>için<s n="mi"/>post</l><r>for<s n="pos"/>pr</r></p></e>
<e><p><l>.<s n="mi"/>sent</l><r>.<s n="pos"/>sent</r></p></e>
These two are easy: they work just like the adverb. For categories that inflect, however, things get a touch more complex, as we need to convert the morphological information into feature attributes in the XML. We do this using paradigms.
Paradigms
Structural transfer
Generation
lttoolbox | hfst