Matxin New Language Pair HOWTO
This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.
Preliminaries
Make a directory called matxin-tur-eng. Then make two more directories, matxin-tur and matxin-eng.
Note that if you are doing this howto for your own language, then tur should be the ISO 639-3 language code of the source language and eng should be the ISO 639-3 code of the target language.
Analysis
There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO, but for more information, check out the following pages:
So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Save this output into a file, perhaps called input.txt. We'll need it later.
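If you want to inspect the analyser output programmatically, the stream format can be read with a small script. This is only a minimal sketch: it assumes one reading per word (as in our pre-disambiguated input), and the names UNIT and parse_stream are made up for illustration and are not part of Matxin or Apertium.

```python
import re

# One lexical unit looks like ^surface/lemma<tag1><tag2>$
UNIT = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")


def parse_stream(line):
    """Return a list of (surface, lemma, [tags]) triples, one per word."""
    tokens = []
    for surface, lemma, tags in UNIT.findall(line):
        tokens.append((surface, lemma, re.findall(r"<([^>]+)>", tags)))
    return tokens


LINE = (
    "^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ "
    "^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ "
    "^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$"
)

TOKENS = parse_stream(LINE)
for surface, lemma, tags in TOKENS:
    print(surface, lemma, "|".join(tags))
```

Ambiguous units with several readings separated by / would need a more careful parser than this.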
Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:
DELIMITERS = "." ;
Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.
LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST N = n ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms
Etc. It might seem a bit odd to just be repeating the tag in title case, but the point is that it allows us to distinguish CG tags and tag groups from those in the input morphological analyser.
So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.
After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:
SECTION
In constraint grammar, all rules come in sections.
So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:
MAP @advmod TARGET Adv ;
The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.
So now let's save the file and try it out! First though we need to compile the rules:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
And now try it out:
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Great, it works. Now we can write basically the same rule for the verbal adjective, aldığın, which should get an @acl tag, the postposition, which should get a @case tag, and the accusative, which should get a @dobj tag.
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
Save it and try it again:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
Great, now for a couple of harder relations: the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:
MAP @nmod TARGET Pers IF (1 Post) ;
And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
So try those two rules out and we should have a fully labelled input sentence:
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$
That covers our sentence, but we want to add one more rule "just in case": a default rule that adds the relation @dep to any token that isn't covered by the other rules:
MAP (@dep) TARGET (*) ;
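To make the logic of the mapping rules explicit, here is a rough Python sketch that walks the example sentence and tries the rules in the same order as they appear in the file. It is not how CG works internally and it is no substitute for cg-proc; the function names are invented for illustration, and it assumes that a word keeps the first relation that gets mapped to it.

```python
# (lemma, tags) for each word of the analysed example sentence.
SENTENCE = [
    ("dün", {"adv"}),
    ("ben", {"prn", "pers", "p1", "sg", "gen"}),
    ("için", {"post"}),
    ("al", {"v", "tv", "gpr_past", "px2sg"}),
    ("bira", {"n", "acc"}),
    ("iç", {"v", "tv", "fut", "p1", "sg"}),
    (".", {"sent"}),
]

FIN = {"fut", "aor", "past"}  # mirrors LIST Fin


def is_fin(tags):
    """True if the word carries one of the finite verb tags."""
    return bool(tags & FIN)


def label(sentence):
    """Return one relation per word, trying the rules in file order."""
    labels = []
    for i, (_, tags) in enumerate(sentence):
        nxt = sentence[i + 1][1] if i + 1 < len(sentence) else set()
        fin_to_left = any(is_fin(t) for _, t in sentence[:i])
        if "adv" in tags:
            rel = "@advmod"
        elif "post" in tags:
            rel = "@case"
        elif "gpr_past" in tags:
            rel = "@acl"
        elif "acc" in tags:
            rel = "@dobj"
        elif "sent" in tags:
            rel = "@punct"
        elif {"prn", "pers"} <= tags and "post" in nxt:
            rel = "@nmod"  # IF (1 Post)
        elif is_fin(tags) and not fin_to_left and "sent" in nxt:
            rel = "@root"  # IF (NOT -1* Fin) (1 Sent)
        else:
            rel = "@dep"  # default rule
        labels.append(rel)
    return labels


LABELS = label(SENTENCE)
print(LABELS)
```

Note that gpr_past does not count as finite here, just as in the CG lists: tags are compared as whole strings, so gpr_past is not the same tag as past.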
Tree building
Now that we have a fully labelled sentence, we can start building the tree. First make another section:
SECTION
The first thing we want to do is attach the root node:
SETPARENT @root TO (@0 (*)) ;
This basically says, set the parent of the root node to node 0, which is the invisible root that CG uses. Next we want to attach all of the rest of the nodes to this root node:
SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;
Here we have an extra condition, (NEGATE p (*)), which means that the matched node should not already have a parent.
Let's test the rules, so save the file, and go to the terminal:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/>
      <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
      <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"/>
      <NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>
Note how here we have passed the -f 2 parameter to the cg-proc program, so now the output is in Matxin XML format.
This is a pretty sad looking tree :( ... But fortunately using rules we can make it a happy tree! One that correctly implements the annotation guidelines.
So, let's think of some rules:
- The direct object should depend on the finite verb
- A postposition should depend on its head
- A relative clause should modify (depend on) a noun
- An adverb should modify a verb
Let's start with the first one:
SETPARENT @dobj TO (1* Fin) ;
This rule says that the parent of the direct object should be a finite verb found anywhere to the right. Next up:
SETPARENT @case TO (-1 Pers) ;
Set the parent of the word with the @case label to be the previous personal pronoun. And then:
SETPARENT @acl TO (1 N) ;
Set the parent of the word with the @acl label to be the following noun.
Let's save the file and apply the rules:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
        <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
      </NODE>
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>
This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase benim için and the adverb Dün to the appropriate verb, which in this case is the head of the relative clause aldığın.
So we could specify the rule something like: Set the parent of a nominal modifier or an adverb to be the first verb to the right. This rule happens to work in this case, but is not very robust.
SETPARENT @advmod TO (1* V BARRIER V) ;
SETPARENT @nmod TO (1* V BARRIER V) ;
The 1* X BARRIER Y instruction here means that the parser should read to the right looking to match context X (a verb), but it should stop if it finds context Y (in this case another verb).
We can test these rules and see the output:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl">
          <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
          <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
            <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>
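The attachment rules can also be paraphrased in a few lines of Python, which may help when checking by hand which parent a word should end up with. This is a sketch under simplifying assumptions (one sentence, one reading per word, and a later attachment simply overriding the default one); build_tree and first_to_right are hypothetical helper names, not Matxin or CG functions.

```python
FIN = {"fut", "aor", "past"}  # mirrors LIST Fin

# (tags, relation) for each word, in sentence order; ids are 1-based.
TOKENS = [
    ({"adv"}, "@advmod"),
    ({"prn", "pers", "p1", "sg", "gen"}, "@nmod"),
    ({"post"}, "@case"),
    ({"v", "tv", "gpr_past", "px2sg"}, "@acl"),
    ({"n", "acc"}, "@dobj"),
    ({"v", "tv", "fut", "p1", "sg"}, "@root"),
    ({"sent"}, "@punct"),
]


def first_to_right(tokens, i, wanted):
    """Return the id of the first token after i whose tags satisfy wanted."""
    for j in range(i + 1, len(tokens) + 1):
        if wanted(tokens[j - 1][0]):
            return j
    return None


def build_tree(tokens):
    """Return a dict mapping each token id to its parent id (0 = invisible root)."""
    n = len(tokens)
    root = next(i for i, (_, rel) in enumerate(tokens, 1) if rel == "@root")
    # Default attachment: the root hangs off node 0, everything else off the root.
    parent = {i: (0 if i == root else root) for i in range(1, n + 1)}
    for i, (tags, rel) in enumerate(tokens, 1):
        target = None
        if rel == "@dobj":  # TO (1* Fin)
            target = first_to_right(tokens, i, lambda t: bool(t & FIN))
        elif rel == "@case" and i > 1 and {"prn", "pers"} <= tokens[i - 2][0]:
            target = i - 1  # TO (-1 Pers)
        elif rel == "@acl" and i < n and "n" in tokens[i][0]:
            target = i + 1  # TO (1 N)
        elif rel in ("@advmod", "@nmod"):  # TO (1* V BARRIER V)
            target = first_to_right(tokens, i, lambda t: "v" in t)
        if target is not None:
            parent[i] = target
    return parent


PARENTS = build_tree(TOKENS)
print(PARENTS)
```

Searching for the first verb to the right and stopping there has the same effect as the BARRIER V condition in the two rules above.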
Yesss! Now we have a nice tree ready to be translated!
Lexical transfer
The first stage of transfer is lexical transfer. This is where we take the tree that we have just constructed, and we translate the words, and convert the morphological information into attributes in the tree. This is done using an lttoolbox dictionary in a form that will be familiar to those who have used Apertium before. There are however a number of differences which will be explained below.
In any case, first we need to change directory to matxin-tur-eng and to make a new file, matxin-tur-eng.tur-eng.dix. In this file we start by defining our attributes:
<dictionary>
  <sdefs>
    <sdef n="mi" c="Morphological information"/>
    <sdef n="pos" c="Part of speech"/>
    <sdef n="nbr" c="Number"/>
    <sdef n="prs" c="Person"/>
    <sdef n="cas" c="Case"/>
    <sdef n="tns" c="Tense"/>
  </sdefs>
</dictionary>
The attributes are defined by <sdef> tags, which stands for symbol (in this case an attribute) definition. After this we can start by adding a new section for our first entry:
<section id="main" type="standard"> </section>
And within that section, our entry:
<e><p><l>dün<s n="mi"/>adv</l><r>yesterday<s n="pos"/>adv</r></p></e>
So now we can save this file and compile it:
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 14 13
Note that the lr means to compile the dictionary (which is in fact a finite-state transducer) from left to right, translating from Turkish to English. Unlike in Apertium, in Matxin the bilingual dictionaries are unidirectional.
You can try it out using the matxin-xfer-lex command:
$ cat input.txt | cg-proc -f2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="bira" mi="n|acc" unknown="transfer">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="için" mi="post" unknown="transfer"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." mi="sent" unknown="transfer"/>
    </NODE>
  </SENTENCE>
</corpus>
You'll note that the XML output has substantially changed. Some of the attributes have been renamed, and some new attributes have been created. Probably the most obvious thing is that for all of the words except dün "yesterday", there is a new attribute unknown with the value transfer. This means that the word could not be looked up in the bilingual dictionary, which is not surprising as we only added one word. How about the other attributes?
- mi → smi: Morphological information is now "source morphological information"
- lem → slem: Lemma is now "source lemma"
- UpCase: To do with getting casing right
- mi: This is copied from the smi
- lem: With unknown words this is copied from the source, otherwise the translation is inserted.
- pos: This is the attribute that we defined in our bilingual dictionary.
So with that in mind we can start translating the other words:
<e><p><l>için<s n="mi"/>post</l><r>for<s n="pos"/>pr</r></p></e>
<e><p><l>.<s n="mi"/>sent</l><r>.<s n="pos"/>sent</r></p></e>
These two are easy, they work just like the adverb. For categories that inflect however, things get a touch more complex, as we need to convert the morphological information into feature attributes in the XML. We do this using paradigms.
Paradigms
Paradigms are used to convert strings of input tags into features for the tree. The section they belong in goes after the sdefs section and before the section id="main". Let's start by looking at nouns:
<pardefs>
  <pardef n="n__n">
    <e><p><l>|nom</l><r><s n="cas"/>nom</r></p></e>
    <e><p><l>|acc</l><r><s n="cas"/>acc</r></p></e>
  </pardef>
</pardefs>
This paradigm converts the strings |nom and |acc into the attribute cas with the values of nom and acc respectively.
Now that we've defined the paradigm, we can try using it. Go back to the main section, and add:
<e><p><l>bira<s n="mi"/>n</l><r>beer<s n="pos"/>n</r></p><par n="n__n"/></e>
Then compile the dictionary:
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 38 41
And test it:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
    </NODE>
  </SENTENCE>
</corpus>
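As we move on to verb forms, it's worth noting that paradigms can call other paradigms. For example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms as follows. They use two further attributes, vtype and val, so these need sdef entries in the sdefs section as well: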
<pardef n="num">
  <e><p><l>|sg</l><r><s n="nbr"/>sg</r></p></e>
  <e><p><l>|pl</l><r><s n="nbr"/>pl</r></p></e>
</pardef>
<pardef n="pers">
  <e><p><l>|p1</l><r><s n="prs"/>p1</r></p><par n="num"/></e>
  <e><p><l>|p2</l><r><s n="prs"/>p2</r></p><par n="num"/></e>
  <e><p><l>|p3</l><r><s n="prs"/>p3</r></p><par n="num"/></e>
</pardef>
<pardef n="tense">
  <e><p><l>|fut</l><r><s n="vtype"/>fin<s n="tns"/>fut</r></p><par n="pers"/></e>
</pardef>
<pardef n="tv__v">
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
</pardef>
Then in the main section:
<e><p><l>iç<s n="mi"/>v</l><r>drink<s n="pos"/>v</r></p><par n="tv__v"/></e>
<e><p><l>al<s n="mi"/>v</l><r>buy<s n="pos"/>v</r></p><par n="tv__v"/></e>
This will take care of the finite verb, içeceğim "I will drink", but what do we do with the non-finite verb aldığın "that you bought"? The structure is very different in Turkish and English, and although changing the structure is part of structural transfer, we need to get the attributes in order to enable us to properly do transfer. So in this case what we can do is have a paradigm setup something like:
<pardef n="px__pers">
  <e><p><l>|px1sg</l><r><s n="prs"/>p1<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px2sg</l><r><s n="prs"/>p2<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px3sg</l><r><s n="prs"/>p3<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px1pl</l><r><s n="prs"/>p1<s n="nbr"/>pl</r></p></e>
  <e><p><l>|px2pl</l><r><s n="prs"/>p2<s n="nbr"/>pl</r></p></e>
  <e><p><l>|px3pl</l><r><s n="prs"/>p3<s n="nbr"/>pl</r></p></e>
</pardef>
<pardef n="nonfin">
  <e><p><l>|gpr_past</l><r><s n="vtype"/>gpr<s n="tns"/>past</r></p><par n="px__pers"/></e>
</pardef>
And then we update the previously defined tv__v paradigm as follows:
<pardef n="tv__v">
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="nonfin"/></e>
</pardef>
So, what does the code above do? We turn the non-finite style agreement markers (using the possessive morpheme, px1sg, px2sg, etc.) into finite agreement (p1.sg, p2.sg, etc.), and we set the verb type to verbal adjective, gpr. Then in the structural transfer, we will be able to match the verb type attribute when it is gpr and use the other attributes to construct a finite relative clause in English.
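If it helps to see the net effect of the chained verb paradigms in one place, the following Python sketch maps a source smi string to the attributes we expect in the tree. It is only an illustration: TAG_TO_FEATURES and to_attributes are invented names, and a flat lookup table ignores the ordering that the compiled transducer enforces (for example that the tense tag must come before the person tag).

```python
# Each source tag contributes one or more attribute/value pairs,
# following the num, pers, tense, nonfin and px__pers paradigms.
TAG_TO_FEATURES = {
    "tv": {"val": "tv"},
    "fut": {"vtype": "fin", "tns": "fut"},
    "gpr_past": {"vtype": "gpr", "tns": "past"},
    "sg": {"nbr": "sg"},
    "pl": {"nbr": "pl"},
}
for person in ("p1", "p2", "p3"):
    TAG_TO_FEATURES[person] = {"prs": person}
    for number in ("sg", "pl"):
        # px1sg, px2sg, ... become finite-style person and number
        TAG_TO_FEATURES["px" + person[1] + number] = {"prs": person, "nbr": number}


def to_attributes(smi):
    """Convert e.g. 'v|tv|fut|p1|sg' into a dict of target attributes."""
    pos, *tags = smi.split("|")
    attrs = {"pos": pos}
    for tag in tags:
        attrs.update(TAG_TO_FEATURES[tag])
    return attrs


print(to_attributes("v|tv|fut|p1|sg"))
print(to_attributes("v|tv|gpr_past|px2sg"))
```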
Let's save the file and compile and test again...
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 75 86
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
    </NODE>
  </SENTENCE>
</corpus>
Looking good, now there is only one remaining item, the personal pronoun ben "I".
Personal pronouns can be quite idiosyncratic to translate between different languages, so we're just going to add a new paradigm for it:
<pardef n="ben__I">
  <e><p><l>|p1|sg|nom</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>nom</r></p></e>
  <e><p><l>|p1|sg|acc</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>acc</r></p></e>
  <e><p><l>|p1|sg|gen</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>gen</r></p></e>
</pardef>
And then the entry in the main section:
<e><p><l>ben<s n="mi"/>prn|pers</l><r>I<s n="pos"/>prn|pers</r></p><par n="ben__I"/></e>
We set the pos specifically as personal pronoun instead of just pronoun because personal pronouns often need to be treated differently from other pronoun types. The final output we should get is:
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
    </NODE>
  </SENTENCE>
</corpus>
Underspecification
Sometimes a language will underspecify some morphological feature. This often happens with gender or number. In our previous example with bira "beer", the Turkish form without an affix is underspecified for number: it could be singular or plural, for example 1 bira "1 beer", but 5 bira "5 beers".
What should we do about this? One thing we can do is add the feature in lexical transfer, but with a value saying that we don't know what it should be and that it should be dealt with in transfer. Normally we use ND for "number to be determined" and GD for "gender to be determined" (for example, in translating the third person pronoun from Turkish to English we would need to set the gender to be determined in transfer).
We can set the feature like all of the other features, so just go to where you have the n__n paradigm defined, and update it to look like:
<pardef n="n__n">
  <e><p><l>|nom</l><r><s n="nbr"/>ND<s n="cas"/>nom</r></p></e>
  <e><p><l>|acc</l><r><s n="nbr"/>ND<s n="cas"/>acc</r></p></e>
</pardef>
And now we're ready for structural transfer! :D
Structural transfer
In Matxin, structural transfer is done by applying a cascade of rules written in the XSLT programming language. XSLT is basically a complete language for tree transformations. This HOWTO will not give a complete overview of the language... there would be far too much to write. But you can use a search engine to find out about it.
But before we start working on encoding the rules as XSLT, we should get clear what we want to do in terms of linguistics. To help us with that, let's take a look at two trees:
So, given these two trees, we can discover some obvious stuff, like:
- A definite accusative noun in Turkish, birayı, should get a definite article in English.
- The synthetic future in Turkish, içeceğim, should become the analytic "will drink" in English.
- Subject pronouns for both the main clause and the relative clause should be added.
- A relative pronoun should be added to stand in for the direct object in the relative clause headed by aldığın.
- If there is no dependent numeral greater than one, the number of nouns should be set to singular.
Starting out
Let's create a new file called matxin-tur-eng.tur-eng.t1x:
<transfer> </transfer>
Rules are defined as XSLT templates within a def-rule tag, so let's start one:
<def-rule comment="Add definite article if overt accusative"> </def-rule>
The comment is free-form, you can write what you like.
Now let's get to the main part of the rule. We want to match any NODE in the tree where the pos attribute is n and the cas attribute is acc. We specify that we only want to match nouns because in Turkish pronouns may also be in the accusative (in which case we don't want to add a definite article), and complement clauses are also often marked with accusative.
Matching in XSLT is done with XPath, a way of finding node sets in a tree. Expressing the pattern above:
<template match="//NODE[@pos = 'n' and @cas = 'acc']"> </template>
Attributes are referred to with the prefix @, the = is equal-to and not assignment, and // means search the whole tree. So, what do we want to do when we've found this set of nodes? We basically want to copy the nodes and add a new dependent node for the definite article:
<template match="//NODE[@pos = 'n' and @cas = 'acc']">
  <copy>
    <apply-templates select="@* | *"/>
    <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
  </copy>
</template>
This is one pattern for working with transfer in Matxin. The copy directive means that the output tree will have the nodes that are in the source tree (by applying the apply-templates instruction), and in addition it will have a subnode which is the definite article. In the apply-templates instruction, the select="@* | *" part means that it should be applied to all attributes, @*, and also to all subnodes, *.
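For readers who are more at home in a general-purpose language than in XSLT, the same "copy everything and append one child" idea can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration of the pattern only, on a cut-down tree with most attributes omitted; it is not how matxin-transfer is implemented, and add_definite_articles is a made-up name.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the lexically transferred tree.
XML = """<SENTENCE ref="1">
  <NODE ref="6" lem="drink" pos="v">
    <NODE ref="5" lem="beer" pos="n" cas="acc"/>
    <NODE ref="2" lem="I" pos="prn|pers" cas="acc"/>
  </NODE>
</SENTENCE>"""


def add_definite_articles(root):
    """Append a determiner child to every accusative noun NODE."""
    # list() takes a snapshot, so nodes added in the loop are not revisited.
    for node in list(root.iter("NODE")):
        if node.get("pos") == "n" and node.get("cas") == "acc":
            ET.SubElement(
                node,
                "NODE",
                {"si": "det", "lem": "the", "pos": "det|def", "nbr": "sp"},
            )
    return root


ROOT = add_definite_articles(ET.fromstring(XML))
print(ET.tostring(ROOT, encoding="unicode"))
```

As with the XPath pattern, the accusative pronoun is left alone because its pos is not n.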
Let's save the rule, and compile it:
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
1 rules processed.
And then we can apply the rule using:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
          <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>
We can use a very similar rule to get the auxiliary verb "will" added when the verb is finite and in future tense:
<def-rule comment="Add auxiliary verb 'will' if the head verb is finite and in the future tense">
  <template match="//NODE[@pos = 'v' and @vtype = 'fin' and @tns = 'fut']">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE si="aux" lem="will" pos="vaux" tns="pres"/>
    </copy>
  </template>
</def-rule>
Save the file and compile the rules again:
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
2 rules processed.
So, let's see what our Turkish tree looks like now:
Propagating information between nodes
It's looking good. Now let's try adding the relative pronoun. This is slightly more complicated, as we want the function of the relative pronoun in relation to the head of its clause to be the same as the function of the word it modifies in relation to the matrix clause.[1] In XPath we can use .. to refer to the parent node.
<def-rule comment="Add a relative pronoun if the verb type is relative clause">
  <template match="//NODE[@pos = 'v' and @vtype = 'gpr']">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE><attr name="si"><value-of select="../@si"/></attr>
            <attr name="lem">that</attr>
            <attr name="pos">rel</attr>
            <attr name="ani">an</attr>
            <attr name="nbr">sp</attr></NODE>
    </copy>
  </template>
</def-rule>
The attributes are used for generation. ani is for animacy: some relative pronouns in English can only be used with animates (e.g. "who") and others with both animates and inanimates (e.g. "that", "which"). The sp value for nbr means that it is the same form for singular and plural.
Conditional statements
Now there are only two words that need to be added, the subject pronouns. This is slightly more complicated because although we can inherit the person and number from the verb, in English the lemma is going to be different depending on the person and number. But never fear, choose, when is here! Basically choose, when, [otherwise] works like if, else if, [else] in other programming languages. As can be seen from the following example, you have a condition test= that basically is equivalent to an XPath expression.
<def-rule comment="Add subject pronouns to clauses that do not have them">
  <template match="//NODE[@pos = 'v' and not(.//NODE[@si = 'nsubj'])]">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE><attr name="si">nsubj</attr>
            <choose>
              <when test="@prs = 'p1' and @nbr = 'sg'">
                <attr name="lem">I</attr>
              </when>
              <when test="@prs = 'p2' and @nbr = 'sg'">
                <attr name="lem">you</attr>
              </when>
            </choose>
            <attr name="pos">prn|pers</attr>
            <attr name="prs"><value-of select="@prs"/></attr>
            <attr name="nbr"><value-of select="@nbr"/></attr>
            <attr name="cas">nom</attr></NODE>
    </copy>
  </template>
</def-rule>
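To see the choose/when logic without the XML around it, here is the same decision written as a plain Python if/elif chain. It is a sketch for comparison only; subject_pronoun is a hypothetical helper, and like the rule above it only knows about first and second person singular, so other combinations get no lem attribute at all.

```python
def subject_pronoun(verb):
    """Build the attributes of a subject pronoun node from a verb's attributes."""
    node = {"si": "nsubj"}
    # choose / when: pick the lemma from person and number
    if verb.get("prs") == "p1" and verb.get("nbr") == "sg":
        node["lem"] = "I"
    elif verb.get("prs") == "p2" and verb.get("nbr") == "sg":
        node["lem"] = "you"
    # the remaining attributes are inherited from the verb or fixed
    node["pos"] = "prn|pers"
    node["prs"] = verb.get("prs")
    node["nbr"] = verb.get("nbr")
    node["cas"] = "nom"
    return node


print(subject_pronoun({"lem": "drink", "prs": "p1", "nbr": "sg"}))
print(subject_pronoun({"lem": "buy", "prs": "p2", "nbr": "sg"}))
```

An otherwise branch (a final else here) would be the natural place to handle the person and number combinations that are not covered yet.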
Now let's compile and test:
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
4 rules processed.
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
          <NODE si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/>
          <NODE si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg"/>
        </NODE>
        <NODE lem="the" pos="det|def" mi="sp"/>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      <NODE si="aux" lem="will" pos="vaux" mi="pri"/>
      <NODE si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg"/>
    </NODE>
  </SENTENCE>
</corpus>
And looking at the tree:
Our tree is looking pretty English now! The only thing left to do is to correctly set the number of the noun to singular:
<def-rule comment="Set number of nouns with ND and no dependent numeral to singular">
  <template match="//NODE[@pos = 'n' and not(.//NODE[@pos = 'num'])]">
    <copy>
      <attr name="nbr">sg</attr>
      <copy-of select="@*[name()!='nbr'] | *"/>
    </copy>
  </template>
</def-rule>
and set the case of the pronoun with a dependent preposition to accusative:
<def-rule comment="Set the case of personal pronouns with prepositions to accusative">
  <template match="//NODE[@pos = 'prn|pers' and .//NODE[@pos = 'pr']]">
    <copy>
      <attr name="cas">acc</attr>
      <copy-of select="@*[name()!='cas'] | *"/>
    </copy>
  </template>
</def-rule>
These rules show a good pattern for changing the value of a given attribute: we first add the new attribute, and then we copy all attributes apart from the one that we've just added.
Now we have two more steps, first we need to reorder the tree, which we call linearisation, and then we need to generate the word forms in English using a morphological generator. First onto linearisation...
Reordering and linearisation
Generation
In Matxin, generation is done by the program matxin-generate, which takes two arguments: a file with a cascade of stylesheets and a compiled finite-state transducer. The cascade is used to organise the attributes of the XML into feature-strings suitable to be passed to the finite-state transducer to generate the morphological forms.
Morphological dictionary
So, what does a morphological dictionary look like? Again, that is mostly outside of the scope of this howto, but for the sake of ease of copy/paste, let's go through it here, taking an lttoolbox dictionary as an example.
To start with, we change into the directory matxin-eng and we create a new file matxin-eng.eng.dix. The file will have the skeleton structure:
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="mi"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
  </section>
</dictionary>
This structure will seem familiar if you read the section on lexical transfer (if you didn't read it, it's up there). Instead of translating between source words and target words, the morphological generator translates between lexical forms (combinations of lemmas and tags) and surface forms. Let's take a look at the words we need to generate:
Lemma | POS | Forms |
---|---|---|
beer | <n> | beer, beers |
buy | <v> | buy, buys, bought, bought |
drink | <v> | drink, drinks, drank, drunk |
for | <pr> | for |
I | <prn> | I, me |
the | <det> | the |
will | <vaux> | will, would |
yesterday | <adv> | yesterday |
you | <prn> | you, you |
Given these words there isn't much paradigmatically that we can do: each word needs a separate paradigm. So let's just start with the noun "beer". The paradigm is going to be:
<pardef n="beer__n">
  <e><p><l></l><r><s n="mi"/>n|sg</r></p></e>
  <e><p><l>s</l><r><s n="mi"/>n|pl</r></p></e>
</pardef>
and then the entry in the main section:
<e lm="beer"><i>beer</i><par n="beer__n"/></e>
Save the dictionary and go to the terminal. You can do two things. First, compile the dictionary using:
$ lt-comp rl matxin-eng.eng.dix eng.autogen.bin
main@standard 14 14
Second, print out all the strings recognised by the dictionary using lt-expand:
$ lt-expand matxin-eng.eng.dix
beer:beer<mi>n|sg
beers:beer<mi>n|pl
The remainder of the vocabulary is left as an exercise for the reader.
Generation rules
Generation rules take a node and its attributes and produce a new attribute, mi, that has all the information necessary to pass to the morphological generator. They are written in the same XSLT format as the transfer rules.
We start out with a file in the matxin-eng directory, let's call it matxin-eng.eng.gnx:
<generate> </generate>
Then we add a rule to generate the mi attribute for nouns:
<def-rule comment="Generate the morphological information for nouns">
  <template match="//NODE[@pos = 'n']">
    <copy>
      <attr name="mi"><value-of select="concat(@pos, '|', @nbr)"/></attr>
      <copy-of select="@*[name()!='mi'] | *"/>
    </copy>
  </template>
</def-rule>
This rule basically says: match all nouns, //NODE[@pos = 'n'], and create a new attribute, mi, which is the concatenation of the attribute pos, the string literal |, and the attribute nbr. The result of this concatenation for the node containing lem="beer" pos="n" nbr="sg" will be mi="n|sg".
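The concatenation, and the way its result lines up with the strings listed by lt-expand, can be checked with a few lines of Python. The lookup table below is copied from the lt-expand output for "beer"; how matxin-generate really builds its lookup string is an assumption made here purely for illustration, and make_mi and generate are invented names.

```python
# Lexical form -> surface form, taken from the lt-expand output for "beer".
GENERATOR = {
    "beer<mi>n|sg": "beer",
    "beer<mi>n|pl": "beers",
}


def make_mi(node):
    """Same as concat(@pos, '|', @nbr) in the generation rule."""
    return node["pos"] + "|" + node["nbr"]


def generate(node):
    """Look up the surface form; the key layout is assumed, not documented."""
    key = node["lem"] + "<mi>" + make_mi(node)
    return GENERATOR.get(key)


NODE = {"lem": "beer", "pos": "n", "nbr": "sg"}
print(make_mi(NODE))
print(generate(NODE))
```

A noun that still has nbr="ND" at this point would produce n|ND, which is not in the dictionary, so the number has to be resolved in transfer before generation.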
If we save the file and compile it:
$ matxin-preprocess-generate matxin-eng.eng.gnx eng.gnx.bin
1 rules processed.
We can now test it in the whole pipeline. First switch directory to matxin-tur-eng, then:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin |\
  LLL | matxin-generate ../matxin-eng/eng.gnx.bin ../matxin-eng/eng.autogen.bin
Notes
- ↑ Consider in English the difference between:
They congratulated the girl that (nsubj) graduated yesterday. and
They drank the beer that (dobj) she bought yesterday.