Difference between revisions of "Matxin New Language Pair HOWTO"
(→Paradigms: add an accidentally ommitted word) |
|||
(55 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
''This page refers to Matxin 2.0, for Matxin 1.0, see: [[Matxin 1.0 New Language Pair HOWTO]]'' |
|||
{{TOCD}} |
{{TOCD}} |
||
Line 26: | Line 27: | ||
Save this output into a file, perhaps called <code>input.txt</code>. We'll need it later. |
Save this output into a file, perhaps called <code>input.txt</code>. We'll need it later. |
||
Now go into the <code>matxin-tur</code> directory, and create a file <code> |
Now go into the <code>matxin-tur</code> directory, and create a file <code>matxin-tur.tur.deprlx</code>. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing: |
||
<pre> |
<pre> |
||
Line 96: | Line 97: | ||
</pre> |
</pre> |
||
Great, it works... so now we can do |
Great, it works... so now we can do basically the same rule for the verbal adjective, ''aldığın'' which should get an <code>@acl</code> tag, the postposition which should get a <code>@case</code> tag and the accusative which should get a <code>@dobj</code> tag. |
||
<pre> |
<pre> |
||
Line 111: | Line 112: | ||
Sections: 1, Rules: 5, Sets: 12, Tags: 29 |
Sections: 1, Rules: 5, Sets: 12, Tags: 29 |
||
$ cat |
$ cat input.txt | cg-proc tur.deprlx.bin |
||
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ |
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ |
||
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$ |
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$ |
||
Line 165: | Line 166: | ||
Let's test the rules, so save the file, and go to the terminal: |
Let's test the rules, so save the file, and go to the terminal: |
||
{| |
|||
<pre> |
|||
|<pre> |
|||
$ cat input.txt | cg-proc -f 2 /tmp/tur.bin |
|||
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin |
|||
<corpus> |
<corpus> |
||
<SENTENCE ord="1" alloc="0"> |
<SENTENCE ord="1" alloc="0"> |
||
<NODE ord="6" alloc="0" form="içeceğim" lem"iç" mi="v|tv|fut|p1|sg" si="root"> |
<NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root"> |
||
<NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/> |
<NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/> |
||
<NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/> |
<NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/> |
||
Line 179: | Line 181: | ||
</SENTENCE> |
</SENTENCE> |
||
</corpus> |
</corpus> |
||
</pre> ||align="right"| [[File:Tur-parse-1.svg|thumb|400px|right]] |
|||
</pre> |
|||
|} |
|||
Note how here we have passed <code>-f 2</code> parameter to the <code>cg-proc</code> program, so now it is output in [[Matxin]] XML format. |
Note how here we have passed <code>-f 2</code> parameter to the <code>cg-proc</code> program, so now it is output in [[Matxin]] XML format. |
||
Line 214: | Line 217: | ||
Let's save the file and apply the rules: |
Let's save the file and apply the rules: |
||
{| |
|||
<pre> |
|||
|<pre> |
|||
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin |
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin |
||
<corpus> |
<corpus> |
||
<SENTENCE ord=" |
<SENTENCE ord="1" alloc="0"> |
||
<NODE ord="6" alloc="0" form="içeceğim" |
<NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root"> |
||
<NODE ord="1" alloc="0" form="Dün" |
<NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/> |
||
<NODE ord="2" alloc="0" form="benim" |
<NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"> |
||
<NODE ord="3" alloc="0" form="için" |
<NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/> |
||
</NODE> |
</NODE> |
||
<NODE ord="5" alloc="0" form="birayı" |
<NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"> |
||
<NODE ord="4" alloc="0" form="aldığın" |
<NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/> |
||
</NODE> |
</NODE> |
||
<NODE ord="7" alloc="0" form="." |
<NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/> |
||
</NODE> |
</NODE> |
||
</SENTENCE> |
</SENTENCE> |
||
</corpus> |
</corpus> |
||
</pre> ||align="right"| [[File:Tur-parse-2.svg|thumb|right|300px]] |
|||
|} |
|||
</pre> |
|||
This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase ''benim için'' and the adverb ''Dün'' to the appropriate verb, which in this case is the head of the relative clause ''aldığın''. |
This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase ''benim için'' and the adverb ''Dün'' to the appropriate verb, which in this case is the head of the relative clause ''aldığın''. |
||
Line 247: | Line 251: | ||
We can test these rules and see the output: |
We can test these rules and see the output: |
||
{| |
|||
<pre> |
|||
|<pre> |
|||
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin |
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin |
||
<corpus> |
<corpus> |
||
<SENTENCE ord=" |
<SENTENCE ord="1" alloc="0"> |
||
<NODE ord="6" alloc="0" form="içeceğim" |
<NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root"> |
||
<NODE ord="5" alloc="0" form="birayı" |
<NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"> |
||
<NODE ord="4" alloc="0" form="aldığın" |
<NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"> |
||
<NODE ord="1" alloc="0" form="Dün" |
<NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/> |
||
<NODE ord="2" alloc="0" form="benim" |
<NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"> |
||
<NODE ord="3" alloc="0" form="için" |
<NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/> |
||
</NODE> |
</NODE> |
||
</NODE> |
</NODE> |
||
Line 264: | Line 269: | ||
</SENTENCE> |
</SENTENCE> |
||
</corpus> |
</corpus> |
||
</pre> ||align="right"| [[File:Tur-parse-3.svg|thumb|right|350px]] |
|||
</pre> |
|||
|} |
|||
Yesss! Now we have a nice tree ready to be translated! |
Yesss! Now we have a nice tree ready to be translated! |
||
Line 340: | Line 346: | ||
* <code>mi</code> → <code>smi</code>: Morphological information is now "source morphological information" |
* <code>mi</code> → <code>smi</code>: Morphological information is now "source morphological information" |
||
* <code>lem</code> → <code>slem</code>: Lemma is |
* <code>lem</code> → <code>slem</code>: Lemma is now "source lemma" |
||
* <code>Upcase</code>: To do with getting casing right |
* <code>Upcase</code>: To do with getting casing right |
||
* <code>mi</code>: This is copied from the <code>smi</code> |
* <code>mi</code>: This is copied from the <code>smi</code> |
||
Line 406: | Line 412: | ||
</pre> |
</pre> |
||
As we move onto verb forms, it's worth noting that can call other paradigms, for example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms like: |
As we move onto verb forms, it's worth noting that a paradigm can call other paradigms, for example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms like: |
||
<pre> |
<pre> |
||
Line 527: | Line 533: | ||
</corpus> |
</corpus> |
||
</pre> |
|||
====Underspecification==== |
|||
Sometimes a language underspecifies a morphological feature; this often happens with gender or number. In our previous example with ''bira'' "beer", the Turkish form without an affix is underspecified for number: it could be singular or plural, for example ''1 bira'' "1 beer", but ''5 bira'' "5 beers".
|||
What should we do about this? One thing we can do is add the feature in lexical transfer, but with a value saying that we don't know what it should be and that it should be dealt with in transfer. Normally we use <code>ND</code> for "number to be determined" and <code>GD</code> for "gender to be determined" (for example, when translating the third person pronoun from Turkish to English, we would need to mark the gender as to be determined in transfer).
|||
We can set the feature like all of the other features, so just go to where you have the <code>n__n</code> paradigm defined, and update it to look like: |
|||
<pre> |
|||
<pardef n="n__n"> |
|||
<e><p><l>|nom</l><r><s n="nbr"/>ND<s n="cas"/>nom</r></p></e> |
|||
<e><p><l>|acc</l><r><s n="nbr"/>ND<s n="cas"/>acc</r></p></e> |
|||
</pardef> |
|||
</pre> |
</pre> |
||
Line 550: | Line 571: | ||
* A relative pronoun to stand in place for the direct object in the relative clause headed by ''aldığın''. |
* A relative pronoun to stand in place for the direct object in the relative clause headed by ''aldığın''. |
||
And a couple of non-obvious, morphological things: |
|||
* If there is no dependent numeral greater than one, set the number of nouns to singular. |
|||
* Set the case of a pronoun with a dependent adposition to accusative. |
|||
===Starting out=== |
===Starting out=== |
||
Line 609: | Line 634: | ||
<CHUNK ref="0" type="root"> |
<CHUNK ref="0" type="root"> |
||
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg"> |
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg"> |
||
<NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc"> |
<NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc"> |
||
<NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
<NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
||
<NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
<NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
||
Line 650: | Line 675: | ||
<center> |
<center> |
||
{| |
{| |
||
| [[File:Ex-dep-graph.tur2.svg|thumb|250px]] |
| [[File:Ex-dep-graph.tur2.svg|thumb|250px|Resultant tree after applying our first two transfer rules.]] |
||
|} |
|} |
||
</center> |
</center> |
||
Line 657: | Line 682: | ||
It's looking good, now let's try adding the relative pronoun, this is slightly more complicated as we want the function of the relative |
It's looking good, now let's try adding the relative pronoun, this is slightly more complicated as we want the function of the relative |
||
pronoun in relation to the head of its clause to be the same as the function of the word it modifies in relation to the matrix clause. In Xpath |
pronoun in relation to the head of its clause to be the same as the function of the word it modifies in relation to the matrix clause.<ref>Consider in English the difference between:<br/>''They congratulated the girl '''that<sub>{{sc|nsubj}}</sub>''' graduated yesterday.'' and<br/>''They drank the beer '''that<sub>{{sc|dobj}}</sub>''' she bought yesterday.''</ref> In Xpath we can use <code>..</code> to refer to the parent node. |
||
we can use <code>..</code> to refer to the parent node. |
|||
<pre> |
<pre> |
||
Line 679: | Line 703: | ||
====Conditional statements==== |
====Conditional statements==== |
||
Now there are only two words that need to be added, the subject pronouns. This is slightly more complicated because although we can inherit the person and number from the verb, in English the lemma is going to be different depending on the person and number. But never fear, <code>choose, when</code> is here! Basically <code>choose, when, [otherwise]</code> works like <code>if, else if, [else]</code> in other programming languages. As can be seen from the following example, you have a condition <code>test=</code> that basically is equivalent to an Xpath expression. |
|||
Now there are only two words that need to be added, the subject pronouns. |
|||
<pre> |
<pre> |
||
Line 688: | Line 712: | ||
<NODE><attr name="si">nsubj</attr> |
<NODE><attr name="si">nsubj</attr> |
||
<choose> |
<choose> |
||
<when test="@prs = 'p1' and @nbr ='sg'"> |
<when test="@prs = 'p1' and @nbr = 'sg'"> |
||
<attr name="lem">I</attr> |
<attr name="lem">I</attr> |
||
</when> |
</when> |
||
<when test="@prs = 'p2' and @nbr ='sg'"> |
<when test="@prs = 'p2' and @nbr = 'sg'"> |
||
<attr name="lem">you</attr> |
<attr name="lem">you</attr> |
||
</when> |
</when> |
||
Line 697: | Line 721: | ||
<attr name="pos">prn|pers</attr> |
<attr name="pos">prn|pers</attr> |
||
<attr name="prs"><value-of select="@prs"/></attr> |
<attr name="prs"><value-of select="@prs"/></attr> |
||
<attr name="nbr"><value-of select="@nbr"/></attr |
<attr name="nbr"><value-of select="@nbr"/></attr> |
||
<attr name="cas">nom</attr></NODE> |
|||
</copy> |
</copy> |
||
</template> |
</template> |
||
Line 713: | Line 738: | ||
<SENTENCE ref="1" alloc="0"> |
<SENTENCE ref="1" alloc="0"> |
||
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg"> |
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg"> |
||
<NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc"> |
<NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc"> |
||
<NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
<NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
||
<NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
<NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
||
Line 722: | Line 747: | ||
<NODE si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg"/> |
<NODE si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg"/> |
||
</NODE> |
</NODE> |
||
<NODE lem="the" pos="det|def" mi="sp"/> |
<NODE si="det" lem="the" pos="det|def" mi="sp"/> |
||
</NODE> |
</NODE> |
||
<NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/> |
<NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/> |
||
Line 741: | Line 766: | ||
</center> |
</center> |
||
Our tree is looking pretty English now! |
Our tree is looking pretty English now! The only thing left to do is to correctly set the number of the noun to singular: |
||
<pre> |
|||
<def-rule comment="Set number of nouns with ND and no dependent numeral to singular"> |
|||
<template match="//NODE[@pos = 'n' and not(.//NODE[@pos = 'num'])]"> |
|||
<copy> |
|||
<attr name="nbr">sg</attr> |
|||
<copy-of select="@*[name()!='nbr'] | *"/> |
|||
</copy> |
|||
</template> |
|||
</def-rule> |
|||
</pre> |
|||
and set the case of the pronoun with a dependent preposition to accusative: |
|||
<pre> |
|||
<def-rule comment="Set the case of personal pronouns with prepositions to accusative"> |
|||
<template match="//NODE[@pos = 'prn|pers' and .//NODE[@pos = 'pr']]"> |
|||
<copy> |
|||
<attr name="cas">acc</attr> |
|||
<copy-of select="@*[name()!='cas'] | *"/> |
|||
</copy> |
|||
</template> |
|||
</def-rule> |
|||
</pre> |
|||
These rules show a good pattern for changing the value of a given attribute: we first add the new attribute, and then we copy all attributes apart from the one that we've already added.
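If the XSLT is hard to follow, the same "set one attribute, copy the rest" pattern can be sketched in Python. This is only an illustration of the logic, not part of the Matxin pipeline: the XML fragment is a hand-made toy, and <code>set_attr</code> is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Toy fragment modelled on the transfer output above; not real Matxin output.
XML = """
<NODE lem="drink" pos="v">
  <NODE lem="beer" pos="n" nbr="ND" cas="acc"/>
</NODE>
"""

def set_attr(node, name, value):
    """Rebuild the attributes: the new value first, then every other attribute."""
    others = {k: v for k, v in node.attrib.items() if k != name}
    node.attrib.clear()
    node.set(name, value)
    node.attrib.update(others)

root = ET.fromstring(XML)
for node in root.iter("NODE"):
    # Like .//NODE[@pos = 'num'] in the rule: look at descendants only, not the node itself.
    has_numeral = any(d.get("pos") == "num" for d in node.iter("NODE") if d is not node)
    if node.get("pos") == "n" and not has_numeral:
        set_attr(node, "nbr", "sg")

beer = root.find("NODE")
print(beer.attrib)
```

As in the rule, the old <code>nbr="ND"</code> is dropped because it is excluded from the copy, so the node never ends up with two values for the same attribute.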
|||
Now we have two more steps: first we need to reorder the tree, which we call ''linearisation'', and then we need to generate the word forms in English using a morphological generator. First onto linearisation...
|||
===Reordering and linearisation=== |
===Reordering and linearisation=== |
||
The idea of linearisation is to put an order to the nodes in the tree so that they are ready for generation and printing out. In the Matxin pipeline, the program that does this is called <code>matxin-linearise</code>. It currently takes a four column tab-separated file which specifies the order of heads, dependents and siblings. |
|||
Let's create a new file called <code>matxin-tur-eng.tur-eng.l1x</code>, and add our first line: |
|||
<pre> |
|||
# HEAD DEPENDENT RELPOS ORDER |
|||
si='root' lem='.' .*? x1.x2 |
|||
</pre> |
|||
This says that we want to linearise the node pair consisting of a head with the relation "root" (i.e. the finite verb heading the sentence) and a dependent whose lemma is ".", in the order head followed by dependent. The variable <code>x1</code> stands for the head, and the variable <code>x2</code> stands for the dependent.
|||
If we save this file and test it, we should see that the "." is ordered after the root node (the verb): |
|||
<pre> |
|||
$ cat input.txt | cg-proc -f 2 ../matxin-tur/tur.deprlx.bin | matxin-transfer tur-eng.t1x.bin |\ |
|||
matxin-linearise matxin-tur-eng.tur-eng.l1x |
|||
<corpus> |
|||
<SENTENCE ord="1" ref="1" alloc="0"> |
|||
<NODE ord="10" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg"> |
|||
<NODE ord="7" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc"> |
|||
<NODE ord="5" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
|||
<NODE ord="0" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
|||
<NODE ord="2" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg"> |
|||
<NODE ord="1" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/> |
|||
</NODE> |
|||
<NODE ord="3" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/> |
|||
<NODE ord="4" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom"/> |
|||
</NODE> |
|||
<NODE ord="6" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp"/> |
|||
</NODE> |
|||
<NODE ord="11" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/> |
|||
<NODE ord="8" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri"/> |
|||
<NODE ord="9" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom"/> |
|||
</NODE> |
|||
</SENTENCE> |
|||
</corpus> |
|||
</pre> |
|||
It might be easier to visualise in a flat format: |
|||
<center> |
|||
{|class=wikitable |
|||
! '''Word''' || yesterday || for || me || that || you || bought || the || beer || will || I || drink || . |
|||
|- |
|||
| <code>ord</code> || 0 || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 |
|||
|- |
|||
| <code>pos</code> || <code>adv</code> || <code>pr</code> || <code>prn|pers</code> || <code>rel</code> || <code>prn|pers</code> || <code>v</code> || <code>det</code> || <code>n</code> || <code>vaux</code> || <code>prn|pers</code> || <code>v</code> || <code>sent</code> |
|||
|- |
|||
| <code>si</code> || {{sc|advmod}} || {{sc|case}} || {{sc|nmod}} || {{sc|dobj}} || {{sc|nsubj}} || {{sc|acl}} || {{sc|det}} || {{sc|dobj}} || {{sc|aux}} || {{sc|nsubj}} || {{sc|root}} || {{sc|punct}} |
|||
|- |
|||
|} |
|||
</center> |
|||
The "." receives <code>ord="11"</code> and the root receives <code>ord="10"</code>, i.e. they have been properly ordered with respect to each other. The sentence as a whole is still far from adequately ordered, though...
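If you want to produce this kind of flat view yourself, a few lines of Python are enough. The sketch below is not a Matxin tool: it assumes only that every <code>NODE</code> carries an integer <code>ord</code> attribute, as in the output above, and it uses a shortened, hand-made fragment rather than the full sentence.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-made fragment in the shape of the linearised output.
XML = """
<SENTENCE ord="1">
  <NODE ord="2" lem="drink" si="root">
    <NODE ord="1" lem="beer" si="dobj">
      <NODE ord="0" lem="the" si="det"/>
    </NODE>
    <NODE ord="3" lem="." si="punct"/>
  </NODE>
</SENTENCE>
"""

def flatten(sentence):
    """Return (ord, lem, si) for every NODE, sorted by numeric ord."""
    # Sort on int(ord): as strings, "10" would sort before "2".
    nodes = sorted(sentence.iter("NODE"), key=lambda n: int(n.get("ord")))
    return [(int(n.get("ord")), n.get("lem"), n.get("si")) for n in nodes]

rows = flatten(ET.fromstring(XML))
for ord_, lem, si in rows:
    print(ord_, lem, si, sep="\t")
```

Reading the <code>lem</code> column top to bottom gives the word order that the later stages will see.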
|||
Let's try something a bit more movey, let's move the direct object after the root, and the subject before it. |
|||
<pre> |
|||
# HEAD DEPENDENT RELPOS ORDER |
|||
si='root' lem='.' .*? x1.x2 |
|||
si='root' si='dobj' .*? x1.x2 |
|||
si='root' si='nsubj' .*? x2.x1 |
|||
</pre> |
|||
So, that is a little better, it's still pretty mangled though. |
|||
<center> |
|||
{|class=wikitable |
|||
! Word || will || I || drink || yesterday || for || me || bought || that || you || the || beer || . |
|||
|- |
|||
| <code>ord</code> || 0 || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 |
|||
|- |
|||
|} |
|||
</center> |
|||
Let's try moving the auxiliary "will" to before the verb. Note that it is already before the verb, but we want to order it ''right next to'' the verb. For that we need another order statement: instead of <code>x2.x1</code>, which moves the dependent before its head, we need <code>x2+x1</code>, which means "order <code>x2</code> '''right before''' <code>x1</code>".
|||
<pre> |
|||
# HEAD DEPENDENT RELPOS ORDER |
|||
si='root' lem='.' .*? x1.x2 |
|||
si='root' si='dobj' .*? x1.x2 |
|||
si='root' si='nsubj' .*? x2.x1 |
|||
si='root' si='aux' .*? x2+x1 |
|||
</pre> |
|||
<center> |
|||
{|class=wikitable |
|||
! Word || I || will || drink || yesterday || for || me || bought || that || you || the || beer || . |
|||
|- |
|||
| ||colspan=3| || advmod || case || nmod || acl || dobj || nsubj || det || dobj || |
|||
|- |
|||
| ||align="center"| {{sc|nsubj}} ||align="center"| {{sc|aux}} ||align="center"| {{sc|root}} ||align="center" colspan=8| {{sc|dobj}} || {{sc|punct}} |
|||
|- |
|||
| <code>ord</code> || 0 || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 |
|||
|- |
|||
|} |
|||
</center> |
|||
If we look at the tree, we will see that we now have the correct order for the dependents of the root node. But the order of the dependents of the direct object (principally the relative clause) is still totally messed up. Let's try and fix it. The first thing we want to do is put a relative clause after the noun it depends on.
|||
<pre> |
|||
# HEAD DEPENDENT RELPOS ORDER |
|||
si='root' lem='.' .*? x1.x2 |
|||
si='root' si='dobj' .*? x1.x2 |
|||
si='root' si='nsubj' .*? x2.x1 |
|||
si='root' si='aux' .*? x2+x1 |
|||
pos='n' si='acl' .*? x1.x2 |
|||
</pre> |
|||
Testing it, we get a bit better order: |
|||
<center> |
|||
{|class=wikitable |
|||
! Word || I || will || drink || the || beer || yesterday || for || me || that || you || bought || . |
|||
|- |
|||
| <code>ord</code> || 0 || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 |
|||
|- |
|||
|} |
|||
</center> |
|||
Now let's try linearising the rest of the relative clause: |
|||
<pre> |
|||
# HEAD DEPENDENT RELPOS ORDER |
|||
si='root' lem='.' .*? x1.x2 |
|||
si='root' si='dobj' .*? x1.x2 |
|||
si='root' si='nsubj' .*? x2.x1 |
|||
si='root' si='aux' .*? x2+x1 |
|||
pos='n' si='acl' .*? x1.x2 |
|||
si='acl' si='nsubj' .*? x2.x1 |
|||
si='acl' si='nmod' .*? x1.x2 |
|||
si='acl' si='advmod' .*? x1.x2 |
|||
</pre> |
|||
Which gives: |
|||
<center> |
|||
{|class=wikitable |
|||
! Word || I || will || drink || the || beer || that || you || bought || yesterday || for || me || . |
|||
|- |
|||
| <code>ord</code> || 0 || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 |
|||
|- |
|||
|} |
|||
</center> |
|||
So, there is only one weirdness left with the word order now: "yesterday" should come after "for me". The placement of adverbs in English can be a bit unpredictable, so we can either leave it like this, or we can change the <code>nmod</code> placement rule to place <code>nmod</code> directly after the verb, i.e.
|||
<pre> |
|||
si='acl' si='nmod' .*? x1+x2 |
|||
</pre> |
|||
So, let's run all of those rules on the XML and see what we come out with: |
|||
<pre> |
|||
$ cat input.txt | cg-proc -f 2 ../matxin-tur/tur.deprlx.bin | matxin-transfer tur-eng.t1x.bin |\ |
|||
matxin-linearise matxin-tur-eng.tur-eng.l1x |
|||
<corpus> |
|||
<SENTENCE ord="1" ref="1" alloc="0"> |
|||
<NODE ord="2" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg"> |
|||
<NODE ord="4" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc"> |
|||
<NODE ord="7" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg"> |
|||
<NODE ord="11" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/> |
|||
<NODE ord="9" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg"> |
|||
<NODE ord="8" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/> |
|||
</NODE> |
|||
<NODE ord="5" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/> |
|||
<NODE ord="6" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom"/> |
|||
</NODE> |
|||
<NODE ord="3" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp"/> |
|||
</NODE> |
|||
<NODE ord="12" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/> |
|||
<NODE ord="1" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri"/> |
|||
<NODE ord="0" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom"/> |
|||
</NODE> |
|||
</SENTENCE> |
|||
</corpus> |
|||
</pre> |
|||
==Generation== |
==Generation== |
||
In Matxin, generation is done by the program <code>matxin-generate</code>, which takes two arguments: a file with a cascade of stylesheets and a compiled finite-state transducer. The cascade is used to organise the attributes of the XML into feature-strings suitable to be passed to the finite-state transducer to generate the morphological forms.
|||
lttoolbox | hfst |
|||
===Morphological dictionary=== |
|||
So, what does a morphological dictionary look like? Again, that is mostly outside the scope of this howto, but for ease of copy/paste, let's go through it here, taking an [[lttoolbox]] dictionary as an example.
|||
To start with, we change into the directory <code>matxin-eng</code> and we create a new file <code>matxin-eng.eng.dix</code>. The file will have the skeleton structure: |
|||
<pre> |
|||
<dictionary> |
|||
<alphabet/> |
|||
<sdefs> |
|||
<sdef n="mi"/> |
|||
</sdefs> |
|||
<pardefs> |
|||
</pardefs> |
|||
<section id="main" type="standard"> |
|||
</section> |
|||
</dictionary> |
|||
</pre> |
|||
This structure will seem familiar if you read the section on lexical transfer (if you didn't read it, it's [[#Lexical transfer|up there]]). Instead of translating between source words and target words, the morphological generator translates between lexical forms (combinations of lemmas and tags) and surface forms. Let's take a look at the words we need to generate: |
|||
{|class=wikitable |
|||
! Lemma !! POS !! Forms |
|||
|- |
|||
| beer || {{tag|n}} || beer, beers |
|||
|- |
|||
| buy || {{tag|v}} || buy, buys, bought, bought |
|||
|- |
|||
| drink || {{tag|v}} || drink, drinks, drank, drunk |
|||
|- |
|||
| for || {{tag|pr}} || for |
|||
|- |
|||
| I || {{tag|prn}} || I, me |
|||
|- |
|||
| the || {{tag|det}} || the |
|||
|- |
|||
| will || {{tag|vaux}} || will, would |
|||
|- |
|||
| yesterday || {{tag|adv}} || yesterday |
|||
|- |
|||
| you || {{tag|prn}} || you, you |
|||
|- |
|||
|} |
|||
Given these words, there isn't much that we can do paradigmatically: each word needs a separate paradigm. So let's just start with the noun "beer", whose paradigm is going to be:
|||
<pre> |
|||
<pardef n="beer__n"> |
|||
<e><p><l></l><r><s n="mi"/>n|sg</r></p></e> |
|||
<e><p><l>s</l><r><s n="mi"/>n|pl</r></p></e> |
|||
</pardef> |
|||
</pre> |
|||
and then the entry in the main <code>section</code>: |
|||
<pre> |
|||
<e lm="beer"><i>beer</i><par n="beer__n"/></e> |
|||
</pre> |
|||
Save the dictionary and go to the terminal, where you can do two things. The first is to compile the dictionary:
|||
<pre> |
|||
$ lt-comp rl matxin-eng.eng.dix eng.autogen.bin |
|||
main@standard 14 14 |
|||
</pre> |
|||
You can also print out all the strings recognised by the dictionary using <code>lt-expand</code>: |
|||
<pre> |
|||
$ lt-expand matxin-eng.eng.dix |
|||
beer:beer<mi>n|sg |
|||
beers:beer<mi>n|pl |
|||
</pre> |
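To see where those two lines come from, here is a rough Python sketch of the expansion: each entry's stem is combined with every <code>l</code>/<code>r</code> pair of the paradigm it points to. This is a simplification for illustration only, not how <code>lt-expand</code> is implemented; it handles just the <code>i</code>, <code>par</code>, <code>p</code>, <code>l</code>, <code>r</code> and <code>s</code> elements used in our dictionary, and the function names are made up.

```python
import xml.etree.ElementTree as ET

DIX = """
<dictionary>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="mi"/>n|sg</r></p></e>
      <e><p><l>s</l><r><s n="mi"/>n|pl</r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
  </section>
</dictionary>
"""

def side_text(elem):
    """Serialise the content of <l>, <r> or <i>, writing <s n="x"/> as <x>."""
    out = elem.text or ""
    for child in elem:
        if child.tag == "s":
            out += "<" + child.get("n") + ">"
        out += child.tail or ""
    return out

def expand(dix):
    """Return surface:lexical strings for every entry in the main section."""
    pardefs = {p.get("n"): p for p in dix.iter("pardef")}
    pairs = []
    for entry in dix.find("section").iter("e"):
        stem = side_text(entry.find("i"))
        for e in pardefs[entry.find("par").get("n")].iter("e"):
            left = side_text(e.find("p/l"))    # surface-side suffix
            right = side_text(e.find("p/r"))   # lexical-side tags
            pairs.append(stem + left + ":" + stem + right)
    return pairs

pairs = expand(ET.fromstring(DIX))
print("\n".join(pairs))
```

The left side of each pair is the surface form and the right side is the lexical form, which is what the generator will be asked to look up.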
|||
The remainder of the vocabulary is left as an exercise for the reader. |
|||
===Generation rules=== |
|||
Generation rules take a node and its attributes and produce a new attribute, <code>mi</code> that has all the information necessary to pass to the morphological generator. They are written in the same XSLT format as the transfer rules. |
|||
We start out with a file, let's call it <code>matxin-eng.eng.gnx</code>:
|||
<pre> |
|||
<generate> |
|||
</generate> |
|||
</pre> |
|||
Then we add a rule to generate the <code>mi</code> attribute for nouns: |
|||
<pre> |
|||
<def-rule comment="Generate the morphological information for nouns"> |
|||
<template match="//NODE[@pos = 'n']"> |
|||
<copy> |
|||
<attr name="mi"><value-of select="concat(@pos, '|', @nbr)"/></attr> |
|||
<copy-of select="@*[name()!='mi'] | *"/> |
|||
</copy> |
|||
</template> |
|||
</def-rule> |
|||
</pre> |
|||
This rule basically says, match all nouns <code>//NODE[@pos = 'n']</code> and create a new attribute, <code>mi</code> which is the concatenation of the attribute <code>pos</code>, the string literal <code>|</code> and the attribute <code>nbr</code>. The result of this concatenation for the node containing <code>lem="beer" pos="n" nbr="sg"</code> will be <code>mi="n|sg"</code>. |
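The concatenation itself is simple enough to mimic in a couple of lines of Python. The sketch below only imitates what <code>concat(@pos, '|', @nbr)</code> computes (in XPath a missing attribute contributes an empty string); <code>noun_mi</code> is a hypothetical name, and nodes are represented as plain dictionaries.

```python
def noun_mi(attrs):
    """Mirror concat(@pos, '|', @nbr); a missing attribute becomes ''."""
    return attrs.get("pos", "") + "|" + attrs.get("nbr", "")

beer = {"lem": "beer", "pos": "n", "nbr": "sg"}
undetermined = {"lem": "beer", "pos": "n", "nbr": "ND"}

print(noun_mi(beer))          # n|sg
print(noun_mi(undetermined))  # n|ND
```

The second call shows why the transfer rule that replaces <code>ND</code> matters: our dictionary only has entries for <code>n|sg</code> and <code>n|pl</code>, so a string like <code>n|ND</code> would have nothing to match.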
|||
If we save the file and compile it: |
|||
<pre> |
|||
$ matxin-preprocess-generate matxin-eng.eng.gnx eng.gnx.bin |
|||
1 rules processed. |
|||
</pre> |
|||
We can now test it in the whole pipeline... first switch directory to <code>matxin-tur-eng</code>, then: |
|||
<pre> |
|||
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin |\ |
|||
matxin-linearise matxin-tur-eng.tur-eng.l1x | matxin-generate ../matxin-eng/eng.gnx.bin ../matxin-eng/eng.autogen.bin |
|||
<?xml version="1.0"?> |
|||
<corpus> |
|||
<SENTENCE ord="1" ref="1" alloc="0"> |
|||
<NODE ord="2" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg" form="%drink|v|tv|fut|p1|sg"> |
|||
<NODE mi="n|sg" ord="4" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc" form="beer"> |
|||
<NODE ord="7" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg" form="%buy|v|tv|gpr_past|px2sg"> |
|||
<NODE ord="11" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv" form="%yesterday|adv"/> |
|||
<NODE ord="9" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" form="%I|prn|pers|p1|sg|gen"> |
|||
<NODE ord="8" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr" form="%for|post"/> |
|||
</NODE> |
|||
<NODE ord="5" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp" form="=that"/> |
|||
<NODE ord="6" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom" form="=you"/> |
|||
</NODE> |
|||
<NODE ord="3" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp" form="=the"/> |
|||
</NODE> |
|||
<NODE ord="12" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent" form="%.|sent"/> |
|||
<NODE ord="1" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri" form="=will"/> |
|||
<NODE ord="0" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom" form="=I"/> |
|||
</NODE> |
|||
</SENTENCE> |
|||
</corpus> |
|||
</pre> |
|||
It's a bit difficult to pick out from the XML, but you can see that some nodes have a new attribute <code>form</code> and one node (the noun, "beer") has a new attribute <code>mi</code>.

* If there is an <code>mi</code> attribute, and the form can be generated by the morphological generator, then you get the correctly generated form.
* If there is an <code>mi</code> attribute, and the form cannot be generated by the morphological generator, then you get the symbol <code>#</code> followed by the target language lemma concatenated with the target language morphological information.
* If there is no <code>mi</code> attribute, then the form is the symbol <code>%</code> followed by the target language lemma concatenated with the source language morphological information. This tells you that you need to write a generation rule to correctly build the <code>mi</code> attribute.
* If there is neither an <code>smi</code> nor an <code>mi</code> attribute, then the form is <code>=</code> followed by the target language lemma.
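The four cases above can be summarised in a short Python sketch. The function name <code>describe_form</code> is hypothetical and not part of Matxin; it only looks at the first character of <code>form</code>, and it assumes that a correctly generated form never starts with <code>#</code>, <code>%</code> or <code>=</code>.

```python
def describe_form(form: str) -> str:
    """Classify a `form` value by the prefix conventions listed above."""
    if form.startswith("#"):
        return "mi present, but the form could not be generated"
    if form.startswith("%"):
        return "no mi attribute: a generation rule is missing"
    if form.startswith("="):
        return "no smi and no mi attribute"
    return "generated correctly"


# Three forms taken from the XML output above.
for form in ["beer", "%drink|v|tv|fut|p1|sg", "=will"]:
    print(f"{form!r}: {describe_form(form)}")
```

Running a check like this over the output is a quick way to list which words still need generation rules.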
|||
===The reformatter=== |
|||
The reformatter basically takes the XML and iterates over the tree and outputs the forms of the nodes, so if we run it on the above tree: |
|||
<pre> |
|||
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin |\ |
|||
matxin-linearise matxin-tur-eng.tur-eng.l1x | matxin-generate ../matxin-eng/eng.gnx.bin ../matxin-eng/eng.autogen.bin |\ |
|||
matxin-reformat |
|||
=I =will %drink|v|tv|fut|p1|sg =the beer =that =you %buy|v|tv|gpr_past|px2sg %for|post %I|prn|pers|p1|sg|gen %yesterday|adv %.|sent |
|||
</pre> |
|||
This is far from an adequate sentence, but fixing it is mostly a matter of working through the remaining problems. For example, the words prefixed with <code>%</code> and <code>=</code> probably need to be added to the morphological generator (<code>matxin-eng.eng.dix</code>) and have generation rules written for them (in <code>matxin-eng.eng.g1x</code>).
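As a rough illustration of what the reformatter does, here is a Python sketch that walks a cut-down version of the tree and prints the forms. It assumes that the words are emitted in order of the <code>ord</code> attribute, which is consistent with the output shown above but is not taken from the <code>matxin-reformat</code> source; the name <code>reformat_sketch</code> is hypothetical.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the generated tree shown above.
SAMPLE = """
<corpus>
  <SENTENCE ord="1" ref="1" alloc="0">
    <NODE ord="2" lem="drink" form="%drink|v|tv|fut|p1|sg">
      <NODE ord="4" lem="beer" form="beer">
        <NODE ord="3" lem="the" form="=the"/>
      </NODE>
      <NODE ord="1" lem="will" form="=will"/>
      <NODE ord="0" lem="I" form="=I"/>
    </NODE>
  </SENTENCE>
</corpus>
"""


def reformat_sketch(xml_text):
    """Return one string per SENTENCE: the node forms, sorted by `ord`."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("SENTENCE"):
        # iter() visits every NODE below the sentence, at any depth.
        nodes = sorted(sentence.iter("NODE"), key=lambda n: int(n.get("ord")))
        sentences.append(" ".join(n.get("form") for n in nodes))
    return sentences


print(reformat_sketch(SAMPLE))
```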
|||
==Troubleshooting== |
|||
===Nodes aren't output=== |
|||
Sometimes you'll find that nodes you want to be output are missing. Check your <code>copy-of</code> and <code>apply-templates</code> statements. If you have:
|||
<pre> |
|||
<copy-of select="@*"/> |
|||
</pre> |
|||
You should change it to: |
|||
<pre> |
|||
<copy-of select="@* | *"/> |
|||
</pre> |
|||
The <code>@* | *</code> means "all attributes and all subnodes". |
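The difference between the two selections can be mimicked outside XSLT. The Python sketch below is only an analogy (the helper <code>copy_node</code> is hypothetical): copying attributes alone loses the subtree, while copying attributes and subnodes keeps it.

```python
import xml.etree.ElementTree as ET


def copy_node(node, with_children):
    """Copy a node's attributes and, optionally, its subnodes."""
    new = ET.Element(node.tag, dict(node.attrib))  # like select="@*"
    if with_children:                              # like select="@* | *"
        for child in node:
            new.append(copy_node(child, True))
    return new


src = ET.fromstring('<NODE lem="beer"><NODE lem="the"/></NODE>')
print(ET.tostring(copy_node(src, False), encoding="unicode"))  # subnode is lost
print(ET.tostring(copy_node(src, True), encoding="unicode"))   # subnode is kept
```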
|||
==Notes== |
|||
<references/> |
|||
==See also== |
[[Category:Matxin]] |
[[Category:Documentation in English]] |
Latest revision as of 04:05, 21 January 2017
This page refers to Matxin 2.0, for Matxin 1.0, see: Matxin 1.0 New Language Pair HOWTO
This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.
Preliminaries
Make a directory called matxin-tur-eng. Then make two more directories, matxin-tur and matxin-eng.
Note that if you are following this HOWTO for your own language pair, tur should be replaced with the ISO 639-3 code of the source language and eng with the ISO 639-3 code of the target language.
Analysis
There are a number of ways analysis can be done in Matxin, the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO, but for more information, check out the following pages:
So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Save this output into a file, perhaps called input.txt. We'll need it later.
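If you want to inspect the analyser output programmatically, the stream format shown above is easy to take apart. The following Python sketch assumes one disambiguated analysis per word and no escaped characters; the names <code>UNIT</code> and <code>parse_units</code> are made up for this example.

```python
import re

# One lexical unit looks like ^surface/lemma<tag1><tag2>...$
UNIT = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")


def parse_units(line):
    """Return (surface, lemma, [tags]) for every lexical unit in a line."""
    return [
        (surface, lemma, re.findall(r"<([^>]+)>", tags))
        for surface, lemma, tags in UNIT.findall(line)
    ]


line = "^Dün/dün<adv>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$"
for unit in parse_units(line):
    print(unit)
```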
Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:
DELIMITERS = "." ;
Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.
LIST Adv = adv;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST N = n ;
LIST Acc = acc;
LIST Gen = gen;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms
Etc. It might seem a bit odd to just be repeating the tag in title case, but the point is that it allows us to distinguish CG tags and tag groups from those in the input morphological analyser.
So, now we've got some morphological information we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.
After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:
SECTION
In constraint grammar, all rules come in sections.
So, now a rule, let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:
MAP @advmod TARGET Adv ;
The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.
So now let's save the file and try it out! First though we need to compile the rules:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
And now try it out:
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Great, it works... so now we can do basically the same rule for the verbal adjective, aldığın, which should get an @acl tag, the postposition, which should get a @case tag, and the accusative, which should get a @dobj tag.
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
Save it and try it again:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
Great, now for a couple of harder relations, the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:
MAP @nmod TARGET Pers IF (1 Post) ;
And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
So try those two rules out and we should have a fully labelled input sentence:
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$
That covers our sentence, but we want to add one more rule "just in case": a default rule that adds the relation @dep to any token that isn't covered by the other rules:
MAP (@dep) TARGET (*) ;
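To make the logic of the mapping section concrete, here is a small Python simulation of the rules above applied to the example sentence. It is a simplification and not how CG is implemented: it assumes one reading per token and lets the first matching rule win, in the order the rules appear in the file. The names SENTENCE, FIN, has and map_relations are made up for this sketch.

```python
# Each token is (lemma, set of tags), taken from the example sentence.
SENTENCE = [
    ("dün", {"adv"}),
    ("ben", {"prn", "pers", "p1", "sg", "gen"}),
    ("için", {"post"}),
    ("al", {"v", "tv", "gpr_past", "px2sg"}),
    ("bira", {"n", "acc"}),
    ("iç", {"v", "tv", "fut", "p1", "sg"}),
    (".", {"sent"}),
]
FIN = {"fut", "aor", "past"}  # LIST Fin


def has(tokens, i, wanted):
    """True if position i exists and carries at least one of the wanted tags."""
    return 0 <= i < len(tokens) and bool(tokens[i][1] & wanted)


def map_relations(tokens):
    """Apply the mapping rules in file order; the first rule that fits wins."""
    labels = []
    for i, (_, tags) in enumerate(tokens):
        if "adv" in tags:
            label = "@advmod"
        elif "post" in tags:
            label = "@case"
        elif "gpr_past" in tags:
            label = "@acl"
        elif "acc" in tags:
            label = "@dobj"
        elif "sent" in tags:
            label = "@punct"
        elif {"prn", "pers"} <= tags and has(tokens, i + 1, {"post"}):
            label = "@nmod"   # IF (1 Post)
        elif (tags & FIN and has(tokens, i + 1, {"sent"})
              and not any(has(tokens, j, FIN) for j in range(i))):
            label = "@root"   # IF (NOT -1* Fin) (1 Sent)
        else:
            label = "@dep"    # the default rule
        labels.append(label)
    return labels


print(map_relations(SENTENCE))
```

Note that aldığın is not treated as finite here because its tag is gpr_past, not past, which matches how the Fin list is defined.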
Tree building
Now that we have a fully labelled sentence, we can start building the tree. First make another section:
SECTION
The first thing we want to do is attach the root node:
SETPARENT @root TO (@0 (*)) ;
This basically says, set the parent of the root node to node 0, which is the invisible root that CG uses. Next we want to attach all of the rest of the nodes to this root node:
SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;
Here we have an extra condition, (NEGATE p (*)), which means that the matched node should not already have a parent.
Let's test the rules, so save the file, and go to the terminal:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="1" alloc="0">
  <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
    <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
    <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/>
    <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
    <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
    <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"/>
    <NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/>
  </NODE>
</SENTENCE>
</corpus>
Note how here we have passed the -f 2 parameter to the cg-proc program, so now the output is in Matxin XML format.
This is a pretty sad looking tree :( ... But fortunately using rules we can make it a happy tree! One that correctly implements the annotation guidelines.
So, let's think of some rules:
- The direct object should depend on the finite verb
- A postposition should depend on its head
- A relative clause should modify (depend on) a noun
- An adverb should modify a verb
Let's start with the first one:
SETPARENT @dobj TO (1* Fin) ;
This rule says that the parent of the direct object should be the first finite verb anywhere to the right. Next up:
SETPARENT @case TO (-1 Pers) ;
Set the parent of the word with the @case label to be the previous personal pronoun. And then:
SETPARENT @acl TO (1 N) ;
Set the parent of the word with the @acl label to be the following noun.
Let's save the file and apply the rules:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="1" alloc="0">
  <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
    <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
    <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod">
      <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
    </NODE>
    <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj">
      <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
    </NODE>
    <NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/>
  </NODE>
</SENTENCE>
</corpus>
This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase benim için and the adverb Dün to the appropriate verb, which in this case is the head of the relative clause aldığın.
So we could specify the rule something like: Set the parent of a nominal modifier or an adverb to be the first verb to the right. This rule happens to work in this case, but is not very robust.
SETPARENT @advmod TO (1* V BARRIER V) ;
SETPARENT @nmod TO (1* V BARRIER V) ;
The 1* X BARRIER Y instruction here means that the parser should read to the right looking to match context X (a verb), but it should stop if it finds context Y (in this case another verb).
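The scanning behaviour can be sketched in a few lines of Python. This is only a model of the idea, with hypothetical names (TAGS, scan_right); in particular it assumes that a token is tested against the target before it is tested against the barrier, which is what makes 1* V BARRIER V stop at the first verb it meets.

```python
# Tag sets for: Dün benim için aldığın birayı içeceğim .
TAGS = [
    {"adv"}, {"prn", "pers"}, {"post"},
    {"v", "gpr_past"}, {"n", "acc"}, {"v", "fut"}, {"sent"},
]


def scan_right(tags_per_token, start, target, barrier):
    """Model of `1* target BARRIER barrier`, scanning rightwards from `start`.

    Returns the index of the first token carrying `target`, or None if a
    token carrying `barrier` (or the end of the sentence) is reached first.
    """
    for i in range(start + 1, len(tags_per_token)):
        if target in tags_per_token[i]:
            return i
        if barrier in tags_per_token[i]:
            return None
    return None


print(scan_right(TAGS, 0, "v", "v"))  # adverb at position 0 -> verb at position 3
print(scan_right(TAGS, 0, "n", "v"))  # a verb is reached before any noun -> None
```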
We can test these rules and see the output:
$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
<SENTENCE ord="1" alloc="0">
  <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
    <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj">
      <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl">
        <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
        <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod">
          <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
        </NODE>
      </NODE>
    </NODE>
    <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
  </NODE>
</SENTENCE>
</corpus>
Yesss! Now we have a nice tree ready to be translated!
Lexical transfer
The first stage of transfer is lexical transfer. This is where we take the tree that we have just constructed, and we translate the words, and convert the morphological information into attributes in the tree. This is done using an lttoolbox dictionary in a form that will be familiar to those who have used Apertium before. There are however a number of differences which will be explained below.
In any case, first we need to change directory to matxin-tur-eng and to make a new file, matxin-tur-eng.tur-eng.dix. In this file we start by defining our attributes:
<dictionary>
  <sdefs>
    <sdef n="mi" c="Morphological information"/>
    <sdef n="pos" c="Part of speech"/>
    <sdef n="nbr" c="Number"/>
    <sdef n="prs" c="Person"/>
    <sdef n="cas" c="Case"/>
    <sdef n="tns" c="Tense"/>
  </sdefs>
</dictionary>
The attributes are defined by <sdef> tags, which stands for symbol (in this case an attribute) definition. After this we can start by adding a new section for our first entry:
<section id="main" type="standard">
</section>
And within that section, our entry:
<e><p><l>dün<s n="mi"/>adv</l><r>yesterday<s n="pos"/>adv</r></p></e>
So now we can save this file and compile it:
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 14 13
Note that the lr means to compile the dictionary (which is in fact a finite-state transducer) from left to right, translating from Turkish to English. Unlike in Apertium, in Matxin the bilingual dictionaries are unidirectional.
You can try it out using the matxin-xfer-lex command:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
<SENTENCE ref="1" alloc="0">
  <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
    <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="bira" mi="n|acc" unknown="transfer">
      <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
        <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
        <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
          <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="için" mi="post" unknown="transfer"/>
        </NODE>
      </NODE>
    </NODE>
    <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." mi="sent" unknown="transfer"/>
  </NODE>
</SENTENCE>
</corpus>
You'll note that the XML output has substantially changed. Some of the attributes have been renamed, and some new attributes have been created. Probably the most obvious thing is that for all of the words except dün "yesterday", there is a new attribute unknown with the value transfer. This means that the word could not be looked up in the bilingual dictionary, which is not surprising as we only added one word. How about the other attributes?
- mi → smi: Morphological information is now "source morphological information"
- lem → slem: Lemma is now "source lemma"
- UpCase: To do with getting casing right
- mi: This is copied from the smi
- lem: With unknown words this is copied from the source, otherwise the translation is inserted.
- pos: This is the attribute that we defined in our bilingual dictionary.
So with that in mind we can start translating the other words:
<e><p><l>için<s n="mi"/>post</l><r>for<s n="pos"/>pr</r></p></e>
<e><p><l>.<s n="mi"/>sent</l><r>.<s n="pos"/>sent</r></p></e>
These two are easy, they work just like the adverb. For categories that inflect however, things get a touch more complex, as we need to convert the morphological information into feature attributes in the XML. We do this using paradigms.
Paradigms
Paradigms are used to convert strings of input tags into features for the tree. The section they belong in goes after the sdefs section and before the section id="main". Let's start looking at nouns:
<pardefs>
  <pardef n="n__n">
    <e><p><l>|nom</l><r><s n="cas"/>nom</r></p></e>
    <e><p><l>|acc</l><r><s n="cas"/>acc</r></p></e>
  </pardef>
</pardefs>
This paradigm converts the strings |nom and |acc into the attribute cas with the values of nom and acc respectively.
Now we've defined the paradigm, we can try using it. Go back to the main section, and add:
<e><p><l>bira<s n="mi"/>n</l><r>beer<s n="pos"/>n</r></p><par n="n__n"/></e>
Then compile the dictionary:
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 38 41
And test it:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
<SENTENCE ref="1" alloc="0">
  <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
    <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
      <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
        <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
        <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
          <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
        </NODE>
      </NODE>
    </NODE>
    <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  </NODE>
</SENTENCE>
</corpus>
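One way to picture what the entry for bira and the n__n paradigm amount to together is a lookup table from source strings to target attributes. The Python sketch below is only an illustration: the compiled dictionary is really a finite-state transducer, and the names ENTRY, PARADIGMS and expand are hypothetical.

```python
# Hypothetical in-memory picture of the entry for "bira" and the n__n paradigm.
ENTRY = {"source": "bira|n", "target": {"lem": "beer", "pos": "n"}, "par": "n__n"}
PARADIGMS = {
    "n__n": {"|nom": {"cas": "nom"}, "|acc": {"cas": "acc"}},
}


def expand(entry, paradigms):
    """Combine an entry with each line of its paradigm into a lookup table."""
    table = {}
    for suffix, attrs in paradigms[entry["par"]].items():
        # Source side: stem plus paradigm suffix; target side: merged attributes.
        table[entry["source"] + suffix] = {**entry["target"], **attrs}
    return table


TABLE = expand(ENTRY, PARADIGMS)
print(TABLE["bira|n|acc"])
```

The looked-up attributes correspond to lem="beer" pos="n" cas="acc" on the bira node in the output above.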
As we move onto verb forms, it's worth noting that a paradigm can call other paradigms, for example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms like:
<pardef n="num">
  <e><p><l>|sg</l><r><s n="nbr"/>sg</r></p></e>
  <e><p><l>|pl</l><r><s n="nbr"/>pl</r></p></e>
</pardef>
<pardef n="pers">
  <e><p><l>|p1</l><r><s n="prs"/>p1</r></p><par n="num"/></e>
  <e><p><l>|p2</l><r><s n="prs"/>p2</r></p><par n="num"/></e>
  <e><p><l>|p3</l><r><s n="prs"/>p3</r></p><par n="num"/></e>
</pardef>
<pardef n="tense">
  <e><p><l>|fut</l><r><s n="vtype"/>fin<s n="tns"/>fut</r></p><par n="pers"/></e>
</pardef>
<pardef n="tv__v">
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
</pardef>
Then in the main section:
<e><p><l>iç<s n="mi"/>v</l><r>drink<s n="pos"/>v</r></p><par n="tv__v"/></e>
<e><p><l>al<s n="mi"/>v</l><r>buy<s n="pos"/>v</r></p><par n="tv__v"/></e>
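The saving from chaining paradigms can be seen by enumerating what the chain tv__v → tense → pers → num covers. The Python sketch below is only an illustration with made-up names (VAL, TENSE, PERS, NUM, expand_chain); it multiplies out the paradigms the way following every path through them would.

```python
from itertools import product

# Each paradigm maps a source tag to the target attributes it contributes.
VAL = {"tv": {"val": "tv"}}
TENSE = {"fut": {"vtype": "fin", "tns": "fut"}}
PERS = {p: {"prs": p} for p in ("p1", "p2", "p3")}
NUM = {n: {"nbr": n} for n in ("sg", "pl")}


def expand_chain(*paradigms):
    """Follow the chain of paradigms and list every source/target pair."""
    table = {}
    for combo in product(*(p.items() for p in paradigms)):
        source = "|" + "|".join(tag for tag, _ in combo)
        target = {}
        for _, attrs in combo:
            target.update(attrs)
        table[source] = target
    return table


VERB_FORMS = expand_chain(VAL, TENSE, PERS, NUM)
print(len(VERB_FORMS))              # 1 valency x 1 tense x 3 persons x 2 numbers
print(VERB_FORMS["|tv|fut|p1|sg"])
```

Adding a second tense to TENSE would double the number of covered forms without touching the person or number paradigms.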
This will take care of the finite verb, içeceğim "I will drink", but what do we do with the non-finite verb aldığın "that you bought"? The structure is very different in Turkish and English, and although changing the structure is part of structural transfer, we need to get the attributes in order to enable us to properly do transfer. So in this case what we can do is have a paradigm setup something like:
<pardef n="px__pers">
  <e><p><l>|px1sg</l><r><s n="prs"/>p1<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px2sg</l><r><s n="prs"/>p2<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px3sg</l><r><s n="prs"/>p3<s n="nbr"/>sg</r></p></e>
  <e><p><l>|px1pl</l><r><s n="prs"/>p1<s n="nbr"/>pl</r></p></e>
  <e><p><l>|px2pl</l><r><s n="prs"/>p2<s n="nbr"/>pl</r></p></e>
  <e><p><l>|px3pl</l><r><s n="prs"/>p3<s n="nbr"/>pl</r></p></e>
</pardef>
<pardef n="nonfin">
  <e><p><l>|gpr_past</l><r><s n="vtype"/>gpr<s n="tns"/>past</r></p><par n="px__pers"/></e>
</pardef>
And then we update the previously defined tv__v paradigm thusly:
<pardef n="tv__v">
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
  <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="nonfin"/></e>
</pardef>
So, what does the code above do? We turn the non-finite style agreement markers (using the possessive morpheme, px1sg, px2sg, etc.) into finite agreement (p1.sg, p2.sg, etc.), and we set the verb type to verbal adjective, gpr. Then in the structural transfer, we will be able to match the verb type attribute when it is gpr and use the other attributes to construct a finite relative clause in English.
Let's save the file and compile and test again...
$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 75 86
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
<SENTENCE ref="1" alloc="0">
  <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
    <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
      <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
        <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
        <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
          <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
        </NODE>
      </NODE>
    </NODE>
    <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  </NODE>
</SENTENCE>
</corpus>
Looking good, now there is only one remaining item, the personal pronoun ben "I".
Personal pronouns can be quite idiosyncratic to translate between different languages, so we're just going to add a new paradigm for it:
<pardef n="ben__I">
  <e><p><l>|p1|sg|nom</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>nom</r></p></e>
  <e><p><l>|p1|sg|acc</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>acc</r></p></e>
  <e><p><l>|p1|sg|gen</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>gen</r></p></e>
</pardef>
And then the entry in the main section:
<e><p><l>ben<s n="mi"/>prn|pers</l><r>I<s n="pos"/>prn|pers</r></p><par n="ben__I"/></e>
We set the pos specifically as personal pronoun instead of just pronoun because personal pronouns often need to be treated differently from other pronoun types. The final output we should get is:
<corpus>
<SENTENCE ref="1" alloc="0">
  <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
    <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
      <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
        <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
        <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
          <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
        </NODE>
      </NODE>
    </NODE>
    <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  </NODE>
</SENTENCE>
</corpus>
Underspecification
Sometimes a language will underspecify some morphological feature; this often happens with gender or number. In our previous example with bira "beer", the Turkish form without an affix is underspecified for number. It could be singular or plural, for example 1 bira "1 beer", but 5 bira "5 beers".
What should we do about this? Well, one thing we can do is add the feature in lexical transfer but with a value saying that we don't know what it should be and that it should be dealt with in transfer. Normally we use ND for "number to be determined" and GD for "gender to be determined" (for example, in translating the third person pronoun from Turkish to English we would need to set the gender to be determined in transfer).
We can set the feature like all of the other features, so just go to where you have the n__n paradigm defined, and update it to look like:
<pardef n="n__n">
  <e><p><l>|nom</l><r><s n="nbr"/>ND<s n="cas"/>nom</r></p></e>
  <e><p><l>|acc</l><r><s n="nbr"/>ND<s n="cas"/>acc</r></p></e>
</pardef>
And now we're ready for structural transfer! :D
Structural transfer
In Matxin, structural transfer is done by applying a cascade of rules written in the XSLT programming language. XSLT is basically a complete language for tree transformations. This HOWTO will not give a complete overview of the language... there would be far too much to write. But you can use a search engine to find out about it.
But before we start working on encoding the rules as XSLT, we should get clear what we want to do in terms of linguistics. To help us with that, let's take a look at two trees:
So, given these two trees, we can discover some obvious stuff, like:
- A definite accusative noun in Turkish birayı, should get a definite article in English.
- The synthetic future in Turkish, içeceğim, corresponds to the analytic future in English, will drink.
- Subject pronouns for both the main clause and relative clause should be added.
- A relative pronoun should be added to stand in for the direct object in the relative clause headed by aldığın.
And a couple of non-obvious, morphological things:
- If there is no dependent numeral greater than one, set the number of nouns to singular.
- Set the case of a pronoun with a dependent adposition to accusative.
Starting out
Let's create a new file called matxin-tur-eng.tur-eng.t1x:
<transfer>
</transfer>
Rules are defined as XSLT templates within a def-rule tag, so let's start one:
<def-rule comment="Add definite article if overt accusative">
</def-rule>
The comment is free-form, you can write what you like.
Now let's get to the main part of the rule: we want to match any NODE in the tree where the pos attribute is n and the cas attribute is acc. We specify that we only want to match nouns because in Turkish pronouns may also be in the accusative (in which case we don't want to add a definite article), and complement clauses are also often marked with accusative.
Matching in XSLT is done with XPath, a way of finding node sets in a tree. Expressing the pattern above:
<template match="//NODE[@pos = 'n' and @cas = 'acc']">
</template>
Attributes are referred to with the prefix @, the = is equal-to and not assignment, and // means search the whole tree. So, what do we want to do when we've found this set of nodes? We want to basically copy the nodes and add a new dependent node for the definite article:
<template match="//NODE[@pos = 'n' and @cas = 'acc']">
  <copy>
    <apply-templates select="@* | *"/>
    <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
  </copy>
</template>
This is one pattern for working with transfer in Matxin... The copy directive means that the output tree will have the nodes that are in the source tree (by applying the apply-templates instruction), and in addition it will have a subnode which is the definite article. In the apply-templates instruction, the select="@* | *" part means that it should be applied to all attributes, @*, and also to all subnodes, *.
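If XSLT is new to you, it may help to see the same match-and-add operation outside Matxin. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a small subset of XPath, so the and condition is written as two chained predicates. The function name add_definite_article and the miniature tree are made up for this example.

```python
import xml.etree.ElementTree as ET


def add_definite_article(tree):
    """Append a determiner node to every accusative noun in the tree."""
    # list() so that the tree is not modified while it is being searched.
    for noun in list(tree.iterfind(".//NODE[@pos='n'][@cas='acc']")):
        ET.SubElement(noun, "NODE",
                      {"si": "det", "lem": "the", "pos": "det|def", "nbr": "sp"})
    return tree


tree = ET.fromstring(
    '<SENTENCE>'
    '<NODE lem="drink" pos="v">'
    '<NODE lem="beer" pos="n" cas="acc"/>'
    '<NODE lem="I" pos="prn|pers" cas="acc"/>'
    '</NODE>'
    '</SENTENCE>'
)
add_definite_article(tree)
print(ET.tostring(tree, encoding="unicode"))
```

The accusative pronoun is left alone because its pos is not n, which is the reason the rule restricts the match to nouns.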
Let's save the rule, and compile it:
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
1 rules processed.
And then we can apply the rule using:
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
<SENTENCE ref="1" alloc="0">
  <CHUNK ref="0" type="root">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
        </NODE>
        <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
    </NODE>
  </CHUNK>
</SENTENCE>
</corpus>
We can use a very similar rule to get the auxiliary verb "will" added when the verb is finite and in future tense:
<def-rule comment="Add auxiliary verb 'will' if the head verb is finite and in the future tense">
  <template match="//NODE[@pos = 'v' and @vtype = 'fin' and @tns = 'fut']">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE si="aux" lem="will" pos="vaux" tns="pres"/>
    </copy>
  </template>
</def-rule>
Save the file and compile the rules again:
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
2 rules processed.
So, let's see what our Turkish tree looks like now:
Propagating information between nodes
It's looking good. Now let's try adding the relative pronoun. This is slightly more complicated as we want the function of the relative pronoun in relation to the head of its clause to be the same as the function of the word it modifies in relation to the matrix clause.[1] In XPath we can use .. to refer to the parent node.
<def-rule comment="Add a relative pronoun if the verb type is relative clause">
  <template match="//NODE[@pos = 'v' and @vtype = 'gpr']">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE><attr name="si"><value-of select="../@si"/></attr>
            <attr name="lem">that</attr>
            <attr name="pos">rel</attr>
            <attr name="ani">an</attr>
            <attr name="nbr">sp</attr></NODE>
    </copy>
  </template>
</def-rule>
The attributes are used for generation. ani is for animacy: some relative pronouns in English can only be used with animates (e.g. "who") and others with both animates and inanimates (e.g. "that", "which"). The sp value for nbr means that it is the same form for singular and plural.
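The step of copying the parent's si onto the new node can be illustrated with Python's standard xml.etree.ElementTree. This is only a sketch of the idea, not how Matxin runs the rule: ElementTree has no parent pointer, so a child-to-parent map is built first, and the name add_relative_pronoun and the miniature tree are made up for this example.

```python
import xml.etree.ElementTree as ET


def add_relative_pronoun(tree):
    """Give each verbal-adjective node a relative pronoun as a new subnode.

    The pronoun's `si` is copied from the parent of the verb, which is
    what `../@si` selects in the rule above.
    """
    # ElementTree has no parent pointer, so build a child -> parent map first.
    parent_of = {child: parent for parent in tree.iter() for child in parent}
    for verb in list(tree.iterfind(".//NODE[@pos='v'][@vtype='gpr']")):
        head = parent_of[verb]
        ET.SubElement(verb, "NODE", {
            "si": head.get("si", ""),
            "lem": "that", "pos": "rel", "ani": "an", "nbr": "sp",
        })
    return tree


tree = ET.fromstring(
    '<SENTENCE>'
    '<NODE lem="drink" pos="v" vtype="fin" si="root">'
    '<NODE lem="beer" pos="n" si="dobj">'
    '<NODE lem="buy" pos="v" vtype="gpr" si="acl"/>'
    '</NODE></NODE></SENTENCE>'
)
add_relative_pronoun(tree)
print(ET.tostring(tree, encoding="unicode"))
```

Because beer is the direct object of drink, the new pronoun under buy is labelled dobj as well.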
Conditional statements
Now there are only two words that need to be added, the subject pronouns. This is slightly more complicated because although we can inherit the person and number from the verb, in English the lemma is going to be different depending on the person and number. But never fear, choose, when is here! Basically choose, when, [otherwise] works like if, else if, [else] in other programming languages. As can be seen from the following example, you have a condition, test=, that is basically equivalent to an XPath expression.
<def-rule comment="Add subject pronouns to clauses that do not have them">
  <template match="//NODE[@pos = 'v' and not(.//NODE[@si = 'nsubj'])]">
    <copy>
      <apply-templates select="@* | *"/>
      <NODE><attr name="si">nsubj</attr>
        <choose>
          <when test="@prs = 'p1' and @nbr = 'sg'">
            <attr name="lem">I</attr>
          </when>
          <when test="@prs = 'p2' and @nbr = 'sg'">
            <attr name="lem">you</attr>
          </when>
        </choose>
        <attr name="pos">prn|pers</attr>
        <attr name="prs"><value-of select="@prs"/></attr>
        <attr name="nbr"><value-of select="@nbr"/></attr>
        <attr name="cas">nom</attr></NODE>
    </copy>
  </template>
</def-rule>
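For comparison, the choose and when block in the rule above corresponds to the following if and elif chain in Python. The function name subject_lemma is hypothetical; since the rule has no otherwise branch, the sketch returns None when neither condition holds.

```python
def subject_lemma(prs, nbr):
    """Pick the English subject pronoun, like the choose/when block above."""
    if prs == "p1" and nbr == "sg":      # first <when>
        return "I"
    elif prs == "p2" and nbr == "sg":    # second <when>
        return "you"
    return None                          # no <otherwise>: no lemma is set


print(subject_lemma("p1", "sg"))
print(subject_lemma("p2", "sg"))
```

Covering the remaining persons and numbers would mean adding one more when element per combination, or an otherwise element as a fallback.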
Now let's compile and test:
<pre>
$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
4 rules processed.

$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
<SENTENCE ref="1" alloc="0">
<NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg">
  <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" nbr="ND" cas="acc">
    <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
      <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
      <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
        <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
      </NODE>
      <NODE si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/>
      <NODE si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg"/>
    </NODE>
    <NODE si="det" lem="the" pos="det|def" mi="sp"/>
  </NODE>
  <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  <NODE si="aux" lem="will" pos="vaux" mi="pri"/>
  <NODE si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg"/>
</NODE>
</SENTENCE>
</corpus>
</pre>
And looking at the tree:
Our tree is looking pretty English now! The only thing left to do is to correctly set the number of the noun to singular:
<pre>
<def-rule comment="Set number of nouns with ND and no dependent numeral to singular">
  <template match="//NODE[@pos = 'n' and not(.//NODE[@pos = 'num'])]">
    <copy>
      <attr name="nbr">sg</attr>
      <copy-of select="@*[name()!='nbr'] | *"/>
    </copy>
  </template>
</def-rule>
</pre>
and set the case of the pronoun with a dependent preposition to accusative:
<pre>
<def-rule comment="Set the case of personal pronouns with prepositions to accusative">
  <template match="//NODE[@pos = 'prn|pers' and .//NODE[@pos = 'pr']]">
    <copy>
      <attr name="cas">acc</attr>
      <copy-of select="@*[name()!='cas'] | *"/>
    </copy>
  </template>
</def-rule>
</pre>
These rules show a good pattern for changing the value of a given attribute: we first add the attribute with its new value, and then we copy all the other attributes (and the subnodes), leaving out the attribute that we've already added.
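The same "add the new value first, then copy everything else" pattern can be sketched in Python with ElementTree. This is an analogy to the two rules above rather than Matxin code; the helper name <code>set_attribute</code> and the sample node are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def set_attribute(node, name, value):
    """Return a copy of node with name=value added first, then all other attributes."""
    new = ET.Element(node.tag, {name: value})   # like <attr name="nbr">sg</attr>
    for key, old in node.attrib.items():
        if key != name:                         # like @*[name()!='nbr']
            new.set(key, old)
    new.extend(list(node))                      # like "| *": keep the subnodes
    return new

noun = ET.fromstring(
    '<NODE lem="beer" pos="n" nbr="ND"><NODE lem="the" pos="det|def"/></NODE>'
)
fixed = set_attribute(noun, "nbr", "sg")
print(fixed.get("nbr"), len(fixed))  # sg 1
```

Leaving out the final <code>extend</code> call would silently drop the dependents, which is the same mistake as forgetting <code>| *</code> in a <code>copy-of</code>.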
Now we have two more steps: first we need to reorder the tree, which we call linearisation, and then we need to generate the word forms in English using a morphological generator. First, on to linearisation...
Reordering and linearisation
The idea of linearisation is to impose an order on the nodes in the tree so that they are ready for generation and printing out. In the Matxin pipeline, the program that does this is called <code>matxin-linearise</code>. It currently takes a four-column tab-separated file which specifies the order of heads, dependents and siblings.
Let's create a new file called <code>matxin-tur-eng.tur-eng.l1x</code>, and add our first line:
<pre>
# HEAD      DEPENDENT     RELPOS  ORDER
si='root'   lem='.'       .*?     x1.x2
</pre>
This says that we want to linearise the pair consisting of the head "root" (i.e. the finite verb that heads the sentence) and a dependent whose lemma is "." in the order head followed by dependent. The variable <code>x1</code> stands for the head, and the variable <code>x2</code> stands for the dependent.
If we save this file and test it, we should see that the "." is ordered after the root node (the verb):
<pre>
$ cat input.txt | cg-proc -f 2 ../matxin-tur/tur.deprlx.bin | matxin-transfer tur-eng.t1x.bin |\
  matxin-linearise matxin-tur-eng.tur-eng.l1x
<corpus>
<SENTENCE ord="1" ref="1" alloc="0">
<NODE ord="10" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg">
  <NODE ord="7" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
    <NODE ord="5" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
      <NODE ord="0" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
      <NODE ord="2" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg">
        <NODE ord="1" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
      </NODE>
      <NODE ord="3" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/>
      <NODE ord="4" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom"/>
    </NODE>
    <NODE ord="6" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp"/>
  </NODE>
  <NODE ord="11" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  <NODE ord="8" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri"/>
  <NODE ord="9" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom"/>
</NODE>
</SENTENCE>
</corpus>
</pre>
It might be easier to visualise in a flat format:
Word | yesterday | for | me | that | you | bought | the | beer | will | I | drink | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ord | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
pos | adv | pr | pers | rel | pers | v | det | n | vaux | pers | v | sent |
si | advmod | case | nmod | dobj | nsubj | acl | det | dobj | aux | nsubj | root | punct |
The "." receives <code>ord="11"</code> and the root receives <code>ord="10"</code>, i.e. they have been properly ordered with respect to each other. Still, this sentence is far from adequately ordered...
Let's try something a bit more movey: let's move the direct object after the root, and the subject before it.
<pre>
# HEAD      DEPENDENT     RELPOS  ORDER
si='root'   lem='.'       .*?     x1.x2
si='root'   si='dobj'     .*?     x1.x2
si='root'   si='nsubj'    .*?     x2.x1
</pre>
So, that is a little better, it's still pretty mangled though.
Word | will | I | drink | yesterday | for | me | bought | that | you | the | beer | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ord | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Let's try moving the auxiliary "will" to right before the verb. Note that it is already before the verb, but we want to order it right next to the verb. For that we need another order statement: this time, instead of <code>x2.x1</code> to move the dependent before its head, we need <code>x2+x1</code>, which means "order <code>x2</code> right before <code>x1</code>".
<pre>
# HEAD      DEPENDENT     RELPOS  ORDER
si='root'   lem='.'       .*?     x1.x2
si='root'   si='dobj'     .*?     x1.x2
si='root'   si='nsubj'    .*?     x2.x1
si='root'   si='aux'      .*?     x2+x1
</pre>
Word | I | will | drink | yesterday | for | me | bought | that | you | the | beer | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
si (inside the object) | | | | advmod | case | nmod | acl | dobj | nsubj | det | dobj | |
si (under the root) | nsubj | aux | root | dobj (spans "yesterday" to "beer") | | | | | | | | punct |
ord | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
If we look at the tree, we will see that now we have the correct order of the dependents of the root node. But the order of the dependents of the direct object (e.g. principally the relative clause) is totally messed up still. Let's try and fix it. The first thing we want to do is put a relative clause after the noun it depends on.
<pre>
# HEAD      DEPENDENT     RELPOS  ORDER
si='root'   lem='.'       .*?     x1.x2
si='root'   si='dobj'     .*?     x1.x2
si='root'   si='nsubj'    .*?     x2.x1
si='root'   si='aux'      .*?     x2+x1
pos='n'     si='acl'      .*?     x1.x2
</pre>
Testing it, we get a bit better order:
Word | I | will | drink | the | beer | yesterday | for | me | that | you | bought | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ord | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Now let's try linearising the rest of the relative clause:
<pre>
# HEAD      DEPENDENT     RELPOS  ORDER
si='root'   lem='.'       .*?     x1.x2
si='root'   si='dobj'     .*?     x1.x2
si='root'   si='nsubj'    .*?     x2.x1
si='root'   si='aux'      .*?     x2+x1
pos='n'     si='acl'      .*?     x1.x2
si='acl'    si='nsubj'    .*?     x2.x1
si='acl'    si='nmod'     .*?     x1.x2
si='acl'    si='advmod'   .*?     x1.x2
</pre>
Which gives:
Word | I | will | drink | the | beer | that | you | bought | yesterday | for | me | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ord | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
So, there is only one weirdness left with the word order now: "yesterday" should come after "for me". The placement of adverbs in English can be a bit unpredictable, so we can either leave it like this, or we can change the <code>nmod</code> placement rule to place <code>nmod</code> directly after the verb, e.g.
si='acl' si='nmod' .*? x1+x2
So, let's run all of those rules on the XML and see what we come out with:
<pre>
$ cat input.txt | cg-proc -f 2 ../matxin-tur/tur.deprlx.bin | matxin-transfer tur-eng.t1x.bin |\
  matxin-linearise matxin-tur-eng.tur-eng.l1x
<corpus>
<SENTENCE ord="1" ref="1" alloc="0">
<NODE ord="2" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg">
  <NODE ord="4" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
    <NODE ord="7" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
      <NODE ord="11" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
      <NODE ord="9" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg">
        <NODE ord="8" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
      </NODE>
      <NODE ord="5" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/>
      <NODE ord="6" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom"/>
    </NODE>
    <NODE ord="3" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp"/>
  </NODE>
  <NODE ord="12" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
  <NODE ord="1" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri"/>
  <NODE ord="0" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom"/>
</NODE>
</SENTENCE>
</corpus>
</pre>
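To get a feel for how the order statements interact, the following Python toy model may help. It is emphatically not the algorithm used by <code>matxin-linearise</code>: the <code>Node</code> class, the rule tuples, the decision to ignore the RELPOS column, and the default of placing unmatched dependents before their head are all assumptions chosen so that the toy reproduces the word order shown in this section.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lem: str
    si: str
    pos: str
    children: list = field(default_factory=list)

# (head attribute, head value, dependent attribute, dependent value, order)
RULES = [
    ("si", "root", "lem", ".", "x1.x2"),
    ("si", "root", "si", "dobj", "x1.x2"),
    ("si", "root", "si", "nsubj", "x2.x1"),
    ("si", "root", "si", "aux", "x2+x1"),
    ("pos", "n", "si", "acl", "x1.x2"),
    ("si", "acl", "si", "nsubj", "x2.x1"),
    ("si", "acl", "si", "nmod", "x1+x2"),
    ("si", "acl", "si", "advmod", "x1.x2"),
]

def order_for(head, dep):
    """Return the order statement of the first rule matching this head/dependent pair."""
    for h_attr, h_val, d_attr, d_val, order in RULES:
        if getattr(head, h_attr) == h_val and getattr(dep, d_attr) == d_val:
            return order
    return "x2.x1"  # toy default (assumption): dependent somewhere before its head

def linearise(node):
    """Flatten a subtree into a list of lemmas according to RULES."""
    slots = {"x2.x1": [], "x2+x1": [], "x1+x2": [], "x1.x2": []}
    for child in node.children:
        slots[order_for(node, child)].extend(linearise(child))
    # far-before, right-before, head, right-after, far-after
    return slots["x2.x1"] + slots["x2+x1"] + [node.lem] + slots["x1+x2"] + slots["x1.x2"]

# The example tree, with dependents in the same order as in the XML above.
me = Node("I", "nmod", "prn|pers", [Node("for", "case", "pr")])
buy = Node("buy", "acl", "v", [
    Node("yesterday", "advmod", "adv"), me,
    Node("that", "dobj", "rel"), Node("you", "nsubj", "prn|pers"),
])
beer = Node("beer", "dobj", "n", [buy, Node("the", "det", "det|def")])
root = Node("drink", "root", "v", [
    beer, Node(".", "punct", "sent"),
    Node("will", "aux", "vaux"), Node("I", "nsubj", "prn|pers"),
])

print(" ".join(linearise(root)))
# I will drink the beer that you buy for I yesterday .
```

The output consists of lemmas ("buy", "I") rather than word forms ("bought", "me") because inflection is the job of the generation step, not of linearisation.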
Generation
In Matxin, generation is done by the program <code>matxin-generate</code>, which takes two arguments: a file with a cascade of stylesheets and a compiled finite-state transducer. The cascade is used to organise the attributes of the XML into feature strings suitable to be passed to the finite-state transducer to generate the morphological forms.
Morphological dictionary
So, what does a morphological dictionary look like? Again, that is mostly outside of the scope of this howto, but for the sake of ease of copy/paste, let's go through it here, taking an lttoolbox dictionary as an example.
To start with, we change into the directory <code>matxin-eng</code> and we create a new file <code>matxin-eng.eng.dix</code>. The file will have the skeleton structure:
<pre>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="mi"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
  </section>
</dictionary>
</pre>
This structure will seem familiar if you read the section on lexical transfer (if you didn't read it, it's up there). Instead of translating between source words and target words, the morphological generator translates between lexical forms (combinations of lemmas and tags) and surface forms. Let's take a look at the words we need to generate:
Lemma | POS | Forms |
---|---|---|
beer | <n> | beer, beers |
buy | <v> | buy, buys, bought, bought |
drink | <v> | drink, drinks, drank, drunk |
for | <pr> | for |
I | <prn> | I, me |
the | <det> | the |
will | <vaux> | will, would |
yesterday | <adv> | yesterday |
you | <prn> | you, you |
Given these words there isn't much we can do paradigmatically: each word needs a separate paradigm. So let's just start with the noun, "beer". The paradigm is going to be:
<pre>
<pardef n="beer__n">
  <e><p><l></l><r><s n="mi"/>n|sg</r></p></e>
  <e><p><l>s</l><r><s n="mi"/>n|pl</r></p></e>
</pardef>
</pre>
and then the entry in the <code>main</code> section:
<e lm="beer"><i>beer</i><par n="beer__n"/></e>
Save the dictionary, and go to the terminal. You can do two things. First, compile the dictionary using:
<pre>
$ lt-comp rl matxin-eng.eng.dix eng.autogen.bin
main@standard 14 14
</pre>
Second, print out all the strings recognised by the dictionary using <code>lt-expand</code>:
<pre>
$ lt-expand matxin-eng.eng.dix
beer:beer<mi>n|sg
beers:beer<mi>n|pl
</pre>
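Conceptually, a paradigm is just a list of (surface suffix, tags) pairs that gets attached to a stem. The Python sketch below mimics that for the <code>beer__n</code> paradigm and prints lines in the same shape as the <code>lt-expand</code> output above. It is a mental model only and says nothing about how lttoolbox works internally; the names <code>BEER_N</code> and <code>expand</code> are made up for the example.

```python
# Each pair is (suffix added to the surface form, tags that follow the <mi> symbol).
BEER_N = [("", "n|sg"), ("s", "n|pl")]

def expand(stem, paradigm):
    """List surface:lexical pairs for one stem, in the style of lt-expand."""
    return [f"{stem}{suffix}:{stem}<mi>{tags}" for suffix, tags in paradigm]

print("\n".join(expand("beer", BEER_N)))
# beer:beer<mi>n|sg
# beers:beer<mi>n|pl
```

Irregular words such as "buy" or "I" do not share a stem across all their forms, which is why each of them ends up needing its own paradigm.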
The remainder of the vocabulary is left as an exercise for the reader.
Generation rules
Generation rules take a node and its attributes and produce a new attribute, <code>mi</code>, that has all the information necessary to pass to the morphological generator. They are written in the same XSLT format as the transfer rules.
We start out with a file in the <code>matxin-eng</code> directory, let's call it <code>matxin-eng.eng.gnx</code>:
<pre>
<generate>
</generate>
</pre>
Then we add a rule to generate the <code>mi</code> attribute for nouns:
<pre>
<def-rule comment="Generate the morphological information for nouns">
  <template match="//NODE[@pos = 'n']">
    <copy>
      <attr name="mi"><value-of select="concat(@pos, '|', @nbr)"/></attr>
      <copy-of select="@*[name()!='mi'] | *"/>
    </copy>
  </template>
</def-rule>
</pre>
This rule basically says: match all nouns, <code>//NODE[@pos = 'n']</code>, and create a new attribute, <code>mi</code>, which is the concatenation of the attribute <code>pos</code>, the string literal <code>|</code> and the attribute <code>nbr</code>. The result of this concatenation for the node containing <code>lem="beer" pos="n" nbr="sg"</code> will be <code>mi="n|sg"</code>.
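For comparison, here is roughly the same operation written in Python with ElementTree. It is a sketch for illustration, not part of Matxin; the function name <code>add_noun_mi</code> is invented, and a missing <code>nbr</code> is treated as an empty string, which is how XPath's <code>concat()</code> treats an attribute that does not exist.

```python
import xml.etree.ElementTree as ET

def add_noun_mi(root):
    """Give every noun NODE an mi attribute built from pos and nbr."""
    for node in root.iter("NODE"):
        if node.get("pos") == "n":
            # Same as concat(@pos, '|', @nbr) in the rule above.
            node.set("mi", node.get("pos") + "|" + node.get("nbr", ""))
    return root

sentence = ET.fromstring(
    '<SENTENCE><NODE lem="drink" pos="v">'
    '<NODE lem="beer" pos="n" nbr="sg"/></NODE></SENTENCE>'
)
add_noun_mi(sentence)
print(sentence.find(".//NODE[@lem='beer']").get("mi"))  # n|sg
```

The verb is left untouched because it does not match the noun condition; it would need a generation rule of its own.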
If we save the file and compile it:
<pre>
$ matxin-preprocess-generate matxin-eng.eng.gnx eng.gnx.bin
1 rules processed.
</pre>
We can now test it in the whole pipeline... first switch directory to <code>matxin-tur-eng</code>, then:
<pre>
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin |\
  matxin-linearise matxin-tur-eng.tur-eng.l1x | matxin-generate ../matxin-eng/eng.gnx.bin ../matxin-eng/eng.autogen.bin
<?xml version="1.0"?>
<corpus>
<SENTENCE ord="1" ref="1" alloc="0">
<NODE ord="2" ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg" form="%drink|v|tv|fut|p1|sg">
  <NODE mi="n|sg" ord="4" nbr="sg" ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc" form="beer">
    <NODE ord="7" ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg" form="%buy|v|tv|gpr_past|px2sg">
      <NODE ord="11" ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv" form="%yesterday|adv"/>
      <NODE ord="9" cas="acc" ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" form="%I|prn|pers|p1|sg|gen">
        <NODE ord="8" ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr" form="%for|post"/>
      </NODE>
      <NODE ord="5" ref="new-0" si="dobj" lem="that" pos="rel" ani="an" nbr="sp" form="=that"/>
      <NODE ord="6" ref="new-1" si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg" cas="nom" form="=you"/>
    </NODE>
    <NODE ord="3" ref="new-2" si="det" lem="the" pos="det|def" nbr="sp" form="=the"/>
  </NODE>
  <NODE ord="12" ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent" form="%.|sent"/>
  <NODE ord="1" ref="new-3" si="aux" lem="will" pos="vaux" tns="pri" form="=will"/>
  <NODE ord="0" ref="new-4" si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="nom" form="=I"/>
</NODE>
</SENTENCE>
</corpus>
</pre>
It's a bit difficult to pick out from the XML, but you can see that some nodes have a new attribute <code>form</code> and one node (the noun, "beer") has a new attribute <code>mi</code>.
- If there is an <code>mi</code> attribute, and the form can be generated by the morphological generator, then you get the correctly generated form.
- If there is an <code>mi</code> attribute, and the form cannot be generated by the morphological generator, then you get the symbol <code>#</code> followed by the target language lemma concatenated with the target language morphological information.
- If there is no <code>mi</code> attribute, then the form is the symbol <code>%</code> followed by the target language lemma concatenated with the source language morphological information. This tells you that you need to write a generation rule to correctly build the <code>mi</code> attribute.
- If there is neither an <code>smi</code> nor an <code>mi</code> attribute, then the form is <code>=</code> followed by the target language lemma.
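The four cases above can be summarised as a small decision function. The Python below is a sketch of that decision logic only, not the code of <code>matxin-generate</code>: the generator is modelled as a plain dictionary, the function name <code>surface_form</code> is invented, and the <code>|</code> separator in the <code>#</code> case is an assumption made by analogy with the <code>%</code> forms seen in the output.

```python
def surface_form(node, generator):
    """Decide the form attribute for one node, following the four cases above."""
    lem = node["lem"]
    if "mi" in node:
        generated = generator.get((lem, node["mi"]))
        if generated is not None:
            return generated                      # case 1: generation succeeded
        return "#" + lem + "|" + node["mi"]       # case 2: generator has no such entry
    if "smi" in node:
        return "%" + lem + "|" + node["smi"]      # case 3: no generation rule built mi
    return "=" + lem                              # case 4: node was added during transfer

# Stand-in for the compiled dictionary: (lemma, mi) -> surface form.
generator = {("beer", "n|sg"): "beer", ("beer", "n|pl"): "beers"}

print(surface_form({"lem": "beer", "mi": "n|sg", "smi": "n|acc"}, generator))  # beer
print(surface_form({"lem": "drink", "smi": "v|tv|fut|p1|sg"}, generator))      # %drink|v|tv|fut|p1|sg
print(surface_form({"lem": "that"}, generator))                                # =that
```

Reading the prefixes this way tells you where to look next: <code>#</code> points at the dictionary, <code>%</code> at the generation rules, and <code>=</code> at nodes that were created by transfer rules.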
The reformatter
The reformatter basically takes the XML and iterates over the tree and outputs the forms of the nodes, so if we run it on the above tree:
<pre>
$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin |\
  matxin-linearise matxin-tur-eng.tur-eng.l1x | matxin-generate ../matxin-eng/eng.gnx.bin ../matxin-eng/eng.autogen.bin |\
  matxin-reformat
=I =will %drink|v|tv|fut|p1|sg =the beer =that =you %buy|v|tv|gpr_past|px2sg %for|post %I|prn|pers|p1|sg|gen %yesterday|adv %.|sent
</pre>
This is far from an adequate sentence, but fixing it basically means solving the remaining problems one by one. For example, the words prefixed with <code>%</code> and <code>=</code> probably need to be added to the morphological generator (<code>matxin-eng.eng.dix</code>) and have generation rules written for them (in <code>matxin-eng.eng.gnx</code>).
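What the reformatter does can be approximated in a few lines of Python: collect every node, sort by <code>ord</code>, and print the <code>form</code> attributes. This is a rough sketch under those assumptions, not the source of <code>matxin-reformat</code>; the function name <code>reformat</code> and the cut-down sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET

def reformat(xml_text):
    """Print-ready text: one line per SENTENCE, forms joined in ord order."""
    corpus = ET.fromstring(xml_text)
    lines = []
    for sentence in corpus.iter("SENTENCE"):
        # Nodes are nested by dependency, so the surface order comes from ord.
        nodes = sorted(sentence.iter("NODE"), key=lambda n: int(n.get("ord")))
        lines.append(" ".join(n.get("form", "") for n in nodes))
    return "\n".join(lines)

sample = """<corpus><SENTENCE ord="1">
<NODE ord="2" form="%drink|v|tv|fut|p1|sg">
  <NODE ord="4" form="beer"><NODE ord="3" form="=the"/></NODE>
  <NODE ord="1" form="=will"/>
  <NODE ord="0" form="=I"/>
</NODE></SENTENCE></corpus>"""

print(reformat(sample))
# =I =will %drink|v|tv|fut|p1|sg =the beer
```

Note that <code>iter("NODE")</code> does not return the SENTENCE element itself, so its own <code>ord</code> attribute does not interfere with the sorting.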
Troubleshooting
Nodes aren't output
Sometimes you'll find that nodes that you want to be output are not output. Check your <code>copy-of</code> and <code>apply-templates</code> statements. If you have:
<copy-of select="@*"/>
You should change it to:
<copy-of select="@* | *"/>
The <code>@* | *</code> means "all attributes and all subnodes".
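The effect of leaving out <code>| *</code> is easy to reproduce outside of Matxin. The Python sketch below is only an analogy using ElementTree, with a made-up sample node: copying the attributes alone gives a node with no dependents, whereas copying attributes and subnodes keeps the whole subtree.

```python
import copy
import xml.etree.ElementTree as ET

node = ET.fromstring(
    '<NODE lem="beer" pos="n"><NODE lem="the" pos="det|def"/></NODE>'
)

# Analogue of select="@*": only the attributes are carried over.
attrs_only = ET.Element(node.tag, dict(node.attrib))

# Analogue of select="@* | *": attributes and all subnodes are carried over.
attrs_and_children = copy.deepcopy(node)

print(len(attrs_only), len(attrs_and_children))  # 0 1
```

In the first copy the determiner has disappeared, which is exactly the "nodes aren't output" symptom.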
Notes
- [1] Consider in English the difference between:
  They congratulated the girl that<sub>nsubj</sub> graduated yesterday.
  and
  They drank the beer that<sub>dobj</sub> she bought yesterday.