Matxin New Language Pair HOWTO

This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.

Preliminaries

Make a directory called matxin-tur-eng. Then make two more directories matxin-tur and matxin-eng.

Note that if you are following this HOWTO for your own language pair, then tur should be replaced by the ISO 639-3 code of the source language and eng by the ISO 639-3 code of the target language.

Analysis

There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO; for more information, see the pages on those topics elsewhere on the Apertium wiki.

So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:

^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Save this output into a file, perhaps called input.txt. We'll need it later.
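
If you want to poke at the stream format above before writing any rules, here is a minimal Python sketch that splits a disambiguated line into surface form, lemma and tags. It is not part of Matxin or Apertium; the name parse_stream is made up for illustration, and it assumes exactly one analysis per token, as in the output above.

```python
import re

# One lexical unit looks like ^surface/lemma<tag1><tag2>$
UNIT = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")

def parse_stream(line):
    """Return a list of (surface, lemma, [tags]) tuples (hypothetical helper)."""
    tokens = []
    for surface, lemma, tags in UNIT.findall(line):
        tokens.append((surface, lemma, re.findall(r"<([^>]+)>", tags)))
    return tokens

sample = "^Dün/dün<adv>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$"
for token in parse_stream(sample):
    print(token)
```

Ambiguous tokens (several analyses separated by /) would need a more careful parser than this.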

Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:

DELIMITERS = "." ;

Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.

LIST Adv = adv;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST N = n ;
LIST Acc = acc;
LIST Gen = gen;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms

Etc. It might seem a bit odd to just repeat each tag in title case, but doing so lets us distinguish CG set names from the tags that come from the morphological analyser.

So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:

LIST @root = @root ;     # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ;   # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ;     # The relation of an adposition to its head
LIST @acl = @acl ;       # A clause which modifies a nominal
LIST @nmod = @nmod ;     # Nominal modifier 
LIST @dobj = @dobj ;     # The direct object of the sentence
LIST @punct = @punct ;   # Any punctuation
LIST @dep = @dep ;       # Any remaining dependency

Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.

After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:

SECTION

In constraint grammar, all rules come in sections.

So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:

MAP @advmod TARGET Adv ;

The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.

So now let's save the file and try it out! First though we need to compile the rules:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29

And now try it out:

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Great, it works... so now we can do basically the same thing for the verbal adjective aldığın, which should get an @acl tag; the postposition, which should get a @case tag; the accusative noun, which should get a @dobj tag; and the full stop, which should get a @punct tag.

MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;

Save it and try it again:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$

Great, now for a couple of harder relations: the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:

MAP @nmod TARGET Pers IF (1 Post) ;

And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".

MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ; 

So try those two rules out and we should have a fully labelled input sentence:

^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$

That covers our sentence, but we want to add one more rule "just in case" ... and that is a default rule that adds the relation @dep to any token that isn't covered by the other rules:

MAP (@dep) TARGET (*) ;
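
To see how the mapping rules above interact, here is a rough Python sketch of the same logic. It is only an approximation of what CG does (real CG works on readings and cohorts, not plain tag sets), and the function name label is made up for illustration. The if/elif chain mirrors the order of the MAP rules in the file, on the assumption that a token which has already received a relation is not mapped again.

```python
FIN = {"fut", "aor", "past"}  # same tags as LIST Fin

def label(tokens):
    """tokens is a list of tag sets; returns one relation label per token."""
    labels = []
    for i, tags in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else set()
        fin_to_left = any(t & FIN for t in tokens[:i])
        if "adv" in tags:
            rel = "@advmod"
        elif "post" in tags:
            rel = "@case"
        elif "gpr_past" in tags:
            rel = "@acl"
        elif "acc" in tags:
            rel = "@dobj"
        elif "sent" in tags:
            rel = "@punct"
        elif {"prn", "pers"} <= tags and "post" in nxt:          # IF (1 Post)
            rel = "@nmod"
        elif tags & FIN and not fin_to_left and "sent" in nxt:   # IF (NOT -1* Fin) (1 Sent)
            rel = "@root"
        else:
            rel = "@dep"                                         # default rule
        labels.append(rel)
    return labels

sentence = [
    {"adv"},                              # Dün
    {"prn", "pers", "p1", "sg", "gen"},   # benim
    {"post"},                             # için
    {"v", "tv", "gpr_past", "px2sg"},     # aldığın
    {"n", "acc"},                         # birayı
    {"v", "tv", "fut", "p1", "sg"},       # içeceğim
    {"sent"},                             # .
]
print(label(sentence))
```

The labels come out in the same order as in the fully labelled sentence above.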

Tree building

Now that we have a fully labelled sentence, we can start building the tree. First make another section:

SECTION

The first thing we want to do is attach the root node:

SETPARENT @root TO (@0 (*)) ;

This basically says, set the parent of the root node to node 0, which is the invisible root that CG uses. Next we want to attach all of the rest of the nodes to this root node:

SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;

Here we have an extra condition (NEGATE p (*)), which means that the matched node should not already have a parent.

Let's test the rules, so save the file, and go to the terminal:

$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/>
      <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
      <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"/>
      <NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

Note that we have passed the -f 2 option to cg-proc, so the output is now in Matxin XML format.

This is a pretty sad looking tree :( ... But fortunately using rules we can make it a happy tree! One that correctly implements the annotation guidelines.

So, let's think of some rules:

  • The direct object should depend on the finite verb
  • A postposition should depend on its head
  • A relative clause should modify (depend on) a noun
  • An adverb should modify a verb

Let's start with the first one:

SETPARENT @dobj TO (1* Fin) ;

This rule says that the parent of the direct object should be the nearest finite verb anywhere to its right. Next up:

SETPARENT @case TO (-1 Pers) ;

Set the parent of the word with the @case label to be the immediately preceding word, provided it is a personal pronoun. And then:

SETPARENT @acl TO (1 N) ; 

Set the parent of the word with the @acl label to be the immediately following word, provided it is a noun.

Let's save the file and apply the rules:

$ cat input.txt  | cg-proc -f 2 tur.deprlx.bin 
<corpus>
  <SENTENCE ord="X" alloc="Y">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
        <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
      </NODE>
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase benim için and the adverb Dün to the appropriate verb, which in this case is the head of the relative clause aldığın.

So we could specify the rule something like: Set the parent of a nominal modifier or an adverb to be the first verb to the right. This rule happens to work in this case, but is not very robust.

SETPARENT @advmod TO (1* V BARRIER V) ;

SETPARENT @nmod TO (1* V BARRIER V) ;

The 1* X BARRIER Y instruction here means that the parser should read to the right looking to match context X (a verb), but it should stop if it finds context Y (in this case another verb).
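
The scanning behaviour just described can be sketched in a few lines of Python. This is an illustration only, not how CG is implemented; scan_right is a made-up name, and the sketch assumes that on each token the target is tested before the barrier, which is what lets 1* V BARRIER V pick the nearest verb rather than nothing at all.

```python
def scan_right(tokens, start, target, barrier=None):
    """Mimic `1* target BARRIER barrier` from position start.

    tokens is a list of tag sets. Returns the index of the first token to the
    right that carries the target tag, or None if a barrier token (or the end
    of the sentence) is reached first.
    """
    for i in range(start + 1, len(tokens)):
        if target in tokens[i]:
            return i
        if barrier is not None and barrier in tokens[i]:
            return None
    return None

sentence = [
    {"adv"},                              # 0 Dün
    {"prn", "pers", "p1", "sg", "gen"},   # 1 benim
    {"post"},                             # 2 için
    {"v", "tv", "gpr_past", "px2sg"},     # 3 aldığın
    {"n", "acc"},                         # 4 birayı
    {"v", "tv", "fut", "p1", "sg"},       # 5 içeceğim
    {"sent"},                             # 6 .
]
print(scan_right(sentence, 0, "v", "v"))  # head for Dün
print(scan_right(sentence, 1, "v", "v"))  # head for benim
```

Both calls land on aldığın (index 3), which is why the adverb and the nominal modifier end up under the relative clause in the tree.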

We can test these rules and see the output:

$ cat input.txt  | cg-proc -f 2 tur.deprlx.bin 
<corpus>
  <SENTENCE ord="X" alloc="Y">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl">
          <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
          <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
            <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

Yesss! Now we have a nice tree ready to be translated!

Lexical transfer

The first stage of transfer is lexical transfer. This is where we take the tree that we have just constructed, and we translate the words, and convert the morphological information into attributes in the tree. This is done using an lttoolbox dictionary in a form that will be familiar to those who have used Apertium before. There are however a number of differences which will be explained below.

In any case, first we need to change directory to matxin-tur-eng and to make a new file, matxin-tur-eng.tur-eng.dix. In this file we start by defining our attributes:


<dictionary>
  <sdefs>
    <sdef n="mi"     c="Morphological information"/>
    <sdef n="pos"    c="Part of speech"/>
    <sdef n="nbr"    c="Number"/>
    <sdef n="prs"    c="Person"/>
    <sdef n="cas"    c="Case"/>
    <sdef n="tns"    c="Tense"/>
    <sdef n="vtype"  c="Verb type"/>
    <sdef n="val"    c="Valency"/>
  </sdefs> 

</dictionary>

The attributes are defined by <sdef> tags; sdef stands for symbol definition (in this case, each symbol is an attribute). After this we can add a new section to hold our first entry:

  <section id="main" type="standard">
  
  </section>

And within that section, our entry:

    <e><p><l>dün<s n="mi"/>adv</l><r>yesterday<s n="pos"/>adv</r></p></e>

So now we can save this file and compile it:

$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 14 13

Note that the lr means to compile the dictionary (which is in fact a finite-state transducer) from left to right, translating from Turkish to English. Unlike in Apertium, in Matxin the bilingual dictionaries are unidirectional.

You can try it out using the matxin-xfer-lex command:


$ cat input.txt | cg-proc -f2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin 

<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="bira" mi="n|acc" unknown="transfer">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="için" mi="post" unknown="transfer"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." mi="sent" unknown="transfer"/>
    </NODE>
  </SENTENCE>
</corpus>

You'll note that the XML output has substantially changed. Some of the attributes have been renamed, and some new attributes have been created. Probably the most obvious thing is that all of the words except dün "yesterday" have a new attribute unknown with the value transfer. This means that the word could not be found in the bilingual dictionary, which is not surprising as we only added one word. How about the other attributes?

  • mi → smi: Morphological information is now "source morphological information"
  • lem → slem: Lemma is now "source lemma"
  • UpCase: To do with getting casing right
  • mi: This is copied from the smi
  • lem: With unknown words this is copied from the source, otherwise the translation is inserted.
  • pos: This is the attribute that we defined in our bilingual dictionary.

So with that in mind we can start translating the other words:

    <e><p><l>için<s n="mi"/>post</l><r>for<s n="pos"/>pr</r></p></e>
    <e><p><l>.<s n="mi"/>sent</l><r>.<s n="pos"/>sent</r></p></e>

These two are easy: they work just like the adverb. For categories that inflect, however, things get a touch more complex, as we need to convert the morphological information into feature attributes in the XML. We do this using paradigms.

Paradigms

Paradigms are used to convert strings of input tags into features for the tree. The section they belong in goes after the sdefs section and before the section id="main". Let's start looking at nouns:

  <pardefs>
    <pardef n="n__n"> 
      <e><p><l>|nom</l><r><s n="cas"/>nom</r></p></e>
      <e><p><l>|acc</l><r><s n="cas"/>acc</r></p></e>
    </pardef>
  </pardefs>

This paradigm converts the strings |nom and |acc into the attribute cas with the values of nom and acc respectively.

Now that we've defined the paradigm, we can try using it. Go back to the main section, and add:

    <e><p><l>bira<s n="mi"/>n</l><r>beer<s n="pos"/>n</r></p><par n="n__n"/></e>

Then compile the dictionary:

$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 38 41

And test it:


$ cat  input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>
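
One way to picture what just happened to birayı is as plain string matching: the entry and the paradigm are glued together on both sides, and the right-hand string is then cut up into attributes. The Python sketch below is only a mental model under that assumption, not how lttoolbox or matxin-xfer-lex actually work internally; translate, ENTRY and N__N are made-up names, and writing the symbols as <mi>, <pos> and <cas> is just a convenient notation.

```python
import re

# Left side and right side of the bira entry, and the n__n paradigm,
# spelled as flat strings (illustrative notation only).
ENTRY = ("bira<mi>n", "beer<pos>n")
N__N = [("|nom", "<cas>nom"), ("|acc", "<cas>acc")]

def translate(lemma, mi):
    """Return the target attributes for lemma + mi, or None if not found."""
    source = f"{lemma}<mi>{mi}"
    for left_suffix, right_suffix in N__N:
        if source == ENTRY[0] + left_suffix:
            target = ENTRY[1] + right_suffix          # e.g. beer<pos>n<cas>acc
            lem, rest = target.split("<", 1)
            attrs = dict(re.findall(r"<(\w+)>([^<]+)", "<" + rest))
            return {"lem": lem, **attrs}
    return None  # would show up as unknown="transfer"

print(translate("bira", "n|acc"))
print(translate("bira", "n|dat"))
```

The first call gives the same lem, pos and cas values as the birayı node in the output above; the second finds nothing because the paradigm has no entry for that tag.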

As we move on to verb forms, it's worth noting that paradigms can call other paradigms. For example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms like:


    <pardef n="num"> 
      <e><p><l>|sg</l><r><s n="nbr"/>sg</r></p></e>
      <e><p><l>|pl</l><r><s n="nbr"/>pl</r></p></e>
    </pardef>

    <pardef n="pers"> 
      <e><p><l>|p1</l><r><s n="prs"/>p1</r></p><par n="num"/></e>
      <e><p><l>|p2</l><r><s n="prs"/>p2</r></p><par n="num"/></e>
      <e><p><l>|p3</l><r><s n="prs"/>p3</r></p><par n="num"/></e>
    </pardef>

    <pardef n="tense">
      <e><p><l>|fut</l><r><s n="vtype"/>fin<s n="tns"/>fut</r></p><par n="pers"/></e>
    </pardef>

    <pardef n="tv__v">
      <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
    </pardef>

Then in the main section:

    <e><p><l>iç<s n="mi"/>v</l><r>drink<s n="pos"/>v</r></p><par n="tv__v"/></e>
    <e><p><l>al<s n="mi"/>v</l><r>buy<s n="pos"/>v</r></p><par n="tv__v"/></e>

This will take care of the finite verb, içeceğim "I will drink", but what do we do with the non-finite verb aldığın "that you bought"? The structure is very different in Turkish and English, and although changing the structure is part of structural transfer, we need to get the attributes in order so that we can do transfer properly. So in this case what we can do is have a paradigm set up something like:

    <pardef n="px__pers">
       <e><p><l>|px1sg</l><r><s n="prs"/>p1<s n="nbr"/>sg</r></p></e>
       <e><p><l>|px2sg</l><r><s n="prs"/>p2<s n="nbr"/>sg</r></p></e>
       <e><p><l>|px3sg</l><r><s n="prs"/>p3<s n="nbr"/>sg</r></p></e>
       <e><p><l>|px1pl</l><r><s n="prs"/>p1<s n="nbr"/>pl</r></p></e>
       <e><p><l>|px2pl</l><r><s n="prs"/>p2<s n="nbr"/>pl</r></p></e>
       <e><p><l>|px3pl</l><r><s n="prs"/>p3<s n="nbr"/>pl</r></p></e>
    </pardef>

    <pardef n="nonfin">
       <e><p><l>|gpr_past</l><r><s n="vtype"/>gpr<s n="tns"/>past</r></p><par n="px__pers"/></e>
    </pardef>

And then we update the previously defined tv__v paradigm thusly:

    <pardef n="tv__v">
      <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
      <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="nonfin"/></e>
    </pardef>

So, what does the code above do? We turn the non-finite style agreement markers (using the possessive morpheme, px1sg, px2sg, etc.) into finite agreement (p1.sg, p2.sg, etc.), and we set the verb type to verbal adjective, gpr. Then in the structural transfer, we will be able to match the verb type attribute when it is gpr and use the other attributes to construct a finite relative clause in English.

Let's save the file and compile and test again...

$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 75 86

$ cat  input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>

Looking good, now there is only one remaining item, the personal pronoun ben "I".

Personal pronouns can be quite idiosyncratic to translate between different languages, so we're just going to add a new paradigm for it:

    <pardef n="ben__I"> 
      <e><p><l>|p1|sg|nom</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>nom</r></p></e>
      <e><p><l>|p1|sg|acc</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>acc</r></p></e>
      <e><p><l>|p1|sg|gen</l><r><s n="prs"/>p1<s n="nbr"/>sg<s n="cas"/>gen</r></p></e>
    </pardef>

And then the entry in the main section:

      <e><p><l>ben<s n="mi"/>prn|pers</l><r>I<s n="pos"/>prn|pers</r></p><par n="ben__I"/></e>

We set the pos specifically as personal pronoun instead of just pronoun because personal pronouns often need to be treated differently from other pronoun types. The final output we should get is:


<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>

And now we're ready for structural transfer! :D

Structural transfer

In Matxin, structural transfer is done by applying a cascade of rules written in the XSLT programming language. XSLT is basically a complete language for tree transformations. This HOWTO will not give a complete overview of the language... there would be far too much to write. But you can use a search engine to find out about it.

But before we start working on encoding the rules as XSLT, we should get clear what we want to do in terms of linguistics. To help us with that, let's take a look at two trees:

[Figure: dependency tree of the Turkish sentence]
[Figure: dependency tree of the English translation]

So, given these two trees, we can discover some obvious stuff, like:

  • A definite accusative noun in Turkish, birayı, should get a definite article in English.
  • The synthetic future in Turkish, içeceğim, should become the analytic "will drink" in English.
  • Subject pronouns should be added for both the main clause and the relative clause.
  • A relative pronoun should be added to stand in for the direct object in the relative clause headed by aldığın.


Starting out

Let's create a new file called matxin-tur-eng.tur-eng.t1x:

<transfer>

</transfer>

Rules are defined as XSLT templates within a def-rule tag, so let's start one:

  <def-rule comment="Add definite article if overt accusative">

  </def-rule>

The comment is free-form; you can write what you like.

Now let's get to the main part of the rule. We want to match any NODE in the tree where the pos attribute is n and the cas attribute is acc. We restrict the match to nouns because in Turkish pronouns may also be in the accusative (in which case we don't want to add a definite article), and complement clauses are also often marked with accusative.

Matching in XSLT is done with XPath, a language for selecting sets of nodes in a tree. Expressing the pattern above:

    <template match="//NODE[@pos = 'n' and @cas = 'acc']">

    </template>

Attributes are referred to prefixed with @, the = is equal-to and not assignment, // means search the whole tree. So, what do we want to do when we've found this set of nodes? We want to basically copy the nodes and add a new dependent node for the definite article:

    <template match="//NODE[@pos = 'n' and @cas = 'acc']">
      <copy>
         <apply-templates select="@* | *"/>
         <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
      </copy> 
    </template>

This is one pattern for working with transfer in Matxin... The copy directive means that the output tree will have the nodes that are in the source tree (by applying the apply-templates instruction), and in addition it will have a subnode which is the definite article. In the apply-templates instruction, the select="@* | *" part means that it should be applied to all attributes (@*) and also to all subnodes (*).
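
If you would like to try the match-and-add idea without the Matxin toolchain, the following Python sketch does something comparable with the standard library's xml.etree.ElementTree. It is not XSLT and not what matxin-transfer runs; the tiny input tree is invented for the example. ElementTree only supports a subset of XPath with no "and", so the two conditions are written as two chained predicates.

```python
import xml.etree.ElementTree as ET

# A cut-down tree: one accusative noun and one accusative pronoun.
xml = """<SENTENCE>
  <NODE lem="drink" pos="v">
    <NODE lem="beer" pos="n" cas="acc"/>
    <NODE lem="I" pos="prn|pers" cas="acc"/>
  </NODE>
</SENTENCE>"""

root = ET.fromstring(xml)

# Match NODEs that are nouns AND accusative, then add the article as a child.
for noun in root.findall(".//NODE[@pos='n'][@cas='acc']"):
    ET.SubElement(noun, "NODE",
                  {"si": "det", "lem": "the", "pos": "det|def", "nbr": "sp"})

print(ET.tostring(root, encoding="unicode"))
```

Only the noun gets an article; the accusative pronoun is left alone, which is the reason for matching on pos as well as cas.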

Let's save the rule, and compile it:

$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
1 rules processed.

And then we can apply the rule using:

$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" tns="fut" prs="p1" nbr="sg">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
          <NODE si="det" lem="the" pos="det|def" nbr="sp"/>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>

We can use a very similar rule to get the auxiliary verb "will" added when the verb is finite and in future tense:

  <def-rule comment="Add auxiliary verb 'will' if the head verb is finite and in the future tense">
    <template match="//NODE[@pos = 'v' and @vtype = 'fin' and @tns = 'fut']">
      <copy>
         <apply-templates select="@* | *"/>
         <NODE si="aux" lem="will" pos="vaux" tns="pres"/>
      </copy>
    </template>
  </def-rule>

Save the file and compile the rules again:

$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
2 rules processed.

So, let's see what our Turkish tree looks like now:

[Figure: Ex-dep-graph.tur2.svg]

Propagating information between nodes

It's looking good. Now let's try adding the relative pronoun. This is slightly more complicated, as we want the function of the relative pronoun in relation to the head of its clause to be the same as the function of the word it modifies in relation to the matrix clause. In XPath we can use .. to refer to the parent node.

  <def-rule comment="Add a relative pronoun if the verb type is relative clause">
    <template match="//NODE[@pos = 'v' and @vtype = 'gpr']">
      <copy>
        <apply-templates select="@* | *"/>
        <NODE><attr name="si"><value-of select="../@si"/></attr>
              <attr name="lem">that</attr>
              <attr name="pos">rel</attr>
              <attr name="ani">an</attr>
              <attr name="nbr">sp</attr></NODE>
      </copy>
    </template>
  </def-rule>

The attributes are used for generation. ani is for animacy: some relative pronouns in English can only be used with animates (e.g. "who"), and others with both animates and inanimates (e.g. "that", "which"). The sp value for nbr means that it is the same form for singular and plural.
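
The "copy the parent's function" step can also be tried in plain Python. This is a sketch with xml.etree.ElementTree rather than XSLT, on an invented three-node tree. ElementTree elements do not know their parent, so the sketch first builds a child-to-parent map (parent_of is a made-up name) to play the role of .. in XPath.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<NODE lem="drink" pos="v" si="root">
  <NODE lem="beer" pos="n" si="dobj">
    <NODE lem="buy" pos="v" vtype="gpr" si="acl"/>
  </NODE>
</NODE>""")

# Stand-in for the XPath parent axis: map every child to its parent.
parent_of = {child: parent for parent in root.iter() for child in parent}

for verb in root.findall(".//NODE[@pos='v'][@vtype='gpr']"):
    ET.SubElement(verb, "NODE", {
        "si": parent_of[verb].get("si"),  # same function as the modified noun
        "lem": "that",
        "pos": "rel",
        "ani": "an",
        "nbr": "sp",
    })

print(ET.tostring(root, encoding="unicode"))
```

The new node under buy gets si="dobj", because that is the function of beer, the noun the relative clause modifies.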

Conditional statements

Now there are only two words that need to be added, the subject pronouns.

  <def-rule comment="Add subject pronouns to clauses that do not have them">
    <template match="//NODE[@pos = 'v' and not(.//NODE[@si = 'nsubj'])]">
      <copy>
        <apply-templates select="@* | *"/>
        <NODE><attr name="si">nsubj</attr>
              <choose>
                <when test="@prs = 'p1' and @nbr ='sg'">
                  <attr name="lem">I</attr>
                </when>
                <when test="@prs = 'p2' and @nbr ='sg'">
                  <attr name="lem">you</attr>
                </when>
              </choose>
              <attr name="pos">prn|pers</attr>
              <attr name="prs"><value-of select="@prs"/></attr>
              <attr name="nbr"><value-of select="@nbr"/></attr></NODE>
      </copy>
    </template>
  </def-rule>
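
The choose/when logic of this rule can be mirrored in Python as a simple lookup table. As before, this is an illustrative sketch with xml.etree.ElementTree on an invented tree, not the XSLT that Matxin runs; add_subjects and PRONOUNS are made-up names. The verbs are collected before anything is modified, so that each test sees the original tree, on the assumption that this is how the XSLT template sees its input.

```python
import xml.etree.ElementTree as ET

PRONOUNS = {("p1", "sg"): "I", ("p2", "sg"): "you"}

def add_subjects(root):
    """Add a subject pronoun under every verb that has no nsubj below it."""
    verbs = [v for v in root.iter("NODE")
             if v.get("pos") == "v" and v.find(".//NODE[@si='nsubj']") is None]
    for verb in verbs:
        attrs = {"si": "nsubj", "pos": "prn|pers",
                 "prs": verb.get("prs", ""), "nbr": verb.get("nbr", "")}
        lem = PRONOUNS.get((verb.get("prs"), verb.get("nbr")))
        if lem is not None:  # like <choose>: no matching <when>, no lem
            attrs["lem"] = lem
        ET.SubElement(verb, "NODE", attrs)
    return root

root = ET.fromstring("""<NODE lem="drink" pos="v" prs="p1" nbr="sg">
  <NODE lem="beer" pos="n">
    <NODE lem="buy" pos="v" prs="p2" nbr="sg"/>
  </NODE>
</NODE>""")

add_subjects(root)
print(ET.tostring(root, encoding="unicode"))
```

Only the two person and number combinations from the rule are covered; any other combination produces a pronoun node without a lem, just as the choose block would.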

Now let's compile and test:

$ matxin-preprocess-transfer matxin-tur-eng.tur-eng.t1x tur-eng.t1x.bin
4 rules processed.

$ cat input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin | matxin-transfer tur-eng.t1x.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="drink" pos="v" val="tv" vtype="fin" tns="fut" prs="p1" nbr="sg">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="buy" pos="v" val="tv" vtype="gpr" tns="past" prs="p2" nbr="sg">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="I" pos="prn|pers" prs="p1" nbr="sg" cas="gen">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
          </NODE>
          <NODE si="dobj" lem="that" pos="rel" ani="an" nbr="sp"/>
          <NODE si="nsubj" lem="you" pos="prn|pers" prs="p2" nbr="sg"/>
        </NODE>
        <NODE lem="the" pos="det|def" mi="sp"/>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      <NODE si="aux" lem="will" pos="vaux" mi="pri"/>
      <NODE si="nsubj" lem="I" pos="prn|pers" prs="p1" nbr="sg"/>
    </NODE>
  </SENTENCE>
</corpus>

And looking at the tree:

[Figure: Graph.tur3.svg]

Our tree is looking pretty English now! We have two more steps, first we need to reorder the tree, which we call linearisation, and then we need to generate the word forms in English using a morphological generator. First onto linearisation...

Reordering and linearisation

Generation
