Difference between revisions of "Matxin New Language Pair HOWTO"

Revision as of 13:30, 13 May 2016

This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.

Preliminaries

Make a directory called matxin-tur-eng. Then make two more directories matxin-tur and matxin-eng.

Note that if you are following this HOWTO for your own language pair, then tur should be the ISO 639-3 code of the source language and eng should be the ISO 639-3 code of the target language.

Analysis

There are a number of ways analysis can be done in Matxin. The Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO, but for more information, check out the following pages:

So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:

^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Save this output into a file, perhaps called input.txt. We'll need it later.
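If the stream format above is new to you, it can help to pull it apart by hand. The following is a small Python sketch, not part of Matxin or Apertium, that assumes exactly one analysis per token (as in our disambiguated example) and splits each ^form/lemma<tags>$ unit into its pieces. The names TOKEN_RE and parse_stream are made up for this illustration.

```python
import re

# One disambiguated analysis per token: ^surface/lemma<tag><tag>...$
TOKEN_RE = re.compile(r"\^([^/$]+)/([^<$]+)((?:<[^>]+>)*)\$")

def parse_stream(text):
    """Return a list of dicts with surface form, lemma and tag list."""
    tokens = []
    for surface, lemma, tags in TOKEN_RE.findall(text):
        tokens.append({"form": surface, "lemma": lemma,
                       "tags": re.findall(r"<([^>]+)>", tags)})
    return tokens

SENTENCE = ("^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ "
            "^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ "
            "^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$")

for tok in parse_stream(SENTENCE):
    # Print the tags joined with "|", the way they later appear in the mi attribute.
    print(tok["form"], tok["lemma"], "|".join(tok["tags"]))
```

The sketch ignores ambiguous tokens (several analyses separated by /), which a real analyser can produce before disambiguation.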

Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:

DELIMITERS = "." ;

Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.

LIST Adv = adv;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST N = n ;
LIST Acc = acc;
LIST Gen = gen;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms

Etc. It might seem a bit odd to just be repeating the tag in title case, but the point is that it allows us to distinguish CG tags and tag groups from those in the input morphological analyser.

So, now we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:

LIST @root = @root ;     # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ;   # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ;     # The relation of an adposition to its head
LIST @acl = @acl ;       # A clause which modifies a nominal
LIST @nmod = @nmod ;     # Nominal modifier 
LIST @dobj = @dobj ;     # The direct object of the sentence
LIST @punct = @punct ;   # Any punctuation
LIST @dep = @dep ;       # Any remaining dependency

Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.

After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:

SECTION

In constraint grammar, all rules come in sections.

So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:

MAP @advmod TARGET Adv ;

The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.

So now let's save the file and try it out! First though we need to compile the rules:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29

And now try it out:

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Great, it works. Now we can write basically the same rule for the verbal adjective aldığın, which should get an @acl tag; the postposition, which should get a @case tag; the accusative noun, which should get a @dobj tag; and the full stop, which should get a @punct tag.

MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;

Save it and try it again:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$

Great, now for a couple of harder relations: the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:

MAP @nmod TARGET Pers IF (1 Post) ;

And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".

MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ; 

So try those two rules out and we should have a fully labelled input sentence:

^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$
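To see what the contextual tests are doing, here is a rough Python imitation of the mapping rules so far. It is only a sketch of the logic, not how CG works internally: the helper names has, is_fin and label are invented here, and it assumes that a word keeps the first relation that matches, with @dep as the fallback.

```python
# Tokens as (form, tags) pairs, taken from the example analysis.
tokens = [
    ("Dün", ["adv"]),
    ("benim", ["prn", "pers", "p1", "sg", "gen"]),
    ("için", ["post"]),
    ("aldığın", ["v", "tv", "gpr_past", "px2sg"]),
    ("birayı", ["n", "acc"]),
    ("içeceğim", ["v", "tv", "fut", "p1", "sg"]),
    (".", ["sent"]),
]

FIN = {"fut", "aor", "past"}  # mirrors LIST Fin

def has(i, *wanted):
    """True if position i exists and carries every tag in `wanted`."""
    return 0 <= i < len(tokens) and set(wanted) <= set(tokens[i][1])

def is_fin(i):
    """True if position i exists and carries one of the finite tags."""
    return 0 <= i < len(tokens) and bool(FIN & set(tokens[i][1]))

def label(i):
    """Return the first relation whose target and context both match."""
    if has(i, "adv"):
        return "@advmod"
    if has(i, "post"):
        return "@case"
    if has(i, "gpr_past"):
        return "@acl"
    if has(i, "acc"):
        return "@dobj"
    if has(i, "sent"):
        return "@punct"
    # MAP @nmod TARGET Pers IF (1 Post)
    if has(i, "prn", "pers") and has(i + 1, "post"):
        return "@nmod"
    # MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent)
    if is_fin(i) and not any(is_fin(j) for j in range(i)) and has(i + 1, "sent"):
        return "@root"
    return "@dep"  # default when nothing else matched

labels = [label(i) for i in range(len(tokens))]
for (form, _), rel in zip(tokens, labels):
    print(form, rel)
```

Note that aldığın is not treated as finite here: its tag is gpr_past, which is a different tag from past, so it does not block içeceğim from becoming the root.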

That covers our sentence, but we want to add one more rule "just in case" ... and that is a default rule that adds the relation @dep to any token that isn't covered by the other rules:

MAP (@dep) TARGET (*) ;

Tree building

Now that we have a fully labelled sentence, we can start building the tree. First make another section:

SECTION

The first thing we want to do is attach the root node:

SETPARENT @root TO (@0 (*)) ;

This basically says, set the parent of the root node to node 0, which is the invisible root that CG uses. Next we want to attach all of the rest of the nodes to this root node:

SETPARENT (*) (NEGATE p (*)) TO (0* @root) ;

Here we have an extra condition (NEGATE p (*)), which means that the matched node should not already have a parent.

Let's test the rules, so save the file, and go to the terminal:

$ cat input.txt | cg-proc -f 2 tur.deprlx.bin
<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lem="ben" mi="prn|pers|p1|sg|gen" si="nmod"/>
      <NODE ord="3" alloc="0" form="için" lem="için" mi="post" si="case"/>
      <NODE ord="4" alloc="0" form="aldığın" lem="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"/>
      <NODE ord="7" alloc="0" form="." lem="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

Note how here we have passed the -f 2 parameter to the cg-proc program, so the output is now in Matxin XML format.
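Since the output is plain XML, it is easy to inspect with ordinary tools. As an illustration only, this Python sketch reads a trimmed copy of the tree above (just three of the nodes, with the attribute names as printed) and prints one indented line per NODE. The function name outline is made up for the example.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the output above: the root and two of its children.
XML = """<corpus>
  <SENTENCE ord="1" alloc="0">
    <NODE ord="6" alloc="0" form="içeceğim" lem="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lem="dün" mi="adv" si="advmod"/>
      <NODE ord="5" alloc="0" form="birayı" lem="bira" mi="n|acc" si="dobj"/>
    </NODE>
  </SENTENCE>
</corpus>"""

def outline(node, depth=0):
    """Return one indented line per NODE, visiting children depth-first."""
    lines = ["  " * depth + f'{node.get("ord")} {node.get("form")} ({node.get("si")})']
    for child in node.findall("NODE"):
        lines.extend(outline(child, depth + 1))
    return lines

corpus = ET.fromstring(XML)
for sentence in corpus.iter("SENTENCE"):
    for top in sentence.findall("NODE"):
        print("\n".join(outline(top)))
```

A flat tree like this one shows every word indented exactly one level under the root, which makes it easy to spot words that still need a better parent.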

This is a pretty sad looking tree :( ... But fortunately using rules we can make it a happy tree! One that correctly implements the annotation guidelines.

So, let's think of some rules:

  • The direct object should depend on the finite verb
  • A postposition should depend on its head
  • A relative clause should modify (depend on) a noun
  • An adverb should modify a verb

Let's start with the first one:

SETPARENT @dobj TO (1* Fin) ;

This rule says that the parent of the direct object should be the first finite verb found anywhere to the right. Next up:

SETPARENT @case TO (-1 Pers) ;

Set the parent of the word with the @case label to be the previous personal pronoun. And then:

SETPARENT @acl TO (1 N) ; 

Set the parent of the word with the @acl label to be the following noun.

Let's save the file and apply the rules:

$ cat input.txt  | cg-proc -f 2 tur.deprlx.bin 
<corpus>
  <SENTENCE ord="X" alloc="Y">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
      <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
        <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
      </NODE>
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl"/>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

This tree is looking much better. The only thing we have left to do is to attach the postpositional phrase benim için and the adverb Dün to the appropriate verb, which in this case is the head of the relative clause aldığın.

So we could specify the rule something like: Set the parent of a nominal modifier or an adverb to be the first verb to the right. This rule happens to work in this case, but is not very robust.

SETPARENT @advmod TO (1* V BARRIER V) ;

SETPARENT @nmod TO (1* V BARRIER V) ;

The 1* X BARRIER Y instruction here means that the parser should read to the right looking to match context X (a verb), but it should stop if it finds context Y (in this case another verb).
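The scanning behaviour can be imitated in a few lines of Python. This is a sketch under simplifying assumptions, not a reimplementation of CG: every word takes the parent from the one rule that applies to its relation, falling back to the root, and the target is tested before the barrier at each position. The helper scan_right and the data layout are invented for the illustration.

```python
# (form, tags, relation); positions are 1-based, 0 is the invisible root.
words = [
    ("Dün", {"adv"}, "@advmod"),
    ("benim", {"prn", "pers", "p1", "sg", "gen"}, "@nmod"),
    ("için", {"post"}, "@case"),
    ("aldığın", {"v", "tv", "gpr_past", "px2sg"}, "@acl"),
    ("birayı", {"n", "acc"}, "@dobj"),
    ("içeceğim", {"v", "tv", "fut", "p1", "sg"}, "@root"),
    (".", {"sent"}, "@punct"),
]
FIN = {"fut", "aor", "past"}  # mirrors LIST Fin

def is_verb(tags):
    return "v" in tags

def is_fin(tags):
    return bool(FIN & tags)

def scan_right(start, target, barrier=None):
    """Imitate `1* target BARRIER barrier`: position of the first match, or None."""
    for pos in range(start + 1, len(words) + 1):
        tags = words[pos - 1][1]
        if target(tags):
            return pos
        if barrier is not None and barrier(tags):
            return None  # hit the barrier before finding a match
    return None

root = next(pos for pos, w in enumerate(words, 1) if w[2] == "@root")
parent = {}
for pos, (form, tags, rel) in enumerate(words, 1):
    head = None
    if rel == "@root":
        head = 0
    elif rel == "@dobj":                      # SETPARENT @dobj TO (1* Fin)
        head = scan_right(pos, is_fin)
    elif rel == "@case" and pos >= 2 and {"prn", "pers"} <= words[pos - 2][1]:
        head = pos - 1                        # SETPARENT @case TO (-1 Pers)
    elif rel == "@acl" and pos < len(words) and "n" in words[pos][1]:
        head = pos + 1                        # SETPARENT @acl TO (1 N)
    elif rel in ("@advmod", "@nmod"):         # SETPARENT ... TO (1* V BARRIER V)
        head = scan_right(pos, is_verb, barrier=is_verb)
    parent[pos] = root if head is None else head

for pos, (form, _, rel) in enumerate(words, 1):
    print(pos, form, rel, "->", parent[pos])
```

In this sketch both Dün and benim stop at aldığın (position 4), the first verb to their right, rather than running on to the finite verb.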

We can test these rules and see the output:

$ cat input.txt  | cg-proc -f 2 tur.deprlx.bin 
<corpus>
  <SENTENCE ord="X" alloc="Y">
    <NODE ord="6" alloc="0" form="içeceğim" lemma="iç" mi="v|tv|fut|p1|sg" si="root">
      <NODE ord="5" alloc="0" form="birayı" lemma="bira" mi="n|acc" si="dobj">
        <NODE ord="4" alloc="0" form="aldığın" lemma="al" mi="v|tv|gpr_past|px2sg" si="acl">
          <NODE ord="1" alloc="0" form="Dün" lemma="dün" mi="adv" si="advmod"/>
          <NODE ord="2" alloc="0" form="benim" lemma="ben" mi="prn|pers|p1|sg|gen" si="nmod">
            <NODE ord="3" alloc="0" form="için" lemma="için" mi="post" si="case"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ord="7" alloc="0" form="." lemma="." mi="sent" si="punct"/>
    </NODE>
  </SENTENCE>
</corpus>

Yesss! Now we have a nice tree ready to be translated!

Lexical transfer

The first stage of transfer is lexical transfer. This is where we take the tree that we have just constructed, and we translate the words, and convert the morphological information into attributes in the tree. This is done using an lttoolbox dictionary in a form that will be familiar to those who have used Apertium before. There are however a number of differences which will be explained below.

In any case, first we need to change directory to matxin-tur-eng and to make a new file, matxin-tur-eng.tur-eng.dix. In this file we start by defining our attributes:


<dictionary>
  <sdefs>
    <sdef n="mi"     c="Morphological information"/>
    <sdef n="pos"    c="Part of speech"/>
    <sdef n="nbr"    c="Number"/>
    <sdef n="prs"    c="Person"/>
    <sdef n="cas"    c="Case"/>
    <sdef n="tns"    c="Tense"/>
  </sdefs> 

</dictionary>

The attributes are defined by <sdef> tags, which stand for symbol (in this case an attribute) definition. After this we can start by adding a new section for our first entry:

  <section id="main" type="standard">
  
  </section>

And within that section, our entry:

    <e><p><l>dün<s n="mi"/>adv</l><r>yesterday<s n="pos"/>adv</r></p></e>

So now we can save this file and compile it:

$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 14 13

Note that the lr means to compile the dictionary (which is in fact a finite-state transducer) from left to right, translating from Turkish to English. Unlike in Apertium, in Matxin the bilingual dictionaries are unidirectional.

You can try it out using the matxin-xfer-lex command:


$ cat input.txt | cg-proc -f2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin 

<corpus>
  <SENTENCE ref="1" alloc="0">
    <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
      <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="bira" mi="n|acc" unknown="transfer">
        <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
          <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
          <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
            <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="için" mi="post" unknown="transfer"/>
          </NODE>
        </NODE>
      </NODE>
      <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." mi="sent" unknown="transfer"/>
    </NODE>
  </SENTENCE>
</corpus>

You'll note that the XML output has changed substantially. Some of the attributes have been renamed, and some new attributes have been created. Probably the most obvious thing is that all of the words except dün "yesterday" have a new attribute unknown with the value transfer. This means that the word could not be found in the bilingual dictionary, which is not surprising as we only added one word. How about the other attributes?

  • mi → smi: Morphological information is now "source morphological information"
  • lem → slem: Lemma is now "source lemma"
  • UpCase: To do with getting casing right
  • mi: This is copied from the smi
  • lem: With unknown words this is copied from the source, otherwise the translation is inserted.
  • pos: This is the attribute that we defined in our bilingual dictionary.

So with that in mind we can start translating the other words:

    <e><p><l>için<s n="mi"/>post</l><r>for<s n="pos"/>pr</r></p></e>
    <e><p><l>.<s n="mi"/>sent</l><r>.<s n="pos"/>sent</r></p></e>

These two are easy, as they work just like the adverb. For categories that inflect, however, things get a touch more complex, as we need to convert the morphological information into feature attributes in the XML. We do this using paradigms.

Paradigms

Paradigms are used to convert strings of input tags into features for the tree. The section they belong in goes after the sdefs section and before the section id="main". Let's start looking at nouns:

  <pardefs>
    <pardef n="n__n"> 
      <e><p><l>|nom</l><r><s n="cas"/>nom</r></p></e>
      <e><p><l>|acc</l><r><s n="cas"/>acc</r></p></e>
    </pardef>
  </pardefs>

This paradigm converts the strings |nom and |acc into the attribute cas with the values nom and acc respectively.
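One way to picture what the paradigm does is as a lookup table from the tail of the mi string to an attribute and value. The Python sketch below is only a mental model, not how lttoolbox works internally. The names ENTRIES, N__N and transfer_noun are made up, and the single dictionary entry (bira translated as beer) is an assumed example.

```python
# Hypothetical mirror of one dictionary entry plus the n__n paradigm.
ENTRIES = {("bira", "n"): ("beer", "n")}   # (source lemma, first mi tag) -> (lem, pos)
N__N = {"|nom": ("cas", "nom"), "|acc": ("cas", "acc")}

def transfer_noun(slem, smi):
    """Translate the lemma and turn the case tag into a cas attribute.

    Raises KeyError if the lemma or the ending is not listed.
    """
    first, _, rest = smi.partition("|")
    lem, pos = ENTRIES[(slem, first)]
    name, value = N__N["|" + rest]
    return {"lem": lem, "pos": pos, name: value}

print(transfer_noun("bira", "n|acc"))
```

The point of the design is that the case handling is written once in the paradigm and shared by every noun entry that refers to it.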

Now that we've defined the paradigm, we can try using it. Go back to the main section, and add:

    <e><p><l>bira<s n="mi"/>n</l><r>beer<s n="pos"/>n</r></p><par n="n__n"/></e>

Then compile the dictionary:

$ lt-comp lr matxin-tur-eng.tur-eng.dix tur-eng.autobil.bin
main@standard 38 41

And test it:


$ cat  input.txt | cg-proc -f 2 tur-eng.deprlx.bin | matxin-xfer-lex tur-eng.autobil.bin
<corpus>
  <SENTENCE ref="1" alloc="0">
    <CHUNK ref="0" type="root">
      <NODE ref="6" alloc="0" slem="iç" smi="v|tv|fut|p1|sg" si="root" UpCase="none" lem="iç" mi="v|tv|fut|p1|sg" unknown="transfer">
        <NODE ref="5" alloc="0" slem="bira" smi="n|acc" si="dobj" UpCase="none" lem="beer" pos="n" cas="acc">
          <NODE ref="4" alloc="0" slem="al" smi="v|tv|gpr_past|px2sg" si="acl" UpCase="none" lem="al" mi="v|tv|gpr_past|px2sg" unknown="transfer">
            <NODE ref="1" alloc="0" slem="dün" smi="adv" si="advmod" UpCase="none" lem="yesterday" pos="adv"/>
            <NODE ref="2" alloc="0" slem="ben" smi="prn|pers|p1|sg|gen" si="nmod" UpCase="none" lem="ben" mi="prn|pers|p1|sg|gen" unknown="transfer">
              <NODE ref="3" alloc="0" slem="için" smi="post" si="case" UpCase="none" lem="for" pos="pr"/>
            </NODE>
          </NODE>
        </NODE>
        <NODE ref="7" alloc="0" slem="." smi="sent" si="punct" UpCase="none" lem="." pos="sent"/>
      </NODE>
    </CHUNK>
  </SENTENCE>
</corpus>

As we move on to verb forms, it's worth noting that paradigms can call other paradigms. For example, if we want to avoid listing all the possible combinations of tense, person and number, we can set up the paradigms as follows (the val attribute used by tv__v also needs its own <sdef> in the sdefs section):


    <pardef n="num"> 
      <e><p><l>|sg</l><r><s n="nbr"/>sg</r></p></e>
      <e><p><l>|pl</l><r><s n="nbr"/>pl</r></p></e>
    </pardef>

    <pardef n="pers"> 
      <e><p><l>|p1</l><r><s n="prs"/>p1</r></p><par n="num"/></e>
      <e><p><l>|p2</l><r><s n="prs"/>p2</r></p><par n="num"/></e>
      <e><p><l>|p3</l><r><s n="prs"/>p3</r></p><par n="num"/></e>
    </pardef>

    <pardef n="tense">
      <e><p><l>|fut</l><r><s n="tns"/>fut</r></p><par n="pers"/></e>
    </pardef>

    <pardef n="tv__v">
      <e><p><l>|tv</l><r><s n="val"/>tv</r></p><par n="tense"/></e>
    </pardef>

Then in the main section:

    <e><p><l>iç<s n="mi"/>v</l><r>drink<s n="pos"/>v</r></p><par n="tv__v"/></e>
    <e><p><l>al<s n="mi"/>v</l><r>buy<s n="pos"/>v</r></p><par n="tv__v"/></e>
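The chaining of paradigms can be pictured as walking through the mi tags one at a time, with each paradigm handing over to the next. The following Python sketch illustrates that idea only. It does not reproduce lttoolbox's transducers, and the table layout and the function name transfer are assumptions made for the example.

```python
# Each paradigm maps an input tag to (attribute, value, next paradigm or None).
PARADIGMS = {
    "tv__v": {"tv": ("val", "tv", "tense")},
    "tense": {"fut": ("tns", "fut", "pers")},
    "pers": {p: ("prs", p, "num") for p in ("p1", "p2", "p3")},
    "num": {n: ("nbr", n, None) for n in ("sg", "pl")},
}
# (source lemma, first mi tag) -> (target lemma, pos, paradigm to continue with)
ENTRIES = {("iç", "v"): ("drink", "v", "tv__v"), ("al", "v"): ("buy", "v", "tv__v")}

def transfer(slem, smi):
    """Return target attributes, or None if the tag string is not covered."""
    tags = smi.split("|")
    entry = ENTRIES.get((slem, tags[0]))
    if entry is None:
        return None
    lem, pos, paradigm = entry
    attrs = {"lem": lem, "pos": pos}
    for tag in tags[1:]:
        if paradigm is None or tag not in PARADIGMS[paradigm]:
            return None  # the chain has no path for this tag
        name, value, paradigm = PARADIGMS[paradigm][tag]
        attrs[name] = value
    # Only accept the word if the chain was followed to its end.
    return attrs if paradigm is None else None

print(transfer("iç", "v|tv|fut|p1|sg"))
print(transfer("al", "v|tv|gpr_past|px2sg"))
```

In this sketch the second call returns None, because the tense paradigm only lists fut. A form like aldığın would need further paradigm entries before it could be covered.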

Structural transfer

Generation

lttoolbox | hfst

See also