Matxin New Language Pair HOWTO

From Apertium

Revision as of 14:44, 12 May 2016

This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.

Preliminaries

Make a directory called matxin-tur-eng. Then make two more directories matxin-tur and matxin-eng.

Note that if you are doing this howto for your own language, then tur should be the ISO-639-3 code of the source language and eng should be the ISO-639-3 code of the target language.
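
If you prefer to do this from the shell, something like the following should work. It is only a sketch: the HOWTO doesn't say whether the three directories sit side by side or are nested, so the sibling layout here is an assumption, and SRC and TRG are just illustrative variable names.

```shell
# Assumption: all three directories are created side by side in the
# current directory. Change SRC and TRG for your own language pair.
SRC=tur   # ISO-639-3 code of the source language
TRG=eng   # ISO-639-3 code of the target language
mkdir -p "matxin-$SRC-$TRG" "matxin-$SRC" "matxin-$TRG"
```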

Analysis

There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to use Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is outside the scope of this HOWTO, but for more information, check out the following pages:

So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:

^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Save this output into a file, perhaps called input.txt. We'll need it later.
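
If you don't have the analyser to hand yet, one way to create the file is to paste the example output in with a here-document. This is just a sketch; normally you would redirect the output of your own analyser pipeline instead.

```shell
# Write the example analysis to input.txt.
# The quoted 'EOF' stops the shell from expanding the $ characters.
cat > input.txt <<'EOF'
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
EOF

# Sanity check: count the lexical units (each one starts with ^).
grep -o '\^' input.txt | wc -l
```

For this sentence the count should come out as 7: six words plus the full stop.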

Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx (this is the name used in the compile commands below). The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:

DELIMITERS = "." ;

Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.

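LIST Adv = adv ;           # Adverb
LIST Pers = (prn pers) ;   # Personal pronoun
LIST Post = post ;         # Postposition
LIST V = v ;               # Verb
LIST Acc = acc ;           # Accusative case
LIST Gen = gen ;           # Genitive case
LIST Gpr = gpr_past ;      # Verbal adjective (past)
LIST Sent = sent ;         # Sentence-final punctuation
LIST Fin = fut aor past ;  # Finite verb forms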

Etc. It might seem a bit odd to just repeat the tag in title case, but the point is that it lets us distinguish CG set names from the tags that come from the morphological analyser.

So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:

LIST @root = @root ;     # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ;   # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ;     # The relation of an adposition to its head
LIST @acl = @acl ;       # A clause which modifies a nominal
LIST @nmod = @nmod ;     # Nominal modifier 
LIST @dobj = @dobj ;     # The direct object of the sentence
LIST @punct = @punct ;   # Any punctuation
LIST @dep = @dep ;       # Any remaining dependency

Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.

After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:

SECTION

In constraint grammar, all rules come in sections.

So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:

MAP @advmod TARGET Adv ;

The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.

So now let's save the file and try it out! First though we need to compile the rules:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29

And now try it out:

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
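
You'll be repeating this compile-and-run cycle after every change to the grammar, so it can be handy to wrap it in a small shell function. The function name retest and its default arguments are made up for this sketch; the file names simply follow the commands above.

```shell
# Hypothetical helper (not part of Matxin or CG-3): recompile the grammar
# and rerun it on the test sentence in one step.
retest() {
    grammar=${1:-matxin-tur.tur.deprlx}
    binary=${2:-tur.deprlx.bin}
    input=${3:-input.txt}
    if ! command -v cg-comp >/dev/null 2>&1; then
        echo "cg-comp not found; install the CG-3 tools first" >&2
        return 1
    fi
    # Only run the grammar if it compiled cleanly.
    cg-comp "$grammar" "$binary" && cg-proc "$binary" < "$input"
}
```

After sourcing this, running retest with no arguments does the same as the two commands above.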

Great, it works... so now we can write basically the same kind of rule for the verbal adjective aldığın, which should get an @acl tag, the postposition, which should get a @case tag, and the accusative noun, which should get a @dobj tag. While we're at it, the sentence-final punctuation gets @punct.

MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;

Save it and try it again:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
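
To see at a glance which tokens still lack a relation, you can filter the output for lexical units with no @ tag. The helper name unlabelled is made up for this sketch, and it assumes the Apertium stream format shown above, where each unit runs from ^ to $.

```shell
# Hypothetical helper: read Apertium stream format on stdin and print
# every lexical unit (^...$) that has no tag beginning with @.
unlabelled() {
    grep -o '\^[^$]*\$' | grep -v '<@'
}

# Usage with the output shown above:
unlabelled <<'EOF'
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
EOF
```

This should print just the units for benim and içeceğim, the two words that have no relation yet.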

Great, now for a couple of harder relations: the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:

MAP @nmod TARGET Pers IF (1 Post) ;

And finally we need the root of the sentence, the finite verb. In Turkish the finite verb is typically the last verb in the sentence, so we can go for something like "if the word is a finite verb, the next token is a full stop, and there is no other finite verb to the left, then make it the root".

MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ; 

So try those two rules out and we should have a fully labelled input sentence:

^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$

That covers our sentence, but we want to add one more rule just in case: a default rule that adds the relation @dep to any token that isn't covered by the other rules:

MAP (@dep) TARGET (*) ;
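
For reference, here is one way to write out the whole grammar built up in this section and test it in one go. Treat it as a sketch: the file names follow the compile commands above, the rules are exactly the ones introduced so far, and the compile step is skipped if the CG-3 tools or input.txt are missing.

```shell
# Write the grammar to the file name used by the compile commands above.
# The quoted 'EOF' keeps the shell from touching * and $ in the grammar.
cat > matxin-tur.tur.deprlx <<'EOF'
DELIMITERS = "." ;

LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms

LIST @root = @root ;
LIST @nsubj = @nsubj ;
LIST @advmod = @advmod ;
LIST @case = @case ;
LIST @acl = @acl ;
LIST @nmod = @nmod ;
LIST @dobj = @dobj ;
LIST @punct = @punct ;
LIST @dep = @dep ;

SECTION

MAP @advmod TARGET Adv ;
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
MAP @nmod TARGET Pers IF (1 Post) ;
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
# Default rule, kept last so the more specific rules get the first chance.
MAP (@dep) TARGET (*) ;
EOF

# Compile and run only if the CG-3 tools and the test sentence are present.
if command -v cg-comp >/dev/null 2>&1 && [ -f input.txt ]; then
    cg-comp matxin-tur.tur.deprlx tur.deprlx.bin &&
        cg-proc tur.deprlx.bin < input.txt
else
    echo "cg-comp or input.txt missing; grammar written but not compiled" >&2
fi
```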

Transfer

lttoolbox matxin

Generation

lttoolbox | hfst

See also