Difference between revisions of "Matxin New Language Pair HOWTO"

From Apertium
Line 2: Line 2:


This page describes the process of creating a new language pair with [[Matxin]], a dependency-based machine translation system.

==Preliminaries==

Make a directory called <code>matxin-tur-eng</code>. Then make two more directories <code>matxin-tur</code> and <code>matxin-eng</code>.

Note that if you are following this HOWTO for your own language pair, <code>tur</code> should be replaced by the ISO 639-3 code of the source language and <code>eng</code> by the ISO 639-3 code of the target language.


==Analysis==


There are a number of ways analysis can be done in Matxin: the [[matxin-spa-eus|Spanish to Basque]] system uses [[FreeLing]], while the [[matxin-eng-eus|English to Basque]] system uses a wrapper around the Stanford parser. In this tutorial we're going to be using [[Constraint Grammar]] (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is outside the scope of this HOWTO, but for more information, check out the following pages:


* [[Starting a new language with lttoolbox]]
Line 17: Line 23:
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
</pre>

Save this output into a file, perhaps called <code>input.txt</code>. We'll need it later.
Now go into the <code>matxin-tur</code> directory, and create a file <code>matxin-tur.tur.deprlx</code>. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:

<pre>
DELIMITERS = "." ;
</pre>

Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.

<pre>
LIST Adv = adv;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc;
LIST Gen = gen;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
</pre>

And so on. It might seem a bit odd to just repeat each tag in title case, but doing so lets us distinguish the CG set names from the tags that come from the morphological analyser.

So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the [[Universal dependencies]] project. Here are some examples:

<pre>
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
</pre>

Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an <code>@</code> symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.

After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:

<pre>
SECTION
</pre>

In Constraint Grammar, all rules come in sections.

So, now for a rule. Let's start with an easy one. Adverbs nearly always get the <code>@advmod</code> relation, whether they are modifying an adjective or a verb, so we can safely map <code>@advmod</code> to the adverb using the following rule:

<pre>
MAP @advmod TARGET Adv ;
</pre>

The <code>MAP</code> rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.

So now let's save the file and try it out! First though we need to compile the rules:

<pre>
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
</pre>

And now try it out:

<pre>
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
</pre>

Great, it works... so now we can write basically the same rule for the verbal adjective ''aldığın'', which should get an <code>@acl</code> tag, the postposition, which should get a <code>@case</code> tag, the accusative, which should get a <code>@dobj</code> tag, and the sentence-final punctuation, which should get a <code>@punct</code> tag.

<pre>
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
</pre>

Save it and try it again:

<pre>
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
</pre>

Great, now for a couple of harder relations: the functions of ''benim'' and of ''içeceğim''.


==Transfer==

Revision as of 14:35, 12 May 2016

This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.

Preliminaries

Make a directory called matxin-tur-eng. Then make two more directories matxin-tur and matxin-eng.

Note that if you are following this HOWTO for your own language pair, tur should be replaced by the ISO 639-3 code of the source language and eng by the ISO 639-3 code of the target language.
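
As a minimal sketch, the directories could be created from the shell as follows. The variable names SRC and TRG are illustrative only, and placing the three directories side by side is an assumption, since the text above does not say whether they should be nested.

```shell
# ISO 639-3 codes of the source and target languages (change for your own pair).
SRC=tur
TRG=eng

# Assumption: the pair directory and the two monolingual directories are siblings.
mkdir -p "matxin-$SRC-$TRG" "matxin-$SRC" "matxin-$TRG"
```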

Analysis

There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is outside the scope of this HOWTO, but for more information, check out the following pages:

So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:

^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Save this output into a file, perhaps called input.txt. We'll need it later.
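
If you just want to follow along without an analyser at hand, one way to create the file is a quoted here-document, sketched below. The two lines are copied from the output above; quoting the EOF marker stops the shell from trying to expand the $ characters of the stream format.

```shell
# Write the disambiguated analysis of the example sentence to input.txt.
# 'EOF' is quoted so that "$" and "^" are kept literally.
cat > input.txt <<'EOF'
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
EOF
```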

Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:

DELIMITERS = "." ;

Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.

LIST Adv = adv;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc;
LIST Gen = gen;
LIST Gpr = gpr_past ;
LIST Sent = sent ;

And so on. It might seem a bit odd to just repeat each tag in title case, but doing so lets us distinguish the CG set names from the tags that come from the morphological analyser.
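
To see which morphological tags are available to build lists from, it can help to extract them from the analyser output. The following is only a sketch using standard grep and sort; the file name tags.txt is a hypothetical choice, and the analysis is the example sentence copied from above.

```shell
# Analysis of the example sentence, copied from above.
analysis='^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$'

# Pull out every <tag>, drop duplicates, and keep a copy in tags.txt.
printf '%s\n' "$analysis" | grep -o '<[^>]*>' | sort -u | tee tags.txt
```

Each distinct tag is printed once, so the result is a checklist of candidates for LIST definitions.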

So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal dependencies project. Here are some examples:

LIST @root = @root ;     # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ;   # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ;     # The relation of an adposition to its head
LIST @acl = @acl ;       # A clause which modifies a nominal
LIST @nmod = @nmod ;     # Nominal modifier 
LIST @dobj = @dobj ;     # The direct object of the sentence
LIST @punct = @punct ;   # Any punctuation
LIST @dep = @dep ;       # Any remaining dependency

Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.

After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:

SECTION

In Constraint Grammar, all rules come in sections.

So, now for a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:

MAP @advmod TARGET Adv ;

The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.

So now let's save the file and try it out! First though we need to compile the rules:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29

And now try it out:

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ 
^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$

Great, it works... so now we can write basically the same rule for the verbal adjective aldığın, which should get an @acl tag, the postposition, which should get a @case tag, the accusative, which should get a @dobj tag, and the sentence-final punctuation, which should get a @punct tag.

MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
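
Putting the pieces so far together, the grammar file might look like the sketch below, written out with a quoted here-document. Everything inside it is taken from the fragments above; the file name follows the cg-comp command used in this HOWTO, and the compile step is only attempted if cg-comp happens to be installed.

```shell
# Assemble the grammar built up so far into one file.
cat > matxin-tur.tur.deprlx <<'EOF'
DELIMITERS = "." ;

# Morphological tags
LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;

# Syntactic relations
LIST @root = @root ;
LIST @nsubj = @nsubj ;
LIST @advmod = @advmod ;
LIST @case = @case ;
LIST @acl = @acl ;
LIST @nmod = @nmod ;
LIST @dobj = @dobj ;
LIST @punct = @punct ;
LIST @dep = @dep ;

SECTION

MAP @advmod TARGET Adv ;
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
EOF

# Compile only when the CG tools are available on this machine.
if command -v cg-comp >/dev/null 2>&1; then
    cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
fi
```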

Save it and try it again:

$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29

$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ 
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$

Great, now for a couple of harder relations: the functions of benim and of içeceğim.
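
A quick way to confirm which words still lack a relation is to filter the tagged output for units without an <@...> tag. This is a sketch with plain grep, not part of Matxin or CG; the file name untagged.txt is hypothetical, and the input is the cg-proc output copied from above.

```shell
# Output of cg-proc for the example sentence, copied from above.
tagged='^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$
^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$'

# Split the stream into one ^...$ unit per line, then keep only the
# units that do not carry a <@relation> tag yet.
printf '%s\n' "$tagged" | grep -o '\^[^$]*\$' | grep -v '<@' | tee untagged.txt
```

For the example sentence this leaves exactly the two units for benim and içeceğim.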

Transfer

lttoolbox matxin

Generation

lttoolbox | hfst

See also