Difference between revisions of "Matxin New Language Pair HOWTO"
Revision as of 14:35, 12 May 2016
This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.
Preliminaries
Make a directory called matxin-tur-eng. Then make two more directories, matxin-tur and matxin-eng.
Note that if you are following this HOWTO for your own language pair, then tur should be replaced by the ISO 639-3 code of the source language and eng by the ISO 639-3 code of the target language.
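The directory setup can also be scripted. The following is only a sketch: the helper name make_layout is made up for illustration, and it assumes the three directories sit side by side, which the text above does not state explicitly.

```python
import tempfile
from pathlib import Path


def make_layout(base, src="tur", trg="eng"):
    """Create the pair directory and the two language directories under `base`.

    Assumption: the three directories are siblings; nest them if you prefer.
    """
    names = [f"matxin-{src}-{trg}", f"matxin-{src}", f"matxin-{trg}"]
    paths = [Path(base) / name for name in names]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    created = make_layout(base)
    names = sorted(path.name for path in created)
    all_exist = all(path.is_dir() for path in created)
    print(names)
```

Swap in your own ISO 639-3 codes through the src and trg arguments.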
Analysis
There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO, but for more information, check out the following pages:
* Starting a new language with lttoolbox
So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
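If the stream format is new to you, it can help to take one line apart by hand. The sketch below is not part of Matxin or Apertium; parse_stream is a hypothetical helper, and it assumes every word has exactly one reading, as in the disambiguated output above.

```python
import re

# One lexical unit looks like ^surface/lemma<tag1><tag2>...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)*)\$")


def parse_stream(line):
    """Return a list of (surface, lemma, tags) tuples for one line of output."""
    units = []
    for surface, lemma, tag_string in LEXICAL_UNIT.findall(line):
        tags = re.findall(r"<([^>]+)>", tag_string)
        units.append((surface, lemma, tags))
    return units


sentence = (
    "^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ "
    "^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ "
    "^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$"
)

parsed = parse_stream(sentence)
for surface, lemma, tags in parsed:
    print(surface, lemma, tags)
```

Each word comes out as a surface form, a lemma and a list of tags, which is the information the CG rules below operate on.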
Save this output into a file, perhaps called input.txt. We'll need it later.
Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx (this is the name used in the compile commands below). The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:
DELIMITERS = "." ;
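To make the effect of the delimiter concrete, here is a rough sketch of the splitting. It is an illustration, not CG's own code: split_windows and the short lemma list are invented, and it assumes the delimiter token closes the chunk it appears in.

```python
def split_windows(lemmas, delimiters=(".",)):
    """Group a flat list of lemmas into chunks, each ending at a delimiter."""
    windows, current = [], []
    for lemma in lemmas:
        current.append(lemma)
        if lemma in delimiters:
            # The delimiter closes the current chunk and stays inside it.
            windows.append(current)
            current = []
    if current:
        # Trailing material without a final delimiter forms its own chunk.
        windows.append(current)
    return windows


# Hypothetical input: the lemmas of two short sentences run together.
lemmas = ["bira", "iç", ".", "dün", "al", "."]
windows = split_windows(lemmas)
print(windows)
```

Rules only ever see one chunk at a time, so a context condition cannot reach across a full stop.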
Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.
LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
And so on. It might seem a bit odd to just repeat the tag in title case, but the point is that it lets us distinguish the names of CG sets from the tags that come from the morphological analyser.
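The bracketed entry in Pers is worth a closer look: (prn pers) only matches a reading that carries both tags, whereas separate entries in a list are alternatives. A small sketch of that matching logic follows. The SETS table and the matches helper are hypothetical and only mimic the behaviour, and the dem tag in the usage is made up for contrast.

```python
# Each set is a list of members; each member is a tuple of tags that must all be present.
SETS = {
    "Adv": [("adv",)],
    "Pers": [("prn", "pers")],  # corresponds to LIST Pers = (prn pers) ;
    "Acc": [("acc",)],
}


def matches(set_name, tags):
    """True if any member of the set has all of its tags in the reading."""
    return any(all(tag in tags for tag in member) for member in SETS[set_name])


benim = ["prn", "pers", "p1", "sg", "gen"]
print(matches("Pers", benim))           # a personal pronoun reading
print(matches("Pers", ["prn", "dem"]))  # prn without pers ("dem" is a made-up tag)
print(matches("Adv", ["adv"]))
```

The first and third calls succeed, while the second fails because the reading lacks pers.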
So, now that we've got some morphological information, we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.
After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence. So let's start with:
SECTION
In constraint grammar, all rules come in sections.
So, now for a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:
MAP @advmod TARGET Adv ;
The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.
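What a context-free MAP rule does can be imitated in a few lines. This is a sketch only: map_relation is an invented helper, it works on hand-written (surface, lemma, tags) tuples rather than on real CG cohorts, and the check for an existing @ tag is an approximation of CG normally leaving already-mapped readings alone.

```python
def map_relation(relation, target_tag, units):
    """Add `relation` to every unit that has `target_tag` and no relation yet."""
    mapped = []
    for surface, lemma, tags in units:
        has_relation = any(tag.startswith("@") for tag in tags)
        if target_tag in tags and not has_relation:
            tags = tags + [relation]  # build a new list, leave the input untouched
        mapped.append((surface, lemma, tags))
    return mapped


# The first three words of the example sentence, written out by hand.
units = [
    ("Dün", "dün", ["adv"]),
    ("benim", "ben", ["prn", "pers", "p1", "sg", "gen"]),
    ("için", "için", ["post"]),
]

# Roughly equivalent to: MAP @advmod TARGET Adv ;
result = map_relation("@advmod", "adv", units)
for unit in result:
    print(unit)
```

Only the adverb gains a relation; the other two words pass through unchanged, which is what we expect from the real rule as well.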
So now let's save the file and try it out! First though we need to compile the rules:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
And now try it out:
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Great, it works. Now we can write basically the same rule for the verbal adjective aldığın, which should get an @acl tag, for the postposition, which should get a @case tag, and for the accusative, which should get a @dobj tag. We also map @punct to the sentence-final punctuation.
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
Save it and try it again:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
Great, now for a couple of harder relations, the functions of benim and of içeceğim.
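A quick way to confirm which words are still waiting for a relation is to scan the output for lexical units without an @ tag. The helper below, unlabelled, is a hypothetical convenience and not part of the CG tools; it assumes one reading per word.

```python
import re

output = (
    "^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ "
    "^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ "
    "^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$"
)


def unlabelled(stream):
    """Return the surface forms of lexical units that carry no <@...> tag."""
    missing = []
    for unit in re.findall(r"\^(.*?)\$", stream):
        surface = unit.split("/", 1)[0]
        if "<@" not in unit:
            missing.append(surface)
    return missing


still_missing = unlabelled(output)
print(still_missing)  # ['benim', 'içeceğim']
```

These are exactly the two words whose functions depend on context rather than on a single tag.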
Transfer
lttoolbox matxin
Generation
lttoolbox | hfst