Difference between revisions of "Matxin New Language Pair HOWTO"
Revision as of 14:44, 12 May 2016
This page describes the process of creating a new language pair with Matxin, a dependency-based machine translation system.
Preliminaries
Make a directory called matxin-tur-eng. Then make two more directories, matxin-tur and matxin-eng.
Note that if you are following this HOWTO for your own language pair, then tur should be the ISO 639-3 code of the source language and eng should be the ISO 639-3 code of the target language.
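The directory setup can be sketched in the shell as follows. The variable names src, trg and pair_dir are only illustrative, and placing the two monolingual directories inside the pair directory is an assumption, since the text does not say where they should live.

```shell
# ISO 639-3 codes; swap these for your own language pair
src=tur   # source language
trg=eng   # target language

# Assumption: the monolingual directories sit inside the pair directory
pair_dir="matxin-$src-$trg"
mkdir -p "$pair_dir/matxin-$src" "$pair_dir/matxin-$trg"

# Show what was created
ls "$pair_dir"
```

If you prefer the three directories side by side, drop the "$pair_dir/" prefix from the second and third paths.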
Analysis
There are a number of ways analysis can be done in Matxin: the Spanish to Basque system uses FreeLing, while the English to Basque system uses a wrapper around the Stanford parser. In this tutorial we're going to be using Constraint Grammar (CG) to do dependency parsing of pre-disambiguated sentences. Writing a morphological analyser and morphological disambiguator is out of the scope of this HOWTO, but for more information, check out the following pages:
So, let's assume that you've been through those tutorials and have a morphological analyser capable of analysing and disambiguating sentences in Turkish. You'll give it a sentence like "Dün benim için aldığın birayı içeceğim." and get some output like:
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Save this output into a file, perhaps called input.txt. We'll need it later.
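If you do not have an analyser to hand yet, one way to follow along is to paste the analysed sentence into the file directly. This is a minimal sketch that writes input.txt in the current directory (an assumption about where you want it) and then counts the lexical units as a sanity check.

```shell
# Write the disambiguated sentence shown above to input.txt.
# The quoted 'EOF' stops the shell from interpreting $ and < > characters.
cat > input.txt <<'EOF'
^Dün/dün<adv>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
EOF

# Each lexical unit is delimited by ^ and $; there should be 7 of them
# (six words plus the full stop).
grep -o '\^[^$]*\$' input.txt | wc -l
```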
Now go into the matxin-tur directory, and create a file matxin-tur.tur.deprlx. The first thing we have to do in this file is to specify the delimiters. These are the tokens that CG will use for splitting the input into chunks (like "sentences") for processing:
DELIMITERS = "." ;
Great, now we need to specify some lists of interesting grammatical features that we're going to use to write the rules. First we define some part-of-speech and morphological tags, e.g.
LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms
Etc. It might seem a bit odd to just be repeating the tag in title case, but the point is that it allows us to distinguish the CG sets we define here from the raw tags produced by the morphological analyser.
So, now we've got some morphological information we need to start thinking about the syntactic relations. For the purposes of this tutorial we'll be using the relations and guidelines from the Universal Dependencies project. Here are some examples:
LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency
Add these lists to the file along with the morphological lists. You'll note that they are prefixed with an @ symbol. This is important as it says that this is a syntactic relation/function and not a morphological tag.
After we've defined our terms, next come the rules. In the first section we're going to map the relations to the words in the sentence, so let's start with:
SECTION
In constraint grammar, all rules come in sections.
So, now a rule. Let's start with an easy one. Adverbs nearly always get the @advmod relation, whether they are modifying an adjective or a verb, so we can safely map @advmod to the adverb using the following rule:
MAP @advmod TARGET Adv ;
The MAP rule has two compulsory parts (the tag and the target) and one optional part, a context. Here we have no context.
So now let's save the file and try it out! First though we need to compile the rules:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 1, Sets: 4, Tags: 29
And now try it out:
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post>$ ^aldığın/al<v><tv><gpr_past><px2sg>$ ^birayı/bira<n><acc>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent>$
Great, it works... so now we can write basically the same rule for the verbal adjective aldığın, which should get an @acl tag, the postposition, which should get a @case tag, the accusative, which should get a @dobj tag, and the sentence-final punctuation, which should get a @punct tag.
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
Save it and try it again:
$ cg-comp matxin-tur.tur.deprlx tur.deprlx.bin
Sections: 1, Rules: 5, Sets: 12, Tags: 29
$ cat input.txt | cg-proc tur.deprlx.bin
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$
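A quick way to see which tokens still have no relation is to split the output into one lexical unit per line and filter out those that already carry an @ tag. The sketch below uses only grep and works on a copy of the output shown above; the variable names output and unlabelled are illustrative.

```shell
# Output of the five rules so far, copied from the run above
output='^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg>$^./.<sent><@punct>$'

# grep -o prints each ^...$ unit on its own line;
# grep -v '<@' then drops the units that already have a relation tag.
unlabelled=$(printf '%s\n' "$output" | grep -o '\^[^$]*\$' | grep -v '<@')

# Prints the two remaining units: benim and içeceğim
printf '%s\n' "$unlabelled"
```

In practice the same two grep commands can be appended to the cg-proc pipeline instead of working on a pasted string.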
Great, now for a couple of harder relations, the functions of benim and of içeceğim. As benim is the head of an adpositional phrase, following the Universal Dependencies guidelines we assign it the @nmod relation. How do we know it is the head of an adpositional phrase? Well, for this example, we can go for something like "if the word is a personal pronoun and it is followed by a postposition", so let's give it a go:
MAP @nmod TARGET Pers IF (1 Post) ;
And finally the root of the sentence, the finite verb. In Turkish it is quite typical that the finite verb is the last verb in the sentence, so we can go for something like "if the word is a finite verb and the next token is a full stop and there is no other finite verb to the left, then make it the root".
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
So try those two rules out and we should have a fully labelled input sentence:
^Dün/dün<adv><@advmod>$ ^benim/ben<prn><pers><p1><sg><gen><@nmod>$ ^için/için<post><@case>$ ^aldığın/al<v><tv><gpr_past><px2sg><@acl>$ ^birayı/bira<n><acc><@dobj>$ ^içeceğim/iç<v><tv><fut><p1><sg><@root>$^./.<sent><@punct>$
That covers our sentence, but we want to add one more rule "just in case" ... and that is a default rule that adds the relation @dep to any token that isn't covered by the other rules:
MAP (@dep) TARGET (*) ;
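Putting the pieces together, the whole grammar built up in this section could look like the file written below. This is a sketch assembled from the snippets above, using the file name from the cg-comp command; the order of the rules within the section and the blank lines are a layout choice, with the default @dep rule placed last.

```shell
# Write the complete rule file in one go (quoted 'EOF' keeps @, * and # literal)
cat > matxin-tur.tur.deprlx <<'EOF'
DELIMITERS = "." ;

LIST Adv = adv ;
LIST Pers = (prn pers) ;
LIST Post = post ;
LIST V = v ;
LIST Acc = acc ;
LIST Gen = gen ;
LIST Gpr = gpr_past ;
LIST Sent = sent ;
LIST Fin = fut aor past ; # Finite verb forms

LIST @root = @root ; # The root of the sentence, often a finite verb
LIST @nsubj = @nsubj ; # The nominal subject of the sentence
LIST @advmod = @advmod ; # An adverbial modifier
LIST @case = @case ; # The relation of an adposition to its head
LIST @acl = @acl ; # A clause which modifies a nominal
LIST @nmod = @nmod ; # Nominal modifier
LIST @dobj = @dobj ; # The direct object of the sentence
LIST @punct = @punct ; # Any punctuation
LIST @dep = @dep ; # Any remaining dependency

SECTION

MAP @advmod TARGET Adv ;
MAP @case TARGET Post ;
MAP @acl TARGET Gpr ;
MAP @dobj TARGET Acc ;
MAP @punct TARGET Sent ;
MAP @nmod TARGET Pers IF (1 Post) ;
MAP @root TARGET Fin IF (NOT -1* Fin) (1 Sent) ;
MAP (@dep) TARGET (*) ;
EOF

# Sanity check: the section should contain 8 MAP rules
grep -c '^MAP' matxin-tur.tur.deprlx
```

After writing the file, compile and run it with the same cg-comp and cg-proc commands used earlier.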
Transfer
lttoolbox matxin
Generation
lttoolbox | hfst