Difference between revisions of "Курсы машинного перевода для языков России/Session 4"

Latest revision as of 12:00, 31 January 2012

The objective of this session is to describe the process of lexical transfer and lexical selection. The theory section explains why a transfer lexicon (or bilingual dictionary) cannot always be just a list of correspondences between lemmas/parts-of-speech in one language and lemmas/parts-of-speech in another. In the practice section, we will add three entries: a simple one-to-one entry, a many-to-one entry and a one-to-many entry.

Theory

There are two aspects to lexical transfer. The first is choosing the most adequate translation; the second is marking in the entries those features which need to be inferred by transfer rules. For example, choosing the most frequent translation of the Chuvash word тӳпе into Russian, when there are several possible translations including вершина, крыша and небо, belongs to the first. Deciding the number of the Russian word брюки when translating to Tatar (e.g. чалбар, чалбарлар) belongs to the second.

Translation equivalences

The simple model of translation is that you look up each word in the source language sentence in a list to find the target language equivalent and then substitute it. This is problematic because the relationship between words in the source language, and words in the target language is often many-to-many, not one-to-one — this is because words can be polysemous (have many meanings) and these meanings can have different translations. One way of extending this basic model is to allow words to be distinguished based on lexical category, or part-of-speech. So for example, the Kyrgyz word жаз could be a verb, or a noun, leading to the translations писать or весна in Russian respectively.
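The part-of-speech extension described above can be pictured as a lookup table keyed by lemma and category rather than by lemma alone. The following is only a minimal sketch: the tag names `n` and `v` and the function name `translate` are assumptions made for illustration, not Apertium's actual data structures.

```python
# Toy transfer lexicon keyed by (lemma, part of speech).
# The two entries come from the Kyrgyz example in the text; the tag names
# "n" and "v" are assumptions made for this sketch.
LEXICON = {
    ("жаз", "n"): "весна",
    ("жаз", "v"): "писать",
}


def translate(lemma, pos):
    """Return the Russian equivalent, or None if the pair is not listed."""
    return LEXICON.get((lemma, pos))


print(translate("жаз", "n"))  # весна
print(translate("жаз", "v"))  # писать
```

Keying on the pair means the same surface word can receive different translations once its part of speech is known, which is all this extension of the basic model provides.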

Chuvash        Russian
тӳпе           вершина
тăрă      →    вершина
тӳпе           небо
тӳпе      ←    крыша
...

The pair жаз and весна is an example of a more-or-less one-to-one correspondence: the meanings of жаз in Kyrgyz and весна in Russian match up. However, this is not always the case. Another possibility is that more than one word in the source language translates to a single word in the target language. For example, both of the Russian words вершина and небо can translate to тӳпе in Chuvash. In fact, the relationship between the words is many-to-many.

For machine translation, to say that вершина, небо and крыша all translate to тӳпе is not problematic. That тӳпе has three (or more) translations in Russian is problematic.
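Why one direction is harmless and the other is not can be seen by indexing the same word pairs in both directions. This is a small sketch using only the three pairs named above; the helper name `index` is hypothetical.

```python
from collections import defaultdict

# Russian -> Chuvash pairs taken from the text above.
RU_TO_CV = [("вершина", "тӳпе"), ("небо", "тӳпе"), ("крыша", "тӳпе")]


def index(pairs):
    """Group target words by source word, keeping the listed order."""
    table = defaultdict(list)
    for source, target in pairs:
        table[source].append(target)
    return dict(table)


ru_cv = index(RU_TO_CV)
cv_ru = index((cv, ru) for ru, cv in RU_TO_CV)

# Every Russian word has exactly one Chuvash candidate...
print(all(len(targets) == 1 for targets in ru_cv.values()))  # True
# ...but the Chuvash word has three Russian candidates.
print(cv_ru["тӳпе"])  # ['вершина', 'небо', 'крыша']
```

Translating from Russian never requires a choice here, whereas translating from Chuvash does, and that choice is what lexical selection has to make.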

Using part-of-speech information to disambiguate lexical relationships can reduce the problem, but not eliminate it entirely. Even after the part-of-speech has been disambiguated, one word may have many translations, as in the тӳпе example. So, how can this one-to-many translation problem be solved? The most obvious way is to choose the most frequent or most general translation, in this case the most common of the Russian translations of тӳпе.

Apertium currently has an experimental module to treat the problem of lexical selection, that is, choosing the most appropriate translation of a source language lexical form given its context. The use of multiwords can also offer a partial solution, for example in cases where there are frequent collocations which are exceptions to the general translation (for example, Spanish to French: estación del año → saison de l'année).

Dialect forms

Often it can be desirable to be able to translate dialect forms from one language into standard forms in another. For example, the Russian word мясо "meat" can be translated into Chuvash as аш, какай, or аш-какай. The first two translations are considered more dialectal and are mostly used in the spoken language. The third is more standard and preferred in the written language.

Grammatical divergence

Another problem of lexical transfer is grammatical divergence between the two languages.

  • Words may change gender and number between languages (e.g. Russian брюки — in plural translates to Bashkir салбар — in singular)
  • Words in the source language may be ambiguous for features such as number (French temps to Spanish tiempo or tiempos) or gender (Spanish estudiante — which can be masculine or feminine to French étudiant, masculine or étudiante, feminine).
  • Other features that can be useful for lexical transfer between some languages include whether the target noun is mass or count, for example to decide if an article should be inserted when translating from a language without articles to a language with articles.
  • Also, for languages that have both adjectives which inflect for comparison, and adjectives which don't — it is useful to specify this in the bilingual dictionary.

All of these need to be taken care of in the transfer lexicon (bilingual dictionary).

Practice

For this practical, the examples will be from the Tatar--Bashkir language pair, so navigate to the directory apertium-tt-ba. The bilingual dictionary (or transfer lexicon) is in the file apertium-tt-ba.tt-ba.dix. Open it.

One to one

You may remember that we added the word чалбар to the Tatar dictionary in the first session. We'll keep with this example for the simple one-to-one bilingual dictionary entries.

So, for example, search for the entry >песи<,

    <e><p><l>песи<s n="n"/></l><r>бесәй<s n="n"/></r></p></e>

Copy the entry and change the lexical forms on the source and target side, so you have something that looks like:

    <e><p><l>песи<s n="n"/></l><r>бесәй<s n="n"/></r></p></e>

    <e><p><l>чалбар<s n="n"/></l><r>салбар<s n="n"/></r></p></e>

Save the dictionary and exit out of the text editor, and we can compile the dictionary with the following command:

$ lt-comp lr apertium-tt-ba.tt-ba.dix tt-ba.autobil.bin

We can test it as follows:

$ echo "чалбар" | hfst-proc tt-ba.automorf.hfst  | apertium-tagger -g tt-ba.prob  |\
  apertium-pretransfer | lt-proc -b tt-ba.autobil.bin 
^чалбар<n><nom>/салбар<n><nom>$

This command shows you the input lexical form, and the output lexical form after having been passed through the bilingual dictionary.
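When checking many entries it can help to split this output into its source and target parts programmatically. The sketch below is an illustration rather than an Apertium tool: `parse_biltrans` is a hypothetical helper, and it assumes that no unit contains escaped `^`, `$` or `/` characters.

```python
import re


def parse_biltrans(stream):
    """Split `lt-proc -b` style output into (source, [targets]) pairs.

    Simplification: assumes no escaped '^', '$' or '/' inside a unit.
    """
    units = re.findall(r"\^(.*?)\$", stream)
    result = []
    for unit in units:
        # The first field is the source lexical form, the rest are targets.
        source, *targets = unit.split("/")
        result.append((source, targets))
    return result


print(parse_biltrans("^чалбар<n><nom>/салбар<n><nom>$"))
# [('чалбар<n><nom>', ['салбар<n><nom>'])]
```

A unit with several targets after the source form is one where the bilingual dictionary offers more than one translation.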

Many to one

The next job is to translate from many words to one word. The word морон in Bashkir can be translated as борын in Tatar, but this word is not currently in the dictionaries. We first search for the word борын in the file apertium-tt-ba.tt-ba.dix, where we find the following entry:

    <e><p><l>борын<s n="n"/></l><r>танау<s n="n"/></r></p></e>

We can add an entry to translate морон → борын, but in order to do this we also need to add a direction restriction, so that this entry only applies when translating from Bashkir to Tatar. Restrictions are added to the <e> element with the r attribute and come in two flavours: LR (left-to-right) and RL (right-to-left). The file is called apertium-tt-ba.tt-ba.dix, with Tatar on the left and Bashkir on the right. So, if we want to translate only from Bashkir to Tatar, we need to add a right-to-left (RL) direction restriction. Copy the entry and paste it below, then change the lemma and add the restriction.

    <e><p><l>борын<s n="n"/></l><r>танау<s n="n"/></r></p></e>
    <e r="RL"><p><l>борын<s n="n"/></l><r>морон<s n="n"/></r></p></e>

For the importance of direction restrictions, see Session 7. Now compile the dictionary and test the new entry:

$ lt-comp rl apertium-tt-ba.tt-ba.dix ba-tt.autobil.bin 
main@standard 1228 1995

$ lt-comp lr apertium-tt-ba.tt-ba.dix tt-ba.autobil.bin 
main@standard 1226 1990

$ echo "борын" | hfst-proc tt-ba.automorf.hfst  | apertium-tagger -g tt-ba.prob  |\
    apertium-pretransfer | lt-proc -b tt-ba.autobil.bin 
^борын<n><nom>/танау<n><nom>$

$ echo "танау" | hfst-proc ba-tt.automorf.hfst | apertium-tagger -g ba-tt.prob  |\
   apertium-pretransfer | lt-proc -b ba-tt.autobil.bin 
^танау<n><nom>/борын<n><nom>$

$ echo "морон" | hfst-proc ba-tt.automorf.hfst | apertium-tagger -g ba-tt.prob  |\
   apertium-pretransfer | lt-proc -b ba-tt.autobil.bin 
^морон<n><nom>/борын<n><nom>$

We use the direction restrictions when we never want to be able to translate a word in that direction, e.g. in the case of dialect or substandard forms. But sometimes it might be desirable to have two possible translations.
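The effect of the `r` attribute on each compiled direction can be modelled in a few lines. This is a simplified sketch of the idea and not how `lt-comp` is actually implemented: the function `build` is hypothetical, the two entries are wrapped in an invented root element so they can be parsed, and the tags inside `<l>` and `<r>` are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two entries from the text, wrapped in a root element for parsing.
DIX = """
<section>
  <e><p><l>борын<s n="n"/></l><r>танау<s n="n"/></r></p></e>
  <e r="RL"><p><l>борын<s n="n"/></l><r>морон<s n="n"/></r></p></e>
</section>
"""


def build(dix_xml, direction):
    """Return {source lemma: [target lemmas]} for 'lr' or 'rl'."""
    skip = "RL" if direction == "lr" else "LR"
    table = defaultdict(list)
    for e in ET.fromstring(dix_xml.strip()).iter("e"):
        if e.get("r") == skip:
            continue  # entry is restricted to the other direction
        left = e.find("p/l").text
        right = e.find("p/r").text
        source, target = (left, right) if direction == "lr" else (right, left)
        table[source].append(target)
    return dict(table)


print(build(DIX, "lr"))  # {'борын': ['танау']}
print(build(DIX, "rl"))  # {'танау': ['борын'], 'морон': ['борын']}
```

The restricted entry disappears from the left-to-right table, so борын still has a single translation in that direction, while both Bashkir words are accepted in the other.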

One to many

Finnish        North Sámi
pitää     ↔    doallat
pitää     →    berret
pitää     →    liikot
pitää          coakcut
pitää     ←    galgat

For this section of the practical, you need to be in the apertium-sme-fin directory. The bilingual dictionary is apertium-sme-fin.sme-fin.dix.

In this subsection we will add some entries for lexical forms with more than one translation, and write lexical selection rules to select the non-default translations.

If we take the example of pitää in Finnish, we find that it can be translated with several words in North Sámi. To add these translations, including the non-default ones, we can write the following entries:

  <e c="hold (acc)"><p><l>doallat<s n="V"/><s n="TV"/></l><r>pitää<s n="V"/></r></p><par n="V_V"/></e>
  <e c="ought to"><p><l>berret<s n="V"/><s n="IV"/></l><r>pitää<s n="V"/></r></p><par n="V_V"/></e>
  <e c="like (ela)"><p><l>liikot<s n="V"/><s n="IV"/></l><r>pitää<s n="V"/></r></p><par n="V_V"/></e>
  <e c="get a foothold"><p><l>coakcut<s n="V"/><s n="IV"/></l><r>pitää<s n="V"/></r></p><par n="V_V"/></e>

If we also want to add a translation from galgat → pitää, then we need to add another entry, this time marking it with LR so that it is used only when translating from North Sámi to Finnish.

  <e r="LR"><p><l>galgat<s n="V"/><s n="IV"/></l><r>pitää<s n="V"/></r></p><par n="V_V"/></e><!-- skulle, should -->

Compile the dictionary in the usual way:


$ lt-comp rl apertium-sme-fin.sme-fin.dix fin-sme.autobil.bin 
main@standard 15454 19523

And try out the new entries as follows:

$ echo "Minä pidän kirjan." | hfst-proc fin-sme.automorf.hfst  | cg-proc fin-sme.rlx.bin  | apertium-tagger -g fin-sme.prob  | lt-proc -b fin-sme.autobil.bin 
^Mikä<Pron><Interr><Sg><Ess>/Mii<Pron><Interr><Sg><Ess>$ 
^pitää<V><Act><Ind><Prs><Sg1><@+FMAINV>/berret<V><IV><Ind><Prs><Sg1><@+FMAINV>/liikot<V><IV><Ind><Prs><Sg1><@+FMAINV>/doallat<V><TV><Ind><Prs><Sg1><@+FMAINV>/coakcut<V><IV><Ind><Prs><Sg1><@+FMAINV>$ 
^kirja<N><Sg><Gen><@←OBJ>/girji<N><Sg><Gen><@←OBJ>$^.<Punct><CLB>/.<CLB>$

In the case of ambiguity in the lexical transfer, the transfer component will pick the first translation to continue with. If the first translation is not the desired one, a lexical selection rule can be written to choose a different one.

Make a file apertium-sme-fin.fin-sme.lrx, and paste the following text:

<rules>
  <rule> 
    <match lemma="pitää" tags="V.*">
      <select lemma="doallat" tags="V.TV.*"/>
    </match>
  </rule>
</rules>

You can compile this rule using apertium-lrx-comp:

$ apertium-lrx-comp apertium-sme-fin.fin-sme.lrx fin-sme.lrx.bin
1
Written 1 rules, 2 patterns.

And test it as follows:

$ echo "Minä pidän kirjan." | hfst-proc fin-sme.automorf.hfst  | cg-proc fin-sme.rlx.bin  | apertium-tagger -g fin-sme.prob  | lt-proc -b fin-sme.autobil.bin | apertium-lrx-proc fin-sme.lrx.bin 
^Mikä<Pron><Interr><Sg><Ess>/Mii<Pron><Interr><Sg><Ess>$ 
^pitää<V><Act><Ind><Prs><Sg1><@+FMAINV>/doallat<V><TV><Ind><Prs><Sg1><@+FMAINV>$ 
^kirja<N><Sg><Gen><@←OBJ>/girji<N><Sg><Gen><@←OBJ>$^.<Punct><CLB>/.<CLB>$

As you can see, the default translation doallat has been chosen. But what if we want to choose a non-default translation in certain contexts? When pitää is followed by an elative, as opposed to an accusative/genitive, a better translation is liikot, so let's add a rule for that:

  <rule>
    <match lemma="pitää" tags="V.*">
      <select lemma="liikot" tags="V.*"/>
    </match>
    <match tags="*.Ela"/>
  </rule>

You can compile it again and try this rule with a sentence like Minä pidän sinusta, "I like you".
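The combined behaviour of the two rules, a default plus a context-dependent exception, can be sketched as a small function. This is a toy model and not the behaviour of `apertium-lrx-proc` itself: `select_translation` is a hypothetical name, and the tag lists passed for the following word are assumptions chosen for illustration.

```python
def select_translation(candidates, next_tags, default="doallat"):
    """Pick one North Sámi lemma for Finnish pitää.

    candidates: target lemmas offered by the bilingual dictionary.
    next_tags:  tags of the following word (empty list if there is none).
    """
    if "Ela" in next_tags and "liikot" in candidates:
        return "liikot"   # pitää + elative: "to like"
    if default in candidates:
        return default    # fallback, mirrors the first rule
    return candidates[0]  # otherwise keep the first translation


CANDIDATES = ["berret", "liikot", "doallat", "coakcut"]

print(select_translation(CANDIDATES, ["N", "Sg", "Gen"]))              # doallat
print(select_translation(CANDIDATES, ["Pron", "Pers", "Sg2", "Ela"]))  # liikot
```

In this sketch the contextual check comes before the default, so the more specific condition wins whenever it applies.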

Lexical selection

There is a separate handout on using lexical selection rules.
