Курсы машинного перевода для языков России/Session 5

From Apertium

Latest revision as of 12:00, 31 January 2012

The aim of this session is to give an introduction to the idea of contrastive morphological and syntactic analysis, and show how basic (that is, single level) morphological and syntactic transfer rules can be made in Apertium.

Theory

Contrastive analysis

Contrastive analysis is the process of examining two or more languages together to find out which features they share and how they differ. When working on shallow-transfer machine translation, we can consider, for example, both morphological contrasts and syntactic contrasts.

Morphological contrasts

Morphological contrasts are differences in which morphological features are expressed in each language, and in how they are expressed. For example, one language may express definiteness morphologically where another doesn't, one language may have case or gender where the other doesn't, or the case or gender inventories may differ between the languages. Examples:

  • Kazakh has no grammatical gender where Russian has three (masculine, feminine and neuter). Gender in Russian plays a rôle in agreement within noun phrases.
    • Жаңа әуежай (kk) → Новый аэропорт (ru)
  • Russian has two numbers (singular and plural), where Mansi has three (singular, dual and plural).
    • Я читаю эту книгу (ru) → Ам ти книга ловиньтилум (mns)
    • Я читаю эти две книги (ru) → Ам ти книгаг ловиньтиягум (mns)
    • Я читаю эти книги (ru) → Ам ти книгат ловиньтиянум (mns)
  • Hungarian has a large number of cases which don't exist in Finnish (for example the temporal case -kor, which can be translated into Finnish by the ablative, adessive, or essive). Case in Hungarian and Finnish does not play a rôle in noun-phrase agreement.
    • Ötkor (hu) → Viideltä (fi)
    • Éjfélkor (hu) → Keskiyöllä (fi)
    • Karácsonykor (hu) → Jouluna (fi)

Syntactic contrasts

Syntactic contrasts are differences in how the syntax of the languages works, for example the existence or not of articles, ordering of clitic pronouns, case inventories, analytic verb tenses, usage of subject pronouns, etc.

  • Abkhaz has a postfixed indefinite article and a prefixed definite article, where Russian has no articles.
    • Сара кьылак акәац аасхәоит (ab) → Я покупаю килограмм мяса (ru)
    • Иара амшын ахь дцоит. (ab) → Он идёт в море. (ru)
  • Russian has compulsory subject pronouns, where Hungarian does not.
    • Я сплю. (ru) → Alszom. (hu)
  • Chuvash has an ablative case, where Russian uses a preposition and a different case.
    • Я получил письмо от друга. (ru) → Юлташран ҫыру илтӗм. (cv)

Transfer

Transfer is the process of altering the intermediate representation of a source language into that of the target language. In Apertium, transfer works on an intermediate representation of lemmas and tags specifying morphological features (lexical forms).

Basic operations

When thinking about implementing transfer rules, it is often worth trying to think in terms of some very basic operations.

A) Insertion

Insertion is the operation of adding a new tag or word. For example, to translate a noun from Dungan to Russian, a tag denoting case (nominative, accusative, ...) would need to be inserted, as Dungan does not have this morphological feature. Translation often also requires the insertion of a word. For example, when translating the negative imperative from Turkish to Chuvash, the adverb ан needs to be added before the verb: Yeme! → Ан ҫи!

B) Deletion

Deletion is the opposite operation to insertion: a word or tag needs to be removed. In the previous example, once the adverb has been inserted, the negative needs to be deleted from (or not output with) the following verb, as it is the adverb which provides the negation.

C) Substitution

Substitution is replacing one tag with another, for example changing the gender of a word. This is often done in the transfer lexicon, as we saw in the last session, but it can also take place in transfer. For example, when translating мои брюки from Russian to Bashkir, the word брюки would be marked with a special tag ND (number to be determined) in the transfer lexicon, and the transfer rules would then substitute this with the appropriate number.

D) Reordering

Reordering is changing the order of tags or words, for example the order of number and possessive when translating from Turkish into Chuvash: kitap·lar·ım → кӗнеке·м·сем.

E) Combining operations

Often a combination of these operations is required to make a transfer rule. For example, when translating from a language that uses a case for a given meaning into a language that uses a preposition and a case for the same meaning, two operations are required: first the preposition needs to be inserted before the noun, and secondly the feature denoting case on the noun needs to be changed.
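
The basic operations can be sketched on a toy model in which a lexical unit is a lemma plus an ordered list of tags. This is only an illustration in Python, not how apertium-transfer is implemented; the helper names (insert_tag, delete_tag, substitute_tag, reorder, render) and the tag nom are assumptions made for the example.

```python
# Toy model of the basic transfer operations (hypothetical helpers,
# not part of Apertium). A lexical unit is a lemma plus a list of tags.

def insert_tag(tags, tag, index):
    """Insertion: return a copy of tags with tag added at position index."""
    return tags[:index] + [tag] + tags[index:]

def delete_tag(tags, tag):
    """Deletion: return a copy of tags with every occurrence of tag removed."""
    return [t for t in tags if t != tag]

def substitute_tag(tags, old, new):
    """Substitution: return a copy of tags with old replaced by new."""
    return [new if t == old else t for t in tags]

def reorder(items, i, j):
    """Reordering: return a copy of items with positions i and j swapped."""
    out = list(items)
    out[i], out[j] = out[j], out[i]
    return out

def render(lemma, tags):
    """Write a lemma and its tags as a lexical unit between ^ and $."""
    return "^" + lemma + "".join("<" + t + ">" for t in tags) + "$"

if __name__ == "__main__":
    # Deletion: drop the negative from the verb of the negative imperative.
    verb = delete_tag(["v", "tv", "neg", "imp", "p2", "sg"], "neg")
    print(render("ҫи", verb))
    # Substitution: resolve ND (number to be determined) to plural.
    print(substitute_tag(["n", "ND"], "ND", "pl"))
    # Reordering: swap the number and possessive morphemes.
    print(reorder(["kitap", "lar", "ım"], 1, 2))
```

A combined operation is then just a sequence of these calls, for example inserting one lexical unit and substituting a tag on the next.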

Left-to-right longest match (LRLM)

Transfer structure

Files containing structural transfer rules in Apertium are laid out in the following fashion:

<transfer default="chunk">
  <section-def-cats>
    <def-cat n="adj_or_pp">
      <cat-item tags="adj.*"/>
      <cat-item tags="vblex.pp.*"/>
    </def-cat>
    ...
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item n="sg"/>
      <attr-item n="pl"/>
    </def-attr>
    ...
  </section-def-attrs>
  <section-def-vars>
    <def-var n="number"/> 
    ...
  </section-def-vars>
  <section-rules>
    ...
  </section-rules>
</transfer>

This is the minimal layout. The four sections are:

  • <section-def-cats>: Contains one or more <def-cat> entries. These specify patterns that can be matched by the transfer rules, and may match either tag sequences or lemmas. The <def-cat> entry above for adj_or_pp matches any lexical unit in which the tags start with <adj> or <vblex><pp>. This might be useful where two categories behave similarly (for example adjectives and past participles in Spanish).
  • <section-def-attrs>: Contains one or more <def-attr> entries. These list the possible tags corresponding to a feature. For example, in this case the feature nbr "number" may be one of two tags, <sg> "singular" or <pl> "plural". These entries let us define our own parts, in addition to the default parts, which can then be extracted with the <clip> tag.
  • <section-def-vars>: Contains variable definitions. Variables are used to pass information between rules. For example we might want to keep track of the last gender or number that we have seen, or the lemma of the last finite verb.
  • <section-rules>: Contains the rules.
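
To make the idea of a <def-cat> pattern more concrete, here is a minimal Python sketch of how a tags value such as adj.* could be compared with the tags of a lexical unit. It assumes a simplified reading in which * stands for one or more tags; the real matcher in Apertium may behave differently, and the function names are invented for the illustration.

```python
import re

def cat_item_to_regex(tags_pattern):
    """Turn a cat-item tags value such as 'adj.*' into a regular expression.

    Assumption for this sketch: '*' stands for one or more tags.
    """
    parts = []
    for piece in tags_pattern.split("."):
        if piece == "*":
            parts.append(r"(?:<[^<>]+>)+")
        else:
            parts.append("<" + re.escape(piece) + ">")
    return re.compile("".join(parts))

def matches(tags_pattern, tag_string):
    """Return True if the whole tag string fits the pattern."""
    return cat_item_to_regex(tags_pattern).fullmatch(tag_string) is not None

if __name__ == "__main__":
    print(matches("adj.*", "<adj><sg>"))
    print(matches("vblex.pp.*", "<vblex><pp><m><sg>"))
    print(matches("adj.*", "<n><sg>"))
```

A category such as adj_or_pp would then match a lexical unit if any one of its cat-item entries matches.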

Default parts

In addition to the parts that are defined in section-def-attrs, the transfer module predefines a number of default parts of each lexical unit.

To more easily demonstrate the default parts that are available, we'll use the following example:

^правило# по безопасности<n><nt><nn><pl>$
  • whole: as the name suggests, this provides the whole content of the lexical unit.
  • lem: the lemma: in this case, правило# по безопасности
  • tags: all of the tags, in one unit: <n><nt><nn><pl>
  • lemh: the lemma head; the part of the multiword which inflects: правило
  • lemq: the lemma queue; the rest of a multiword (the <g></g> part in the dictionaries): # по безопасности
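
The listing above can be reproduced with a short Python sketch that splits a lexical unit into the same five parts. It only follows the example as written here (tags after the whole lemma, queue starting at #); the function name default_parts is hypothetical and the transfer module itself does not work this way internally.

```python
def default_parts(lexical_unit):
    """Split a lexical unit such as '^lemma# queue<tag1><tag2>$' into
    the default parts whole, lem, tags, lemh and lemq."""
    body = lexical_unit.strip()
    # Drop the surrounding ^ and $ if they are present.
    if body.startswith("^") and body.endswith("$"):
        body = body[1:-1]
    # Everything before the first '<' is the lemma, the rest is the tags.
    tag_start = body.find("<")
    if tag_start == -1:
        lem, tags = body, ""
    else:
        lem, tags = body[:tag_start], body[tag_start:]
    # The head is the part before '#', the queue starts at '#'.
    hash_pos = lem.find("#")
    if hash_pos == -1:
        lemh, lemq = lem, ""
    else:
        lemh, lemq = lem[:hash_pos], lem[hash_pos:]
    return {"whole": body, "lem": lem, "tags": tags, "lemh": lemh, "lemq": lemq}

if __name__ == "__main__":
    parts = default_parts("^правило# по безопасности<n><nt><nn><pl>$")
    for name in ("whole", "lem", "tags", "lemh", "lemq"):
        print(name, "=", parts[name])
```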

Practice

For this practical, we're going to use the Turkish--Chuvash language pair in Apertium; you could also choose another pair with single-level transfer (such as Tatar--Bashkir). So, go to the directory apertium-cv-tr and open the file of transfer rules for Turkish to Chuvash, apertium-cv-tr.tr-cv.t1x; in other pairs this might be a .t1x file.

Describe a transfer rule

The objective of this exercise is to describe the behaviour of an existing transfer rule given the input sentence in Turkish: Yeme!. First we're going to give some preliminaries for how transfer rules are structured, then we're going to look at an actual rule and describe it.

Transfer rules in Apertium are made up of two principal parts: a pattern, which is a sequence of categories to be matched, and an action, which contains the operations that are carried out on the lexical units matched. An overall schema for a rule might look like the following:

<rule>
 <pattern>
        ...
 </pattern>
 <action>
        ...
 </action>
</rule>

An overview of the meanings of the tags used in the example below is given here, along with references to the section of the documentation where a full description can be found.

Name | Doc. ref. | Description
<rule> | §3.5.4.18 | Starts a new rule, it contains at the highest level a pattern and an action.
<pattern> | §3.5.4.19 | Contains one or more pattern-item tags that define the pattern of lexical forms to be matched.
<pattern-item> | §3.5.4.20 | Contains references to patterns of lexical forms defined in the section-def-cats part of the rule file.
<action> | §3.5.4.21 | Contains tags defining the actions that should take place when a sequence of lexical units matching pattern is matched.
<choose> | §3.5.4.24 | Contains one or more when statements which define different actions depending on different conditions, and optionally an otherwise statement, which defines a default action if none of the whens are matched.
<when> | §3.5.4.25 | Specifies a condition and an action to take if that condition is fulfilled. The condition is in a test tag, and the action comes after.
<otherwise> | §3.5.4.26 | A default condition, which contains actions that are run if none of the when conditions in a choose block are matched.
<test> | §3.5.4.27 | Describes a condition. For example "test if the number of the first pattern matched is singular".
<or> | §3.5.4.28 | A tag allowing more than one condition to be matched. This tag is a Boolean operator, other Boolean operators available are: <and> and <not>
<equal> | §3.5.4.28 | Tests if two strings or tags are equal.
<clip> | §3.5.4.29 | Extracts the part of a lexical unit corresponding to an attribute as defined in the section-def-attrs section of the transfer file. The part to be extracted is specified in the part attribute.
<lit-tag> | §3.5.4.31 | Generates a string enclosed in < >, that is to say, a tag.
<lit> | §3.5.4.30 | Generates a string.
<out> | §3.5.4.40 | Contains everything that will be output by the rule.
<lu> | §3.5.4.41 | Encloses the contents in ^ and $, that is defines the contents as a lexical unit.
<b> | §3.5.4.46 | Outputs a space, or the formatting contained at the given position.
<get-case-from> | §3.5.4.34 | Converts the case (e.g. upper, lower) of whatever is enclosed in it to the case of the word marked by the pos attribute.

The rule we're going to describe is regla: v_neg_imp, so search for it in the file. Here is what the rule looks like:


    <rule comment="regla: v_neg_imp (yeme! → ан ҫи!)">
      <pattern>
        <pattern-item n="v_neg_imp"/>
      </pattern>
      <action>
        <let>
          <clip pos="1" side="tl" part="a_neg"/>
          <lit v=""/>
        </let>
        <out>
          <lu>
            <lit v="ан"/>
            <lit-tag v="adv"/>
          </lu>
          <b/>
          <lu>
            <clip pos="1" side="tl" part="whole"/>
          </lu>
        </out>
      </action>
    </rule>

For reference, the pattern being matched is defined as:

  <section-def-cats>
    ...
    <def-cat n="v_neg_imp">
      <cat-item tags="v.*.neg.imp.*"/>
    </def-cat>

    ...
  </section-def-cats>

Description

So, given the description of the tags, and the contents of the rule, we could describe the rule as follows:

  • This rule matches a single lexical unit of the category "v_neg_imp".
  • The lexical form pattern "v_neg_imp" is defined, in the <section-def-cats> of the file, as the tag <v> followed by any tag (e.g. <iv> or <tv>), followed by the tags <neg><imp> and then any other tags.
  • The rule then sets the value of the part of the target-language lexical unit which matches the attribute a_neg to nothing (meaning it "deletes" it).
  • The rule outputs the lemma "ан", followed by the tag <adv>, in a single lexical unit (between ^ and $).
  • The rule then outputs a blank space.
  • The rule then outputs a lexical unit with the information from the verb matched by the pattern, now without the negation.
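
The steps in this description can be mimicked in a few lines of Python, which may help when checking your own description of a rule. This is a sketch under stated assumptions: the attribute a_neg is assumed to contain only the tag neg (its definition is not shown above), * is read as one or more tags, and apply_v_neg_imp is a made-up name, not an Apertium function.

```python
import re

# Assumed contents of the a_neg attribute; the real definition is not shown.
A_NEG = ["neg"]

# Simplified reading of the category v_neg_imp: v.*.neg.imp.*
V_NEG_IMP = re.compile(r"[^<>]*<v>(?:<[^<>]+>)+<neg><imp>(?:<[^<>]+>)+")

def apply_v_neg_imp(source_lu, target_lu):
    """Mimic the rule: match on the source side, then output the adverb
    followed by the target-side verb with the a_neg part emptied.
    Both arguments are lexical units without the surrounding ^ and $."""
    if V_NEG_IMP.fullmatch(source_lu) is None:
        return None  # the pattern does not match, so the rule does not apply
    # <let>: set the a_neg part of the target-language unit to nothing.
    for tag in A_NEG:
        target_lu = target_lu.replace("<" + tag + ">", "", 1)
    # <out>: the adverb, a blank, then the whole (modified) verb.
    return "^ан<adv>$ ^" + target_lu + "$"

if __name__ == "__main__":
    print(apply_v_neg_imp("ye<v><tv><neg><imp><p2><sg>",
                          "ҫи<v><tv><neg><imp><p2><sg>"))
```

Matching happens on the source-language side, while the <let> and the final <clip> both work on the target-language side, which is why the function takes both forms.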

Example output

Input text
$ echo 'yeme!' | hfst-proc tr-cv.automorf.hfst | cg-proc tr-cv.rlx.bin | apertium-tagger -g tr-cv.prob

^ye<v><tv><neg><imp><p2><sg>$^!<sent>$^.<sent>$
Output from transfer
$ echo 'yeme!' | hfst-proc tr-cv.automorf.hfst | cg-proc tr-cv.rlx.bin | apertium-tagger -g tr-cv.prob |\
   apertium-pretransfer | lt-proc -b tr-cv.autobil.bin | apertium-lrx-proc tr-cv.lrx.bin  |\
   apertium-transfer -b apertium-cv-tr.tr-cv.t1x tr-cv.t1x.bin 

^ан<adv>$ ^ҫи<v><tv><imp><p2><sg>$^!<sent>$

If you want to trace which rules are matched, add the -t option to apertium-transfer:

Output from transfer
$ echo 'yeme!' | hfst-proc tr-cv.automorf.hfst | cg-proc tr-cv.rlx.bin | apertium-tagger -g tr-cv.prob |\
   apertium-pretransfer | lt-proc -b tr-cv.autobil.bin | apertium-lrx-proc tr-cv.lrx.bin  |\
   apertium-transfer -t -b apertium-cv-tr.tr-cv.t1x tr-cv.t1x.bin 

apertium-transfer: Rule 3 ye<v><tv><neg><imp><p2><sg>/ҫи<v><tv><neg><imp><p2><sg>
^ан<adv>$ ^ҫи<v><tv><imp><p2><sg>$^!<sent>$
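
When comparing output like this by hand, it can be convenient to split the stream into lexical units and to pull the rule number out of the trace line. The Python sketch below does this for the lines shown above; it ignores escaping and lemmas containing spaces, and the function names and the assumed shape of the trace line are taken only from this example.

```python
import re

def split_stream(stream):
    """Return the contents of each lexical unit (the text between ^ and $).
    Simplification: escaped characters are not handled."""
    return re.findall(r"\^(.*?)\$", stream)

def parse_trace(line):
    """Parse a trace line of the form shown above into the rule number and
    a list of tuples split on the first '/' (source side, target side)."""
    m = re.match(r"apertium-transfer: Rule (\d+) (.*)$", line)
    if m is None:
        return None
    rule_number = int(m.group(1))
    pairs = [tuple(item.split("/", 1)) for item in m.group(2).split(" ")]
    return rule_number, pairs

if __name__ == "__main__":
    print(split_stream("^ан<adv>$ ^ҫи<v><tv><imp><p2><sg>$^!<sent>$"))
    print(parse_trace("apertium-transfer: Rule 3 "
                      "ye<v><tv><neg><imp><p2><sg>/ҫи<v><tv><neg><imp><p2><sg>"))
```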

Describe a new transfer rule

The objective of this part of the practical is to describe a new transfer rule. Run some text through the translator of your choice, and describe a new rule in natural language, as was done in the previous subsection. If you have time after defining the rule, try and write it out in XML format.

Further reading