Difference between revisions of "Курсы машинного перевода для языков России/Session 6"

 
(30 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{TOCD}}
 
{{TOCD}}
   
While the basic structural transfer described in [[/Session 5|session 5]] is enough to deal with the majority of frequent divergences between closely related languages (for example Bashkir and Tatar or Finnish and Kven), when working with languages with more divergent morphology and syntax, a more powerful structural transfer module is necessary. This session describes the Apertium 3+ level transfer system which was designed to allow easier treatment of longer patterns, and more divergent languages.
+
While the basic structural transfer described in [[Машинный перевод для языков России/Session 5|session 5]] is enough to deal with the majority of frequent divergences between closely related languages (for example Bashkir and Tatar or Finnish and Kven), when working with languages with more divergent morphology and syntax, a more powerful structural transfer module is necessary. This session describes the Apertium 3+ level transfer system which was designed to allow easier treatment of longer patterns, and more divergent languages.
   
 
==Theory==
 
==Theory==
Line 27: Line 27:
 
| <code>num nom</code> || икĕ ҫурт || <code>SN{num nom}</code> || два дома
 
| <code>num nom</code> || икĕ ҫурт || <code>SN{num nom}</code> || два дома
 
|-
 
|-
| <code>num nom</code> || пилĕк ҫурт || <code>SN{num nom}</code> || пять домов
+
| <code>num nom</code> || пилĕ ҫурт || <code>SN{num nom}</code> || пять домов
 
|-
 
|-
 
| <code>adj nom</code> || хитре ҫуртсем || <code>SN{adj nom}</code> || красивые домы
 
| <code>adj nom</code> || хитре ҫуртсем || <code>SN{adj nom}</code> || красивые домы
Line 33: Line 33:
 
| <code>adv adj nom</code> || питĕ хитре ҫурт || <code>SN{adv adj nom}</code> || очень красивый дом
 
| <code>adv adj nom</code> || питĕ хитре ҫурт || <code>SN{adv adj nom}</code> || очень красивый дом
 
|-
 
|-
| <code>num adv adj nom</code> || пилĕк питĕ хитре ҫурт || <code>SN{num adv adj nom}</code> || пять очень красивых домов
+
| <code>num adv adj nom</code> || пилĕ питĕ хитре ҫурт || <code>SN{num adv adj nom}</code> || пять очень красивых домов
 
|-
 
|-
 
|}
 
|}
Line 45: Line 45:
 
! Input pattern !! Example !! Output chunk !! Example
 
! Input pattern !! Example !! Output chunk !! Example
 
|-
 
|-
| <code>verb</code> || komt || <code>V{verb}</code> || читает
+
| <code>verb</code> || вулать || <code>V{verb}</code> || читает
 
|-
 
|-
| <code>verb neg_adv</code> || komt niet || <code>V{neg_adv verb}</code> || не читает
+
| <code>verb</code> || вуламасть || <code>V{neg_adv verb}</code> || не читает
 
|-
 
|-
| <code>zijn neg_adv pp</code> || is niet gekomen || <code>V{neg_adv haber pp}</code> || не читал
+
| <code>verb</code> || вуларĕ || <code>V{verb}</code> || читал
  +
|-
 
| <code>verb</code> || вуламĕ || <code>V{aux inf}</code> || будет читать
  +
|-
  +
| <code>verb</code> || вуламарĕ || <code>V{neg_adv aux inf}</code> || не будет читать
  +
|-
  +
| <code>verb</code> || вуласшăн || <code>V{aux part inf}</code> || хотел бы говорить
  +
|-
  +
| <code>verb</code> || вулӑттӑм || <code>V{verb part}</code> || говорил бы
  +
|-
  +
| <code>adv verb</code> || ан вула ! || <code>V{adv verb}</code> || не читай !
  +
|-
  +
| <code>ger verb</code> || вулама пуçлать || <code>V{verb inf}</code> || начинает читать.
 
|-
 
|-
| <code>aux neg_adv inf</code> || zal niet komen || <code>V{neg_adv verb}</code> || не будет читать
 
 
|}
 
|}
 
   
 
Thus, if we want to concord a noun phrase with a main verb, we can just use one rule (for <code>SN V</code>) in the second module of the transfer (the ''interchunk'') instead of having separate rules for <code>nom verb</code>, <code>adj nom verb</code>, <code>det adj nom verb</code>, etc.
 
Thus, if we want to concord a noun phrase with a main verb, we can just use one rule (for <code>SN V</code>) in the second module of the transfer (the ''interchunk'') instead of having separate rules for <code>nom verb</code>, <code>adj nom verb</code>, <code>det adj nom verb</code>, etc.
Line 65: Line 75:
 
Once these chunks are made, the next module ''interchunk'' allows operations to be made between chunks as if they were lexical units in themselves: chunks are used as a level of abstraction, so that equivalent words and phrases can be translated using the same rules.
 
Once these chunks are made, the next module ''interchunk'' allows operations to be made between chunks as if they were lexical units in themselves: chunks are used as a level of abstraction, so that equivalent words and phrases can be translated using the same rules.
   
  +
As well as gender concordance and word reordering, this allows person 'detection' &mdash; for example to concord a verb in the past tense in Chuvash with the pronoun in the sentence. In the Russian sentence ''Я вчера читалa'', the chunker would give the following output:
Consider the Spanish sentences:
 
 
* ''El hombre vio el perro'',
 
* ''El hombre ha visto el perro'',
 
* ''El hombre alto podría haber visto el perro blanco''
 
 
Each of these sentences would produce the same output chunks in the chunker: 'nominal chunk' 'verbal chunk' 'nominal chunk' &mdash; which interchunk then performs a second set of structural transformations on.
 
 
As well as gender concordance and word reordering, this allows gender 'detection'. Consider the Spanish word 'doctor', which has a feminine form 'doctora'. In the Spanish analyser, 'doctora' is analysed as a form of 'doctor', rather than as a separate word in its own right, and in the bilingual dictionary it has the tag 'GD' attached. In the Dutch sentence 'Maria is een dokter', the chunker would give the following output:
 
   
 
<pre>
 
<pre>
^Nom<SN><UNDET><f><sg>{^Maria<np><ant><3><4>$}$
+
^pron<SN><p1><mf><sg><nom>{^Эпĕ<prn><pers><2><3><4><5>$}$
  +
^adv<ADV>{^ĕнер<adv>$}$
^zijn<Vcop><vbser><pri><p3><sg>{^ser<vbser><3><4><5>$}$
 
^det_nom<SN><DET><GD><sg>{^uno<det><ind><3><4>$ ^doctor<n><3><4>$}$
+
^verb<SV><imperf><tv><evid><PD><f><sg>{^вула<v><3><4><5><7>$}$
 
</pre>
 
</pre>
   
 
The format of chunks is much like that of lexical units, <code>^</code> indicates the start, and <code>$</code> the end. The difference being that a chunk may contain other lexical units within <code>{</code> and <code>}</code>.
 
The format of chunks is much like that of lexical units, <code>^</code> indicates the start, and <code>$</code> the end. The difference being that a chunk may contain other lexical units within <code>{</code> and <code>}</code>.
   
The lexical units inside the chunk (between the <code>{</code> and <code>}</code> signs) cannot be accessed or modified in the interchunk; here you can only access or modify elements from the description of the chunk, which is the part after <code>^</code> and before the first <code>{</code>. The description of the chunk contains the lemma of the chunk (like <code>det_nom</code> in the previous example) and the morphological tags of the chunk (which for <code>det_nom</code> are {{tag|SN><DET><GD><sg}}). These tags can be linked with the lexical forms inside the chunk: this is the reason for the numbers {{tag|3}} and {{tag|4}} in the lexical forms of the <code>det_nom</code> chunk: they are linked with the third and fourth tags of the chunk ({{tag|GD}} and {{tag|sg}}) and will be substituted for them in the postchunk module.
+
The lexical units inside the chunk (between the <code>{</code> and <code>}</code> signs) cannot be accessed or modified in the interchunk; here you can only access or modify elements from the description of the chunk, which is the part after <code>^</code> and before the first <code>{</code>. The description of the chunk contains the lemma of the chunk (like <code>pron</code> in the previous example) and the morphological tags of the chunk (which for <code>pron</code> are {{tag|SN><p1><mf><sg><nom}}).
   
  +
These tags can be linked with the lexical forms inside the chunk: this is the reason for the numbers {{tag|5}} and {{tag|7}} in the lexical forms of the <code>verb</code> chunk: they are linked with the fifth and seventh tags of the chunk ({{tag|PD}} and {{tag|sg}}) and will be substituted for them in the postchunk module.
Interchunk has a rule for 'nominal chunk' 'copula' 'nominal chunk', which copies the gender from the first nominal chunk to the second, replacing the 'GD' tag; in this example, giving it the feminine value:
 
  +
 
Interchunk has a rule for 'nominal chunk' 'adv' 'verb chunk', which copies the person from the first nominal chunk to the verb chunk, replacing the 'PD' tag; in this example, giving it the {{tag|p1}} (first person) value:
   
 
<pre>
 
<pre>
^Nom<SN><PDET><f><sg>{^Maria<np><ant><3><4>$}$
+
^pron<SN><p1><mf><sg><nom>{^Эпĕ<prn><pers><2><3><4><5>$}$
  +
^adv<ADV>{^ĕнер<adv>$}$
^zijn<Vcop><vbser><pri><p3><sg>{^ser<vbser><3><4><5>$}$
 
^det_nom<SN><DET><f><sg>{^uno<det><ind><3><4>$ ^doctor<n><3><4>$}$
+
^verb<SV><imperf><tv><evid><p1><f><sg>{^вула<v><3><4><5><7>$}$
 
</pre>
 
</pre>
   
The postchunk module will assign this tag to the determiner and the noun inside the chunk.
+
The postchunk module will assign this tag to the verb inside the chunk.
   
 
====Postchunk====
 
====Postchunk====
Line 100: Line 104:
   
 
Changes made on the chunks in the interchunk module, will be applied to the contents of the chunk: tags containing a number will be substituted for the value of the corresponding tag outside of the chunk. The ''postchunk'' module removes the chunk ''lemma'' and tags, and leaves the output as a sequence of lexical units.
 
Changes made on the chunks in the interchunk module, will be applied to the contents of the chunk: tags containing a number will be substituted for the value of the corresponding tag outside of the chunk. The ''postchunk'' module removes the chunk ''lemma'' and tags, and leaves the output as a sequence of lexical units.
 
In the ''Maria is een dokter'' example, {{tag|UNDET}} changed to {{tag|PDET}}. This is an indicator to the postchunk module that this ''may'' be a chunk which takes a definite article in Spanish (in this particular case, it's not).
 
   
 
Postchunk operates on a single chunk at a time. In addition to the <tt>clip</tt> elements which refer to individual words contained in the chunk, there is also a <tt>clip</tt> numbered 0 (zero), which allows us to access information from the chunk lemma, which can be used to take information from "outside" the chunk (changed in interchunk) to the words inside. Also, because the number of words in a chunk may vary, there is an element, <tt>lu-count</tt>, which allows us to test how many words the chunk contains, and act accordingly.
 
Postchunk operates on a single chunk at a time. In addition to the <tt>clip</tt> elements which refer to individual words contained in the chunk, there is also a <tt>clip</tt> numbered 0 (zero), which allows us to access information from the chunk lemma, which can be used to take information from "outside" the chunk (changed in interchunk) to the words inside. Also, because the number of words in a chunk may vary, there is an element, <tt>lu-count</tt>, which allows us to test how many words the chunk contains, and act accordingly.
Line 107: Line 109:
 
==Practice==
 
==Practice==
   
For the practice section, we are going to look at how a transfer is performed in three stages by the Apertium Spanish&mdash;Italian pair, <code>apertium-es-it</code>, and then describe a transfer rule in terms of three or more levels. So change directory to <code>apertium-es-it</code> and make sure the pair is compiled.
+
For the practice section, we are going to look at how a transfer is performed in three stages by the Apertium Tatar&mdash;Kyrgyz pair, <code>apertium-tt-ky</code>, and then describe a transfer rule in terms of three or more levels. So change directory to <code>apertium-tt-ky</code> and make sure the pair is compiled.
   
 
===Looking at three-stage transfer===
 
===Looking at three-stage transfer===
   
We're going to translate the sentence ''Los zapatos nuevos son demasiado pequeños.'' from Spanish to Italian and follow the translation process through the three levels.
+
We're going to translate the sentence ''Әхмәт тиз генә иске зур бер агачка йөгерә.'' from Tatar to Kyrgyz and follow the translation process through the three levels.
   
 
====Input====
 
====Input====
   
  +
Because we don't yet have a full translator for Tatar and Kyrgyz, we're going to use some preprepared input from the Tatar and Bashkir pair.
First we morphologically analyse and tag the text:
 
   
 
<pre>
 
<pre>
  +
$ cat input
$ echo "Los zapatos nuevos son demasiado pequeños" | apertium -d . es-it-tagger
 
^El<det><def><m><pl>$ ^zapato<n><m><pl>$ ^nuevo<adj><m><pl>$ ^ser<vblex><pri><p3><pl>$
+
^Әхмәт<np><ant><m><nom>$ ^тиз<adv>$ ^гына<postadv>$ ^иске<adj>$ ^зур<adj>$ ^бер<det><ind>$
^demasiado<adv>$ ^pequeño<adj><m><pl>$^.<sent>$
+
^агач<n><dat>$ ^йөгер<v><iv><pres><p3><sg>$^..<sent>$
 
</pre>
 
</pre>
   
 
====Chunker====
 
====Chunker====
   
Then output of the part-of-speech tagger is passed to the first level of transfer:
+
The output of the part-of-speech tagger is passed to the lexical transfer, and then the first level of transfer:
   
 
<pre>
 
<pre>
$ echo "Los zapatos nuevos son demasiado pequeños" | apertium -d . es-it-chunker
+
$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin
  +
^Det_nom_adj<SN><f><pl>{^il<det><def><2><3>$ ^scarpa<n><2><3>$ ^nuovo<adj><2><3>$}$
 
^verb<SV><vbser><pri><p3><pl>{^essere<vbser><pri><p3><5>$}$
+
^ant<SN>{^Акмат<np><ant><m><nom>$}$ ^adv<ADV>{^катуу<adv>$ ^гана<postadv>$}$
^adv_adj<SA><m><pl>{^troppo<adv>$ ^piccolo<adj><2><3>$}$^punt<sent>{^.<sent>$}$
+
^a_a_d_n<SN><dat>{^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><2>$}$
  +
^чурка<V>{^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$}$^sent<SENT>{^..<sent>$}$
 
</pre>
 
</pre>
   
There are three rules applied by the first-level transfer module:
+
There are four rules applied by the first-level transfer module:
   
* <code>REGLA: DET ADJ NOM</code>: This rule matches a determiner, followed by an adjective followed by a noun. It creates a new nominal chunk and sets the gender and number of the chunk to be those of the head noun. The tags inside the chunk for gender and number are replaced by pointers to the tags in the chunk.
+
* <code>ПРАВИЛО: NP-ANT</code>: This rule matches an anthroponym (person's first name). It creates a new nominal {{tag|SN}} chunk.
  +
* <code>ПРАВИЛО: ADV POSTADV</code>: This rule matches a sequence of adverb and postadverb, it outputs an adverbial chunk {{tag|ADV}} containing the two lexical units.
* <code>REGLA: VERB</code>: This is the default verb rule, it matches any verb, and performs some local changes. For example, changing the future subjunctive to the imperfect subjunctive (in this case not applicable). It outputs the verb, verb type and other information in the chunk, and links the number to the chunk for possible future concordance operations.
 
* <code>REGLA: ADV ADJ</code>: This rule matches an adverb followed by an adjective. It outputs an adjective chunk and links the gender and number of the adjective to the chunk.
+
* <code>ПРАВИЛО: ADJ ADJ DET NOM</code>: This rule matches a sequence of two adjectives, a determiner and a noun. These are put inside a nominal chunk {{tag|SN}} and the case of the chunk is set to the case of the noun. A pointer {{tag|2}} is put on the noun so that when the case of the chunk is changed, it will be propagated inside.
  +
* <code>ПРАВИЛО: V-PRES</code>: This is the default present tense verb rule, it matches any verb in the present tense. It currently changes the synthetic present in Tatar into a progressive present tense with an auxiliary verb in Kyrgyz. This is because the Kyrgyz cognate to the Tatar present means either "future" or "habitual/general". This Tatar form is "habitual/general" and "present progressive". When translating the "present progressive" reading of the Tatar "present", then we need to translate to a different form in Kyrgyz, namely the participle + ''жат'' auxiliary.
   
Note that after the first stage of transfer there is an agreement error between the subject and the predicate. The gender of the subject ''Los zapatos nuevos'' has been changed from masculine to feminine, but that of the predicate ''demasiado pequeños'' has not.
+
Note that after the first stage of transfer there are a couple of problems. The tense is correct, but the case of the noun is wrong, and the adverbial is in the wrong place. In Kyrgyz it should come before the verb.
   
 
====Interchunk====
 
====Interchunk====
   
 
<pre>
 
<pre>
$ echo "Los zapatos nuevos son demasiado pequeños" | apertium -d . es-it-interchunk
+
$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
  +
apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin
^Det_nom_adj<SN><f><pl>{^il<det><def><2><3>$ ^scarpa<n><2><3>$ ^nuovo<adj><2><3>$}$
 
  +
^verb<SV><vbser><pri><p3><pl>{^essere<vbser><pri><p3><5>$}$
 
^adv_adj<SA><f><pl>{^troppo<adv>$ ^piccolo<adj><2><3>$}$^punt<sent>{^.<sent>$}$
+
^ant<SN>{^Акмат<np><ant><m><nom>$}$ ^a_a_d_n<SN><acc>{^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><2>$}$
  +
^post<POST>{^көздөй<post>$}$ ^adv<ADV>{^катуу<adv>$ ^гана<postadv>$}$
  +
^чурка<V>{^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$}$^sent<SENT>{^..<sent>$}$
 
</pre>
 
</pre>
   
 
One rule is applied in the interchunk module:
 
One rule is applied in the interchunk module:
   
* <code>REGLA: SN ser SA</code>: The rule matches a nominal chunk (<code>SN</code>) followed by the verb ''essere'' and an adjective chunk (<code>SA</code>). It contains a macro <code>concorda_SN_SA</code> which checks to see if the adjective chunk agrees in gender and number with the nominal chunk.
+
* <code>ПРАВИЛО: ADV SN V</code>: The rule matches an adverbial chunk (<code>ADV</code>) followed by a nominal chunk (<code>SN</code>) and then a verbal chunk (<code>V</code>). It contains a call to a macro <code>conv_arg1</code> which adjusts the case of the nominal chunk, and outputs a postposition depending on the lemma of the verbal chunk. It also switches the position of the nominal chunk and the adverbial chunk, placing the adverbial before the verb.
   
We can see that in the output of interchunk, the adjective gender has been changed to that of the nominal chunk.
+
We can see that in the output of interchunk, the adverbial has been moved and the nominal chunk is in the correct case with a postposition.
   
 
====Postchunk====
 
====Postchunk====
   
The final module of transfer takes the chunks output by the interchunk module, and replaces the linked tags ({{tag|2}}, {{tag|3}}, etc.) with their values from the chunk.
+
The final module of transfer takes the chunks output by the interchunk module, and replaces the linked tag (e.g. {{tag|2}}) with its value from the chunk.
   
 
<pre>
 
<pre>
$ echo "Los zapatos nuevos son demasiado pequeños" | apertium -d . es-it-postchunk
+
$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
  +
apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin | apertium-postchunk apertium-tt-ky.tt-ky.t3x tt-ky.t3x.bin
^Il<det><def><f><pl>$ ^scarpa<n><f><pl>$ ^nuovo<adj><f><pl>$ ^essere<vbser><pri><p3><pl>$
 
  +
^troppo<adv>$ ^piccolo<adj><f><pl>$^.<sent>$
 
  +
^Акмат<np><ant><m><nom>$ ^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><acc>$ ^көздөй<post>$ ^катуу<adv>$
  +
^гана<postadv>$ ^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$^..<sent>$
  +
 
</pre>
 
</pre>
   
Now the sentence is ready to be morphologically generated.
+
Now the sentence is ready to be morphologically generated. The file <code>tr-ky.autogen.hfst</code> can be copied from the <code>apertium-tr-ky</code> pair in <code>trunk/</code>.
   
 
====Output====
 
====Output====
   
 
<pre>
 
<pre>
$ echo "Los zapatos nuevos son demasiado pequeños" | apertium -d . es-it
+
$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
  +
apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin | apertium-postchunk apertium-tt-ky.tt-ky.t3x tt-ky.t3x.bin |\
Le scarpe nuove sono troppo piccole
 
  +
hfst-proc -g tr-ky.autogen.hfst
  +
  +
Акмат эски чоң бир даракты көздөй катуу гана чуркап бара жатат.
 
</pre>
 
</pre>
   
Line 184: Line 196:
 
* Ginestí i Rosell, M. (ed.) (2007) [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf Documentation of the Open-Source Shallow-Transfer Machine Translation Platform ''Apertium'']
 
* Ginestí i Rosell, M. (ed.) (2007) [https://wiki.apertium.org/w/images/d/d0/Apertium2-documentation.pdf Documentation of the Open-Source Shallow-Transfer Machine Translation Platform ''Apertium'']
   
{{Sessions}}
 
   
[[Category:Session 6|*]]
+
[[Category:Машинный перевод для языков России|Session 6]]

Latest revision as of 12:00, 31 January 2012

While the basic structural transfer described in session 5 is enough to deal with most of the frequent divergences between closely related languages (for example Bashkir and Tatar, or Finnish and Kven), languages with more divergent morphology and syntax need a more powerful structural transfer module. This session describes Apertium's transfer system with three or more levels, which was designed to make longer patterns and more divergent languages easier to handle.

Theory

Chunking-based transfer

The typical implementation of the chunking-based transfer consists of three modules: a chunker, an interchunk and a postchunk. This design can be extended to contain two or more interchunk modules if needed.

Chunker

The idea of the chunker is to extend the existing transfer rules so that sequences of lexical units can be grouped. These groups are called chunks. A typical chunker rule groups nominals, performs concordance, inserts or deletes words, and does local reordering, for example:

Input pattern || Example || Output chunk || Example
nom || ҫурт || SN{nom} || дом
adj nom || хитре ҫурт || SN{adj nom} || красивый дом
nom || ҫуртӑм || SN{det nom} || мой дом
det nom || манăн ҫурт || SN{det nom} || мой дом
det nom || манăн ҫуртӑм || SN{det nom} || мой дом
num nom || икĕ ҫурт || SN{num nom} || два дома
num nom || пилĕ ҫурт || SN{num nom} || пять домов
adj nom || хитре ҫуртсем || SN{adj nom} || красивые дома
adv adj nom || питĕ хитре ҫурт || SN{adv adj nom} || очень красивый дом
num adv adj nom || пилĕ питĕ хитре ҫурт || SN{num adv adj nom} || пять очень красивых домов

Where nom = noun, adj = adjective, adv = adverb, num = numeral, det = determiner, SN = noun phrase.
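The grouping shown in the table can be sketched in a few lines of Python. This is only an illustration, not how the chunker is implemented: the `PATTERNS` table and the `chunk` function are hypothetical names, the patterns are copied from the table above, and we assume that the longest pattern starting at the current word wins.

```python
# Hypothetical pattern table: a sequence of part-of-speech tags maps to a chunk type.
# The patterns are taken from the table above; "verb" is added for the second example.
PATTERNS = {
    ("nom",): "SN",
    ("adj", "nom"): "SN",
    ("det", "nom"): "SN",
    ("num", "nom"): "SN",
    ("adv", "adj", "nom"): "SN",
    ("num", "adv", "adj", "nom"): "SN",
    ("verb",): "V",
}
MAX_LEN = max(len(pattern) for pattern in PATTERNS)


def chunk(tags):
    """Group a list of POS tags into chunks, trying the longest pattern first."""
    out = []
    i = 0
    while i < len(tags):
        for n in range(min(MAX_LEN, len(tags) - i), 0, -1):
            window = tuple(tags[i:i + n])
            if window in PATTERNS:
                out.append(PATTERNS[window] + "{" + " ".join(window) + "}")
                i += n
                break
        else:
            # No pattern starts at this word: pass it through unchunked.
            out.append(tags[i])
            i += 1
    return out


print(chunk(["num", "adv", "adj", "nom", "verb"]))
```

With this table, the five-word input is grouped into one nominal chunk and one verbal chunk, so a later rule only has to deal with two units.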

The same process also works for verb chunks:

Input pattern || Example || Output chunk || Example
verb || вулать || V{verb} || читает
verb || вуламасть || V{neg_adv verb} || не читает
verb || вуларĕ || V{verb} || читал
verb || вуламĕ || V{aux inf} || будет читать
verb || вуламарĕ || V{neg_adv aux inf} || не будет читать
verb || вуласшăн || V{aux part inf} || хотел бы читать
verb || вулӑттӑм || V{verb part} || читал бы
adv verb || ан вула ! || V{adv verb} || не читай !
ger verb || вулама пуçлать || V{verb inf} || начинает читать

Thus, if we want to concord a noun phrase with a main verb, we can just use one rule (for SN V) in the second module of the transfer (the interchunk) instead of having separate rules for nom verb, adj nom verb, det adj nom verb, etc.

An important thing to remember is that chunks cannot be nested (i.e. a chunk may not contain another chunk). In some circumstances, and with some effort they can be merged at the interchunk stage — for example to join together one or more coordinated noun phrases, but not nested.

It should be noted that lexical forms are translated into the target language in this first module; the subsequent modules only work with lexical forms in the target language.

Interchunk

Once these chunks are made, the next module interchunk allows operations to be made between chunks as if they were lexical units in themselves: chunks are used as a level of abstraction, so that equivalent words and phrases can be translated using the same rules.

As well as gender concordance and word reordering, this allows person 'detection', for example to make a past-tense verb in Chuvash agree with the pronoun in the sentence. Russian past-tense verbs are not marked for person, so the person of the Chuvash verb has to be taken from elsewhere in the sentence. For the Russian sentence Я вчера читала, the chunker would give the following output:

^pron<SN><p1><mf><sg><nom>{^Эпĕ<prn><pers><2><3><4><5>$}$ 
^adv<ADV>{^ĕнер<adv>$}$
^verb<SV><imperf><tv><evid><PD><f><sg>{^вула<v><3><4><5><7>$}$ 

The format of chunks is much like that of lexical units: ^ indicates the start and $ the end. The difference is that a chunk may contain other lexical units within { and }.

The lexical units inside the chunk (between the { and } signs) cannot be accessed or modified in the interchunk; here you can only access or modify elements from the description of the chunk, which is the part after ^ and before the first {. The description of the chunk contains the lemma of the chunk (like pron in the previous example) and the morphological tags of the chunk (which for pron are <SN><p1><mf><sg><nom>).
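To make the parts of a chunk concrete, here is a small Python sketch that splits a chunk into its lemma, its tags and the lexical units inside it. The function name `parse_chunk` and the regular expressions are our own, written only for the examples on this page; they do not handle escaped characters or other details of the real stream format.

```python
import re

# Chunk: ^lemma<tag>...{ lexical units }$
CHUNK_RE = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$")
# Lexical unit inside the braces: ^lemma<tag>...$
UNIT_RE = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")
TAG_RE = re.compile(r"<([^>]+)>")


def parse_chunk(text):
    """Split a chunk into its description (lemma, tags) and its lexical units."""
    match = CHUNK_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a chunk: " + text)
    lemma, tags, body = match.groups()
    units = [(u.group(1), TAG_RE.findall(u.group(2))) for u in UNIT_RE.finditer(body)]
    return {"lemma": lemma, "tags": TAG_RE.findall(tags), "units": units}


parsed = parse_chunk("^pron<SN><p1><mf><sg><nom>{^Эпĕ<prn><pers><2><3><4><5>$}$")
print(parsed["lemma"], parsed["tags"])
print(parsed["units"])
```

The `lemma` and `tags` fields correspond to the description of the chunk, which is all that the interchunk module can see; the `units` list is the part it cannot touch.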

These tags can be linked with the lexical forms inside the chunk. This is the reason for the numbers <5> and <7> in the lexical form of the verb chunk: they point to the fifth and seventh tags of the chunk (<PD> and <sg>) and will be replaced by the values of those tags in the postchunk module.

Interchunk has a rule for 'nominal chunk' 'adv' 'verb chunk', which copies the person from the first nominal chunk to the verb chunk, replacing the 'PD' tag; in this example, giving it the <p1> (first person) value:

^pron<SN><p1><mf><sg><nom>{^Эпĕ<prn><pers><2><3><4><5>$}$ 
^adv<ADV>{^ĕнер<adv>$}$
^verb<SV><imperf><tv><evid><p1><f><sg>{^вула<v><3><4><5><7>$}$ 

The postchunk module will assign this tag to the verb inside the chunk.
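The effect of this interchunk rule can be imitated with a short Python sketch. It works only on chunk descriptions, written here as (lemma, tags) pairs, because that is all interchunk can see. The function `copy_person` and the set `PERSON_TAGS` are hypothetical names; a real rule is written in the interchunk rule file, not in Python.

```python
PERSON_TAGS = {"p1", "p2", "p3"}


def copy_person(chunks):
    """Copy the person tag of an SN chunk onto a following SV chunk marked <PD>.

    `chunks` is a list of (lemma, tags) chunk descriptions. The pattern matched
    is SN ADV SV, as in the example above. The input list is not modified.
    """
    result = [(lemma, list(tags)) for lemma, tags in chunks]
    for i in range(len(result) - 2):
        (_, sn), (_, adv), (_, verb) = result[i:i + 3]
        if sn[:1] == ["SN"] and adv[:1] == ["ADV"] and verb[:1] == ["SV"] and "PD" in verb:
            person = next((tag for tag in sn if tag in PERSON_TAGS), None)
            if person is not None:
                verb[verb.index("PD")] = person
    return result


sentence = [
    ("pron", ["SN", "p1", "mf", "sg", "nom"]),
    ("adv", ["ADV"]),
    ("verb", ["SV", "imperf", "tv", "evid", "PD", "f", "sg"]),
]
for lemma, tags in copy_person(sentence):
    print("^" + lemma + "".join("<" + tag + ">" for tag in tags))
```

Only the description of the verb chunk changes; the lexical unit inside it still carries the pointer <5> until postchunk runs.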

Postchunk

Postchunk allows us to take the output of interchunk, and once again operate on the contents.

Changes made to the chunks in the interchunk module are applied to the contents of the chunk: tags containing a number are replaced by the value of the corresponding tag outside the chunk. The postchunk module removes the chunk lemma and tags, and leaves the output as a sequence of lexical units.
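The substitution step can be sketched in Python as follows. The function `resolve_links` is a hypothetical helper written for this page: it takes the tags of the chunk and the lexical units inside it, replaces every numeric tag by the chunk tag at that position (counting from 1), and returns the bare lexical units. It assumes every pointer refers to an existing chunk tag.

```python
def resolve_links(chunk_tags, units):
    """Replace numbered tags in the lexical units by the chunk tag they point to.

    `chunk_tags` is the list of tags in the chunk description and `units` is a
    list of (lemma, tags) pairs. Tag <n> points to the n-th chunk tag.
    """
    resolved = []
    for lemma, tags in units:
        new_tags = [chunk_tags[int(tag) - 1] if tag.isdigit() else tag for tag in tags]
        resolved.append("^" + lemma + "".join("<" + tag + ">" for tag in new_tags) + "$")
    return " ".join(resolved)


# The verb chunk from the interchunk example, after <PD> was replaced by <p1>.
verb_tags = ["SV", "imperf", "tv", "evid", "p1", "f", "sg"]
print(resolve_links(verb_tags, [("вула", ["v", "3", "4", "5", "7"])]))
```

Because the lexical unit only stores pointers, a change made once on the chunk in interchunk reaches every word inside that refers to the same position.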

Postchunk operates on a single chunk at a time. In addition to the clip elements that refer to individual words contained in the chunk, there is also a clip numbered 0 (zero), which gives access to the chunk lemma, so that information from "outside" the chunk (changed in interchunk) can be carried over to the words inside. Also, because the number of words in a chunk may vary, there is an element, lu-count, which allows us to test how many words the chunk contains and act accordingly.

Practice

For the practice section, we are going to look at how a transfer is performed in three stages by the Apertium Tatar—Kyrgyz pair, apertium-tt-ky, and then describe a transfer rule in terms of three or more levels. So change directory to apertium-tt-ky and make sure the pair is compiled.

Looking at three-stage transfer

We're going to translate the sentence Әхмәт тиз генә иске зур бер агачка йөгерә. from Tatar to Kyrgyz and follow the translation process through the three levels.

Input

Because we don't yet have a full translator for Tatar and Kyrgyz, we're going to use some pre-prepared input from the Tatar and Bashkir pair.

$ cat input 
^Әхмәт<np><ant><m><nom>$ ^тиз<adv>$ ^гына<postadv>$ ^иске<adj>$ ^зур<adj>$ ^бер<det><ind>$ 
^агач<n><dat>$ ^йөгер<v><iv><pres><p3><sg>$^..<sent>$

Chunker

The output of the part-of-speech tagger is passed to the lexical transfer, and then the first level of transfer:

$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin 

^ant<SN>{^Акмат<np><ant><m><nom>$}$ ^adv<ADV>{^катуу<adv>$ ^гана<postadv>$}$ 
^a_a_d_n<SN><dat>{^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><2>$}$ 
^чурка<V>{^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$}$^sent<SENT>{^..<sent>$}$

There are four rules applied by the first-level transfer module:

  • ПРАВИЛО: NP-ANT: This rule matches an anthroponym (person's first name). It creates a new nominal <SN> chunk.
  • ПРАВИЛО: ADV POSTADV: This rule matches a sequence of adverb and postadverb. It outputs an adverbial chunk <ADV> containing the two lexical units.
  • ПРАВИЛО: ADJ ADJ DET NOM: This rule matches a sequence of two adjectives, a determiner and a noun. These are put inside a nominal chunk <SN> and the case of the chunk is set to the case of the noun. A pointer <2> is put on the noun so that when the case of the chunk is changed, it will be propagated inside.
  • ПРАВИЛО: V-PRES: This is the default present tense verb rule; it matches any verb in the present tense. It currently changes the synthetic present in Tatar into a progressive present with an auxiliary verb in Kyrgyz. The Tatar present covers both the "habitual/general" and the "present progressive" readings, whereas its Kyrgyz cognate means either "future" or "habitual/general". To translate the "present progressive" reading we therefore need a different form in Kyrgyz, namely the participle + жат auxiliary.
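What the V-PRES rule does to the verb can be sketched in Python. This is a simplified illustration based only on the chunker output shown above, not a copy of the rule in the .t1x file: the function name `expand_present` is hypothetical, and the auxiliary lemmas and tags are the ones that appear in that output.

```python
def expand_present(lemma, tags):
    """Turn a synthetic present into participle + бар + жат auxiliary.

    `lemma` is the already translated (Kyrgyz) verb lemma and `tags` its tag
    list, e.g. ["v", "iv", "pres", "p3", "sg"]. Verbs that are not in the
    present tense are returned unchanged.
    """
    if "pres" not in tags:
        return [(lemma, tags)]
    position = tags.index("pres")
    base = tags[:position]           # e.g. ["v", "iv"]
    agreement = tags[position + 1:]  # e.g. ["p3", "sg"], moved onto the auxiliary
    return [
        (lemma, base + ["prt_perf"]),
        ("бар", ["v", "iv", "prt_impf"]),
        ("жат", ["vaux", "aor"] + agreement),
    ]


for unit_lemma, unit_tags in expand_present("чурка", ["v", "iv", "pres", "p3", "sg"]):
    print("^" + unit_lemma + "".join("<" + tag + ">" for tag in unit_tags) + "$")
```

The three lexical units printed here are the ones found inside the verbal chunk in the chunker output: person and number end up on the жат auxiliary rather than on the main verb.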

Note that after the first stage of transfer there are a couple of problems. The tense is correct, but the case of the noun is wrong, and the adverbial is in the wrong place. In Kyrgyz it should come before the verb.

Interchunk

$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
   apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin 

^ant<SN>{^Акмат<np><ant><m><nom>$}$ ^a_a_d_n<SN><acc>{^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><2>$}$ 
^post<POST>{^көздөй<post>$}$ ^adv<ADV>{^катуу<adv>$ ^гана<postadv>$}$ 
^чурка<V>{^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$}$^sent<SENT>{^..<sent>$}$

One rule is applied in the interchunk module:

  • ПРАВИЛО: ADV SN V: The rule matches an adverbial chunk (ADV) followed by a nominal chunk (SN) and then a verbal chunk (V). It contains a call to a macro conv_arg1 which adjusts the case of the nominal chunk, and outputs a postposition depending on the lemma of the verbal chunk. It also switches the position of the nominal chunk and the adverbial chunk, placing the adverbial before the verb.

We can see that in the output of interchunk, the adverbial has been moved and the nominal chunk is in the correct case with a postposition.
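The reordering performed by this rule can be sketched in Python. The sketch is an assumption about what the rule amounts to, judged only from the input and output shown above: the names `ARG_FRAMES` and `reorder_adv_sn_v` are hypothetical, and the lookup table stands in for whatever the macro conv_arg1 actually does.

```python
# Hypothetical lookup: verb chunk lemma -> (case for the argument, postposition).
ARG_FRAMES = {"чурка": ("acc", "көздөй")}


def reorder_adv_sn_v(chunks):
    """Rewrite ADV SN V as SN POST ADV V, adjusting the case of the SN chunk.

    Each chunk is a dict with "lemma", "tags" and "body" (the untouched
    contents between { and }). Chunks that do not match are copied as they are.
    """
    out = []
    i = 0
    while i < len(chunks):
        window = chunks[i:i + 3]
        kinds = [chunk["tags"][0] for chunk in window]
        if kinds == ["ADV", "SN", "V"] and window[2]["lemma"] in ARG_FRAMES:
            adv, sn, verb = window
            case, post = ARG_FRAMES[verb["lemma"]]
            new_sn = dict(sn, tags=[sn["tags"][0], case])
            post_chunk = {"lemma": "post", "tags": ["POST"], "body": "^" + post + "<post>$"}
            out.extend([new_sn, post_chunk, adv, verb])
            i += 3
        else:
            out.append(chunks[i])
            i += 1
    return out


sentence = [
    {"lemma": "ant", "tags": ["SN"], "body": "^Акмат<np><ant><m><nom>$"},
    {"lemma": "adv", "tags": ["ADV"], "body": "^катуу<adv>$ ^гана<postadv>$"},
    {"lemma": "a_a_d_n", "tags": ["SN", "dat"], "body": "... ^дарак<n><2>$"},
    {"lemma": "чурка", "tags": ["V"], "body": "..."},
    {"lemma": "sent", "tags": ["SENT"], "body": "^..<sent>$"},
]
print([chunk["lemma"] for chunk in reorder_adv_sn_v(sentence)])
```

Note that the body of the nominal chunk is never touched: only the case tag in its description changes, and the pointer <2> on the noun picks up the new value later in postchunk.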

Postchunk

The final module of transfer takes the chunks output by the interchunk module, and replaces the linked tag (e.g. <2>) with its value from the chunk.

$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
  apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin | apertium-postchunk apertium-tt-ky.tt-ky.t3x tt-ky.t3x.bin 

^Акмат<np><ant><m><nom>$ ^эски<adj><pst>$ ^чоң<adj><pst>$ ^бир<det><ind>$ ^дарак<n><acc>$ ^көздөй<post>$ ^катуу<adv>$ 
^гана<postadv>$ ^чурка<v><iv><prt_perf>$ ^бар<v><iv><prt_impf>$ ^жат<vaux><aor><p3><sg>$^..<sent>$

Now the sentence is ready to be morphologically generated. The file tr-ky.autogen.hfst can be copied from the apertium-tr-ky pair in trunk/.

Output

$ cat input | lt-proc -b tt-ky.autobil.bin | apertium-transfer -b apertium-tt-ky.tt-ky.t1x tt-ky.t1x.bin |\
   apertium-interchunk apertium-tt-ky.tt-ky.t2x tt-ky.t2x.bin | apertium-postchunk apertium-tt-ky.tt-ky.t3x tt-ky.t3x.bin |\
   hfst-proc -g tr-ky.autogen.hfst 

Акмат эски чоң бир даракты көздөй катуу гана чуркап бара жатат.

Describing a multi-stage transfer rule

The important thing to work out when writing a multi-stage transfer rule is how to split the rule between the different parts of transfer. For example, local reorderings (at the level of 1 to 5 words) should probably be done in the first stage. The chunks should be in some way thematic, so, for example, finite verbs should probably not be chunked with adjectives or nominals.

Further reading