A long introduction to transfer rules


Writing transfer rules seems to be tricky. People generally understand the basic concepts, but they struggle with the formalism. We think the formalism isn't that bad. And compared to many other formalisms,[1] it's fairly straightforward. Maybe one of the reasons people struggle is that we mix declarative and procedural programming. Could be.

Some formalities

Before explaining what we can do, it is important to give some idea of what we can't. If you come at rule-writing expecting something else, it's likely to be confusing.

  • There are no recursive rules. Rules match fixed-length patterns. There is no optionality at the level of words. There is no way of saying one-or-more; it's just one.
  • Apertium's rules are very tied to the Apertium stream format. If you don't understand the stream format, it will be a lot more difficult to understand the rules.
  • Rules contain both declarative parts and procedural parts. You can't just say what you want or how you want to do it: you need to do both, in different places (though it's quite intuitive).
  • Patterns match only on the source side. Not on the target side.
  • The structural transfer has no access to the information in the target language morphological dictionary. This means that if the transfer needs some information about the available forms of a particular word, e.g. whether it is only singular or only plural, then this information needs to go in the bilingual dictionary.

Lexical transfer and structural transfer

See also: Bilingual dictionaries

At this point it's worth making sure we don't confuse the rôles of lexical transfer and structural transfer. There is a grey area between the two, but there are also big parts that don't overlap.

  • Lexical transfer:
    • Nearly always gives translations between words, not tags.
    • Can add or change tags, on a per-lemma basis.
    • Doesn't do reordering.
    • Can be used to give a heads-up to the structural transfer, drawing attention to missing features, or to features that cannot be decided on a no-context basis. For example:
      • the <ND> and <GD> tags which say "Hey, when I'm translating this word I don't know what the gender or number should be -- structural transfer! I need your help to find out", or
      • the <sint> tag which says "¡Ojo! if you're writing a transfer rule for adjectives, and it matches this adjective then you need to think about how you're going to handle the comparative and superlative forms"
  • Structural transfer:
    • Rarely gives translations between single words.
    • Often adds or changes tags on a per-category (groups of lemmas) basis.
    • Can change the order of words.

A rule-of-thumb is that if the rule applies to all words in a category, it probably wants to be treated in the structural transfer, and if it applies to just part of those words, then maybe it needs to be dealt with in the lexical transfer.
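
To make the heads-up idea concrete, here is a minimal sketch of a bilingual dictionary entry that leaves gender undetermined on the target side. The lemmas `slword` and `tlword` are placeholders, and the choice of the `f` tag is an assumption made only for illustration; a real entry would use the tags of the language pair in question.

```xml
<!-- Hypothetical bidix entry: the target side carries <GD> ("gender to be
     determined") instead of a concrete gender tag, so that a structural
     transfer rule knows it has to decide the gender from context. -->
<e><p><l>slword<s n="n"/><s n="f"/></l><r>tlword<s n="n"/><s n="GD"/></r></p></e>
```

A transfer rule matching this word would then be expected to test for `GD` on the target side and replace it with a real gender tag before output.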

The output of the lexical transfer looks like this:


 ^slword<sometag>/tlword<sometag>$  ^slword1<sometag><blah>/tlword3<sometag><foo>$ ^slword3<sometag><blah>/tlword2<sometag><GD>$ 

Where the output of the structural transfer would look like this:


 ^tlword<sometag>$ ^tlword3<sometag><foo>$ ^tlword2<sometag><GD>$ 

That is, when you are in the first structural transfer stage you have access to both the source and target sides of the translation. After the first structural transfer stage, you only have access to the target side.

Some preliminaries

  • Pattern:
  • Action:

Overview of a transfer file

It's hard to give a step-by-step overview of what a transfer file looks like because there are quite a lot of obligatory parts that need to go into even the most basic file. But it's important to get a general view before we go into the details. Here is an example in which I'm deliberately not going to use linguistic names for the different parts, to try to avoid assumptions.

<?xml version="1.0" encoding="utf-8"?>
<transfer>
  <section-def-cats>
    <def-cat n="some_word_category">
      <cat-item tags="mytag.*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="some_feature_of_a_word">
      <attr-item tags="myfeature"/>
      <attr-item tags="myotherfeature"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="blank"/>
  </section-def-vars>  
  <section-rules>
    <rule>
      <pattern>
        <pattern-item n="some_word_category"/>
      </pattern>
      <action>
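        <!-- Procedure: on the target side of word 1, set the tag that belongs to
             "some_feature_of_a_word" to "myotherfeature" -->
        <let>
          <clip pos="1" side="tl" part="some_feature_of_a_word"/>
          <lit-tag v="myotherfeature"/>
        </let>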
        <out>
          <lu><clip pos="1" side="tl" part="whole"/></lu>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>

I'll try to give a tag-by-tag account. The <transfer> and </transfer> tags don't do anything by themselves; they just enclose the rest of the sections.

Practical examples

Apertium 1

Input: Otišla si tiho i bez pozdrava
Output: You left quietly and without a word

Lexical transfer

We pass the phrase through our lexical transfer stage, and come up with the following output:

^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$ 
^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$ 
^tiho<adv>/quietly<adv>$ 
^i<cnjcoo>/and<cnjcoo>$ 
^bez<pr>/without<pr>$ 
^pozdrav<n><mi><sg><gen>/word<n><sg><gen>$

Put the output into a file called example1.txt; we're going to need it later.

At this point, if you haven't already, it's worth trying model zero, that is, no transfer rules. Try passing the output through the transfer file we made earlier:

$ cat /tmp/example1.txt | apertium-transfer -b transfer.t1x transfer.bin
^leave<vblex><lp><f><sg>$
^be<vbser><clt><pres><p2><sg>$ 
^quietly<adv>$ 
^and<cnjcoo>$ 
^without<pr>$ 
^word<n><sg><gen>$

Then try and pass it through a generator of English:

$ cat /tmp/example1.txt | apertium-transfer -b transfer.t1x transfer.bin | lt-proc -g sh-en.autogen.bin
#leave 
#be 
quietly 
and 
without 
#word

This is obviously inadequate, but don't worry, we're going to use the structural transfer module to make it adequate!

Thinking it through

Let's think about what changes we need to make in order to convert this into an adequate form for target language generation. NB: If we want to change information, it's a procedure; if we want to output it or not, it's a declaration.

Procedures

  1. If the source language tag is <lp>, change the target language tag to <past>

Declarations

  1. Output a subject pronoun which takes its person and number information from the main verb.
  2. Output the main verb with information on category and tense (but not gender and number).
  3. Do not output the auxiliary verb (biti, "be").
  4. Output nouns with category and number (but not case).
  5. Words should be output encapsulated in ^ and $
  6. Tags should be output encapsulated in < and >

Work order

So, what order do we do these in? It doesn't really matter. An experienced developer would probably do it in two stages but, for pedagogical purposes, we're going to split it up into five stages:

  • First we're going to write a rule which matches the lp and auxiliary construction, and outputs only the main verb (declarations: 2, 3, 5)
    • Define the categories of "lp" and "auxiliary"
  • Second we're going to edit that rule to change the target language tag from <lp> to <past> (procedure: 1)
    • Define the attribute of "tense"
  • Third we're going to edit the same rule to not output gender and number (declaration: 2)
    • Define the attribute of "verb_type"
  • Fourth we're going to edit the same rule to output a subject pronoun before the verb (declarations: 1, 5, 6)
    • Define the attributes of "person" and "number"
  • Fifth we're going to write a new rule which matches the noun construction and outputs only category and number (declarations: 4, 5)
    • Define the category of "noun"
    • Define the attribute of "noun_type"

Cheatsheet

Here is what the input and output of each of the changes above will look like.

Change 1
  Input:  ^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$
          ^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$
  Output: ^leave<vblex><lp><f><sg>$

Change 2
  Input:  ^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$
          ^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$
  Output: ^leave<vblex><past><f><sg>$

Change 3
  Input:  ^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$
          ^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$
  Output: ^leave<vblex><past>$

Change 4
  Input:  ^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$
          ^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$
  Output: ^prpers<prn><subj><p2><mf><sg>$
          ^leave<vblex><past>$

Change 5
  Input:  ^pozdrav<n><mi><sg><gen>/word<n><sg><gen>$
  Output: ^word<n><sg>$

Implementation

Step 1

We're working on this part of the lexical transfer output:

^otići<vblex><perf><iv><lp><f><sg>/leave<vblex><lp><f><sg>$
^biti<vbser><clt><pres><p2><sg>/be<vbser><clt><pres><p2><sg>$ 

1. Make our categories:

  <section-def-cats>
    <def-cat n="lp">
      <cat-item tags="vblex.*.*.lp.*"/>
    </def-cat>
    <def-cat n="biti-clt">
      <cat-item tags="vbser.clt.pres.*"/>
    </def-cat>
  </section-def-cats>

Why do we need .*.* ? -- Because of how the matching system in categories works. In the middle of tag sequences, a * is counted as a single tag. At the end, it is counted as any sequence of tags. So, we have <vblex> followed by any tag, followed by any tag, followed by <lp> followed by any sequence of tags.
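
Because a * in the middle stands for exactly one tag, a verb with a different number of tags between <vblex> and <lp> would not match the item above. A minimal sketch of how that could be covered is given below; the second `cat-item` is a hypothetical addition for illustration and is not needed for the example sentence.

```xml
<def-cat n="lp">
  <!-- Exactly two tags between vblex and lp,
       e.g. <vblex><perf><iv><lp><f><sg> from the example -->
  <cat-item tags="vblex.*.*.lp.*"/>
  <!-- Hypothetical: exactly one tag between vblex and lp,
       e.g. an analysis shaped like <vblex><perf><lp><f><sg> -->
  <cat-item tags="vblex.*.lp.*"/>
</def-cat>
```

A category matches if any one of its `cat-item` entries matches, so listing several tag shapes under one name keeps the rule patterns themselves short.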

2. Edit the example rule and replace the pattern.

      <pattern>
        <pattern-item n="lp"/>
        <pattern-item n="biti-clt"/>
      </pattern>

3. Save and compile the rule file.


$ apertium-preprocess-transfer rules.t1x rules.bin

Now test it:


$ cat example1.txt | apertium-transfer -b rules.t1x rules.bin

^leave<vblex><lp><f><sg>$ 

Great!

Step 2
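
What follows is a minimal sketch of how this step could look, based on the work order and the cheatsheet above rather than on a definitive implementation. It assumes the rule from Step 1, with the main verb at position 1. The attribute name "tense" comes from the work order; listing both `lp` and `past` in it is an assumption, made so that the attribute still finds the tag after it has been changed.

```xml
<!-- Fragment for <section-def-attrs>: the tags that count as "tense" -->
<def-attr n="tense">
  <attr-item tags="lp"/>
  <attr-item tags="past"/>
</def-attr>

<!-- Fragment replacing the <action> of the Step 1 rule -->
<action>
  <choose>
    <when>
      <!-- Procedure 1: if the source language tense tag is <lp> ... -->
      <test>
        <equal>
          <clip pos="1" side="sl" part="tense"/>
          <lit-tag v="lp"/>
        </equal>
      </test>
      <!-- ... change the target language tense tag to <past> -->
      <let>
        <clip pos="1" side="tl" part="tense"/>
        <lit-tag v="past"/>
      </let>
    </when>
  </choose>
  <out>
    <!-- Still output the whole target word; the auxiliary (position 2) is dropped -->
    <lu><clip pos="1" side="tl" part="whole"/></lu>
  </out>
</action>
```

After recompiling, the expected result is the output listed for change 2 in the cheatsheet.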

Now test it:

Step 3
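
A minimal sketch of this step, again derived from the work order and cheatsheet. It assumes a "tense" attribute covering `lp` and `past` is already defined and that the rule already sets the target tense to <past>. Which tags to list under "verb_type" is an assumption; only `vblex` is needed for the example.

```xml
<!-- Fragment for <section-def-attrs>: the tags that count as "verb_type" -->
<def-attr n="verb_type">
  <attr-item tags="vblex"/>
  <attr-item tags="vbser"/>
</def-attr>

<!-- Fragment replacing the <out> of the rule: instead of part="whole",
     output only the pieces we want, which leaves out gender and number -->
<out>
  <lu>
    <clip pos="1" side="tl" part="lem"/>        <!-- lemma: leave -->
    <clip pos="1" side="tl" part="verb_type"/>  <!-- <vblex> -->
    <clip pos="1" side="tl" part="tense"/>      <!-- <past> -->
  </lu>
</out>
```

The design choice here is declarative: nothing is deleted from the word, we simply do not mention gender and number in the output.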

Now test it:

Step 4
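
A minimal sketch of this step, based on the cheatsheet output for change 4. It assumes the "verb_type" and "tense" attributes from the work order are defined. In the example stream the <p2> tag appears only on the auxiliary, so this sketch clips person and number from position 2; that choice, and the fixed `mf` gender tag, are assumptions taken from the expected output.

```xml
<!-- Fragments for <section-def-attrs> -->
<def-attr n="person">
  <attr-item tags="p1"/>
  <attr-item tags="p2"/>
  <attr-item tags="p3"/>
</def-attr>
<def-attr n="number">
  <attr-item tags="sg"/>
  <attr-item tags="pl"/>
</def-attr>

<!-- Fragment replacing the <out> of the rule -->
<out>
  <!-- Each <lu> is wrapped in ^ and $ on output (declaration 5);
       each <lit-tag> is wrapped in < and > (declaration 6) -->
  <lu>
    <lit v="prpers"/>
    <lit-tag v="prn.subj"/>
    <clip pos="2" side="tl" part="person"/>
    <lit-tag v="mf"/>
    <clip pos="2" side="tl" part="number"/>
  </lu>
  <b/> <!-- a blank between the pronoun and the verb -->
  <lu>
    <clip pos="1" side="tl" part="lem"/>
    <clip pos="1" side="tl" part="verb_type"/>
    <clip pos="1" side="tl" part="tense"/>
  </lu>
</out>
```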

Now test it:

Step 5
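
A minimal sketch of the new noun rule, following the work order and the cheatsheet output for change 5. The category and attribute names come from the work order; their contents are assumptions sized to the example. The "number" attribute is repeated here so the fragment can be read on its own; in a real file it would be defined only once.

```xml
<!-- Fragment for <section-def-cats> -->
<def-cat n="noun">
  <cat-item tags="n.*"/>
</def-cat>

<!-- Fragments for <section-def-attrs> -->
<def-attr n="noun_type">
  <attr-item tags="n"/>
</def-attr>
<def-attr n="number">
  <attr-item tags="sg"/>
  <attr-item tags="pl"/>
</def-attr>

<!-- Fragment for <section-rules> -->
<rule>
  <pattern>
    <pattern-item n="noun"/>
  </pattern>
  <action>
    <out>
      <!-- Category and number only; case (<gen>) is simply not output -->
      <lu>
        <clip pos="1" side="tl" part="lem"/>
        <clip pos="1" side="tl" part="noun_type"/>
        <clip pos="1" side="tl" part="number"/>
      </lu>
    </out>
  </action>
</rule>
```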

Now test it:

Apertium 3

Resorni ministar je navlačio ljude, kaže sejte biljku zelenu i čudo će da bude

The minister of agriculture tricks the people, he says plant the green herb and there will be a miracle

Lexical transfer

Notes

  1. e.g. Matxin, OpenLogos, ...