Difference between revisions of "Chunking: A full example"


Revision as of 19:51, 28 September 2008

This will be a full example of chunking, which we build from the ground up.

We will look at Esperanto <-> English and try to translate the sentence "La libro estas blua" to "The book is blue".

Overview

First a little overview of how 3-stage transfer normally works:

  • Transfer stage: Words are translated using the bidix, categorized, and put into chunks (in the .t1x file). Here tags on the words can also be added, removed, or turned into 'pointers' that point to the tags of the enclosing chunk.
  • Interchunk stage: Chunks are reordered, combined, or split, and chunk tags are changed (in the .t2x file).
  • Postchunk stage: The words in the chunks are restored (in the .t3x file).


If we look at how "The blue book is good" goes through the system, we have just before transfer:

^The<det><def><sp>$ ^blue<adj>$ ^book<n><sg>$ ^be<vbser><pres><p3><sg>$ ^good<adj><sint>$

which is transferred to Esperanto and chunked into

^det_adj_nom<SN><sg><nom>{^La<det><def><2><3>$ ^blua<adj><2><3>$ ^libro<n><2><3>$}$ 
^ser<SV><pres><p3><sg>{^esti<vbser><pres>$}$ 
^adj<SN><sg>{^bona<adj><sg><nom>$}$

Here 'det_adj_nom' is the name of the chunk and <SN><sg><nom> are the chunk's tags. The content of the chunk is {^La<det><def><2><3>$ ^blua<adj><2><3>$ ^libro<n><2><3>$}, where <2> and <3> are pointers to the chunk's second and third tags (<sg> and <nom> respectively; the first tag is <SN>). This allows us to change the values at chunk level later on, if necessary.

In this simple case nothing happens at the interchunk stage. After the postchunk stage it looks like:

^La<det><def><sg><nom>$ ^blua<adj><sg><nom>$ ^libro<n><sg><nom>$ ^esti<vbser><pres>$ ^bona<adj><sg><nom>$

which becomes "La blua libro estas bona".
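
The pointer substitution described above can be imitated in a few lines of Python. This is only a sketch for illustration, not Apertium's postchunk implementation: the function name `resolve_pointers` and the regular expressions are assumptions made here, and the sketch handles a single chunk with no nested braces.

```python
import re

def resolve_pointers(chunk: str) -> str:
    """Replace numeric pointer tags such as <2> inside a chunk
    with the chunk tag found at that (1-based) position."""
    # Split "^name<tag1><tag2>{body}$" into name, chunk tags and body.
    match = re.fullmatch(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$", chunk)
    if match is None:
        raise ValueError("not a chunk: " + chunk)
    chunk_tags = re.findall(r"<([^>]+)>", match.group(2))

    def substitute(pointer: re.Match) -> str:
        index = int(pointer.group(1))          # pointers count from 1
        return "<" + chunk_tags[index - 1] + ">"

    # Only purely numeric tags are pointers; <p3> and similar are left alone.
    return re.sub(r"<(\d+)>", substitute, match.group(3))

chunk = ("^det_adj_nom<SN><sg><nom>"
         "{^La<det><def><2><3>$ ^blua<adj><2><3>$ ^libro<n><2><3>$}$")
print(resolve_pointers(chunk))
```

Changing <sg> to <pl> in the chunk's tags before calling the function changes all three words at once, which is the point of using pointers.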

Starting from the ground

Now we will try the same sentence Esperanto -> English, but with more or less empty t1x, t2x and t3x files.

^La<det><def><sp>$ ^blua<adj><sg><nom>$ ^libro<n><sg><nom>$ 
^esti<vbser><pres>$ 
^bona<adj><sg><nom>$

Without any rules the result will just be that each word gets a 'default' chunk:

^default{^The<det><def><sp>$}$ ^default{^blue<adj><sg><nom>$}$ ^default{^book<n><sg><nom>$}$  
^default{^be<vbser><pres>$}$  
^default{^good<adj><sint><sg><nom>$}$
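
The fallback shown above amounts to wrapping every translated word in its own chunk named 'default'. A minimal Python sketch of that behaviour follows; the helper name `wrap_in_default_chunks` is hypothetical, and the input is assumed to be words that have already been translated through the bidix.

```python
def wrap_in_default_chunks(lexical_units):
    """Put each translated lexical unit into its own 'default' chunk."""
    # Doubled braces in the f-string produce literal { and }.
    return " ".join(f"^default{{{unit}}}$" for unit in lexical_units)

translated = ["^The<det><def><sp>$", "^blue<adj><sg><nom>$", "^book<n><sg><nom>$"]
print(wrap_in_default_chunks(translated))
```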


Now let's define some rules. To begin with we will define some basic rules a la Apertium New Language Pair HOWTO#Transfer rules. We start with the determiner: first we define the category 'c_det', which contains all words marked as <det> or as <det> followed by other tags. Then we define attributes for number and case (nominative and accusative).

  <def-cat n="c_det">
     <cat-item tags="det"/>
     <cat-item tags="det.*"/>
  </def-cat>

...
  <def-attr n="a_nbr">
     <attr-item tags="sg"/>
     <attr-item tags="sp"/>
     <attr-item tags="pl"/>
  </def-attr>

  <def-attr n="a_det">
     <attr-item tags="det.def"/>
     <attr-item tags="det.ind"/>
     <attr-item tags="det.pos"/>
     <attr-item tags="det.qnt"/>
  </def-attr>
...
  <rule>
     <pattern>
       <pattern-item n="c_det"/>
     </pattern>
     <action>
       <out>
         <chunk name="det" case="caseFirstWord">
           <tags>
             <tag><lit-tag v="SN"/></tag>
             <tag><clip pos="1" side="tl" part="a_nbr"/></tag>
           </tags>
           <lu>
             <clip pos="1" side="tl" part="lem"/>
             <clip pos="1" side="tl" part="a_det"/>
             <clip pos="1" side="sl" part="a_nbr" link-to="2"/>
           </lu>
         </chunk>
       </out>
     </action>
  </rule>

This rule matches ^La<det><def><sp>$ (through the category 'c_det'), copies its number attribute to the chunk's tags, and replaces the number tag on the word with a pointer to it:

^det<SN><sp>{^The<det><def><2>$}$ 
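
To make the effect of the rule concrete, here is a rough Python imitation of what it does to one translated word. It is a sketch under stated assumptions, not how Apertium evaluates rules: the function name `chunk_determiner` is hypothetical, the `case="caseFirstWord"` handling is ignored, and a determiner without a number tag simply gets no number tag and no pointer.

```python
import re

# Mirrors of the def-attr lists in the listing above.
A_NBR = ("sg", "sp", "pl")
A_DET = ("det.def", "det.ind", "det.pos", "det.qnt")

def chunk_determiner(lexical_unit):
    """Return the 'det' chunk for a target-language determiner, or None
    if the word does not belong to the category c_det."""
    match = re.fullmatch(r"\^([^<]+)((?:<[^>]+>)+)\$", lexical_unit)
    if match is None:
        return None
    lemma = match.group(1)
    tags = re.findall(r"<([^>]+)>", match.group(2))
    if tags[0] != "det":                      # c_det: "det" or "det.*"
        return None
    dotted = "." + ".".join(tags) + "."
    # a_det: first attribute value found as a whole run of tags.
    a_det = next((a for a in A_DET if "." + a + "." in dotted), "")
    det_tags = "".join(f"<{t}>" for t in a_det.split(".") if t)
    # a_nbr: first number tag on the word, if any.
    number = next((t for t in tags if t in A_NBR), None)
    if number is None:                        # simplification, see lead-in
        return f"^det<SN>{{^{lemma}{det_tags}$}}$"
    # The number goes into chunk tag 2; the word keeps a pointer <2>.
    return f"^det<SN><{number}>{{^{lemma}{det_tags}<2>$}}$"

print(chunk_determiner("^The<det><def><sp>$"))
```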

http://wiki.apertium.org/wiki/Apertium_New_Language_Pair_HOWTO#Transfer_rules

Word/chunk reordering

Now that "La libro estas bona" -> "The book is good" works, let's look at how chunk reordering works. In Esperanto you turn a sentence into a question by putting "Ĉu" at the start of the sentence: "Ĉu la libro estas bona?" In English the verb needs to come first: "Is the book good?".