Post-generator

Latest revision as of 18:14, 8 July 2022

Many languages use a post-generator FST to fix minor orthographical issues. This FST is in lttoolbox format and is run by lt-proc with the -p or --post-generation switch. An example of such an orthographical issue is the "a" vs. "an" distinction in English: the English generator outputs ~a, and the post-generation FST changes that to "a" or "an" depending on the following word.

The source dictionary is typically named something like apertium-cat.post-cat.dix, while the compiled file gets a name like spa-cat.autopgen.bin.

Here's a minimal example for turning ~a into "an" before vowels (in the entry, <a/> stands for the ~ mark and <b/> for a blank):

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n" c="Noun"/>
  </sdefs>
  <pardefs>
    <pardef n="vocals">
      <e>
        <i>a</i>
      </e>
      <e>
        <i>e</i>
      </e>
      <e>
        <i>i</i>
      </e>
      <e>
        <i>o</i>
      </e>
      <e>
        <i>u</i>
      </e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e>
      <p>
        <l><a/>a<b/></l>
        <r>an<b/></r>
      </p>
      <par n="vocals"/>
    </e>
  </section>
</dictionary>


Debugging

A debugging version of a postgen dictionary can be made by compiling with the -d/--debug flag, for example:

lt-comp --debug lr apertium-cat.post-cat.dix cat-debug.bin

The output will then include the approximate line number of each rule that applies, as in:

$ echo "~a apple" | lt-proc -p cat-debug.bin
Line near 29 an apple

Unfortunately, due to a limitation of the XML parsing library that lt-comp uses, the line number reported will often be a few lines past the entry in question. To help with this, a comment can be added to the entry with the c attribute, like

<e c="indef+vowel">

and then the output will be

$ echo "~a apple" | lt-proc -p cat-debug.bin
Line near 29 indef+vowel an apple
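
Putting the two pieces together, the entry from the minimal example with the comment attached would presumably look like the following sketch. Everything here is taken from the example above; only the c attribute on the opening <e> tag is new, and the label text "indef+vowel" is just the illustrative comment used in this section.

```xml
<!-- Sketch: the "an before vowels" entry from the minimal example,
     with a c attribute added so the debug output can name the rule. -->
<e c="indef+vowel">
  <p>
    <l><a/>a<b/></l>
    <r>an<b/></r>
  </p>
  <par n="vocals"/>
</e>
```

Since the reported line number is only approximate, a short, distinctive label per entry is likely to be the easier way to tell which rule applied.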