Difference between revisions of "Post-generator"
(Created page with 'Many languages use a post-generator FST to fix minor orthographical issues. This FST is in lttoolbox format and is run by <code>lt-proc</code> with the <code>-p</code> or <co…') |
Popcorndude (talk | contribs) (debugging postgen) |
||
(3 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | Many languages use a post-generator FST to fix minor orthographical issues. This FST is in [[lttoolbox]] format and is run by <code>lt-proc</code> with the <code>-p</code> or <code>--post-generation</code> switch. An example of such an orthographical issue is the "a" vs "an" difference in English. The english generator will output <code>~a</code>, and the post-generation FST changes that to a or an depending on the following word. |
+ | Many languages use a '''post-generator''' FST to fix minor orthographical issues. This FST is in [[lttoolbox]] format and is run by <code>lt-proc</code> with the <code>-p</code> or <code>--post-generation</code> switch. An example of such an orthographical issue is the "a" vs "an" difference in English. The English generator will output <code>~a</code>, and the post-generation FST changes that to "a" or "an" depending on the following word. |
+ | |||
+ | The source dictionary is typically named something like <code>apertium-cat.post-cat.dix</code>, while the compiled file gets a name like <code>spa-cat.autopgen.bin</code>. |
+ | |||
+ | Here's a minimal example for turning ~a into an before vowels: |
+ | <pre> |
+ | <?xml version="1.0" encoding="UTF-8"?> |
+ | <dictionary> |
+ | <alphabet/> |
+ | <sdefs> |
+ | <sdef n="n" c="Noun"/> |
+ | </sdefs> |
+ | <pardefs> |
+ | <pardef n="vocals"> |
+ | <e> |
+ | <i>a</i> |
+ | </e> |
+ | <e> |
+ | <i>e</i> |
+ | </e> |
+ | <e> |
+ | <i>i</i> |
+ | </e> |
+ | <e> |
+ | <i>o</i> |
+ | </e> |
+ | <e> |
+ | <i>u</i> |
+ | </e> |
+ | </pardef> |
+ | </pardefs> |
+ | <section id="main" type="standard"> |
+ | <e> |
+ | <p> |
+ | <l><a/>a<b/></l> |
+ | <r>an<b/></r> |
+ | </p> |
+ | <par n="vocals"/> |
+ | </e> |
+ | </section> |
+ | </dictionary> |
+ | </pre> |
+ | |||
+ | |||
+ | == Debugging == |
+ | |||
+ | A debugging version of a postgen dictionary can be made by compiling with the <code>-d</code>/<code>--debug</code> flag, such as with |
+ | |||
+ | lt-comp --debug lr apertium-cat.post-cat.dix cat-debug.bin |
+ | |||
+ | Then the output will include the approximate line number of each rule that applies as in |
+ | |||
+ | $ echo "~a apple" | lt-proc -p cat-debug.bin |
+ | Line near 29 an apple |
+ | |||
+ | Unfortunately, due to a limitation of the XML parsing library that lt-comp uses, the line number reported will often be a few lines past the entry in question. To help with this, a comment can be added to the entry like |
+ | |||
+ | <e c="indef+vowel"> |
+ | |||
+ | and then the output will be |
+ | |||
+ | $ echo "~a apple" | lt-proc -p cat-debug.bin |
+ | Line near 29 indef+vowel an apple |
+ | |||
+ | [[Category:Lttoolbox]] |
+ | [[Category:Morphological analysers]] |
+ | [[Category:Documentation in English]] |
Latest revision as of 18:14, 8 July 2022
Many languages use a post-generator FST to fix minor orthographical issues. This FST is in lttoolbox format and is run by lt-proc with the -p or --post-generation switch. An example of such an orthographical issue is the "a" vs "an" difference in English. The English generator will output ~a, and the post-generation FST changes that to "a" or "an" depending on the following word.
The source dictionary is typically named something like apertium-cat.post-cat.dix, while the compiled file gets a name like spa-cat.autopgen.bin.
Here's a minimal example for turning "~a" into "an" before vowels:

  <?xml version="1.0" encoding="UTF-8"?>
  <dictionary>
    <alphabet/>
    <sdefs>
      <sdef n="n" c="Noun"/>
    </sdefs>
    <pardefs>
      <pardef n="vocals">
        <e>
          <i>a</i>
        </e>
        <e>
          <i>e</i>
        </e>
        <e>
          <i>i</i>
        </e>
        <e>
          <i>o</i>
        </e>
        <e>
          <i>u</i>
        </e>
      </pardef>
    </pardefs>
    <section id="main" type="standard">
      <e>
        <p>
          <l><a/>a<b/></l>
          <r>an<b/></r>
        </p>
        <par n="vocals"/>
      </e>
    </section>
  </dictionary>
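The effect of the single entry in this example can be mimicked in a few lines of Python. This is only a sketch of the input/output behaviour, not a description of how lt-proc works internally. It assumes that the blank is a single space and that only the five lowercase vowels of the "vocals" paradigm count; the function name apply_indef_rule is made up for illustration. Text that the rule does not match is returned unchanged, since what lt-proc does with it is not modelled here.

```python
import re

# Hypothetical stand-in for the one entry in the example dictionary:
#   "~a" + blank + lowercase vowel  ->  "an" + blank + vowel
# Assumption: the blank is a single space. The vowel set mirrors the
# "vocals" paradigm (a, e, i, o, u); the lookahead leaves the vowel in place.
_INDEF_BEFORE_VOWEL = re.compile(r"~a (?=[aeiou])")


def apply_indef_rule(text: str) -> str:
    """Rewrite "~a " to "an " wherever a lowercase vowel follows."""
    return _INDEF_BEFORE_VOWEL.sub("an ", text)


if __name__ == "__main__":
    print(apply_indef_rule("~a apple"))  # an apple
```

Only the vowel case is covered because the example dictionary contains only that entry.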
Debugging
A debugging version of a postgen dictionary can be made by compiling with the -d/--debug flag, such as with
lt-comp --debug lr apertium-cat.post-cat.dix cat-debug.bin
Then the output will include the approximate line number of each rule that applies, as in

  $ echo "~a apple" | lt-proc -p cat-debug.bin
  Line near 29 an apple
Unfortunately, due to a limitation of the XML parsing library that lt-comp uses, the line number reported will often be a few lines past the entry in question. To help with this, a comment can be added to the entry with the c attribute, like
<e c="indef+vowel">
and then the output will be

  $ echo "~a apple" | lt-proc -p cat-debug.bin
  Line near 29 indef+vowel an apple
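Adding a comment to every entry by hand gets tedious in a larger dictionary. As a rough sketch, a short Python script could fill in a placeholder c attribute on each entry that lacks one. The function name label_entries and the "entry-N" label scheme are made up for illustration, and the script assumes that only the e elements directly inside a section need labels. Note that the standard-library parser drops XML comments and changes whitespace, so this is better suited to a throwaway debugging copy than to the real source file.

```python
import xml.etree.ElementTree as ET


def label_entries(dix_xml: str, prefix: str = "entry") -> str:
    """Return the dictionary XML with a c="..." label on every <e> in a <section>.

    Entries that already carry a c attribute keep it. The "prefix-N" labels are
    a hypothetical naming scheme, numbered by position across all sections.
    """
    root = ET.fromstring(dix_xml)
    counter = 0
    for section in root.iter("section"):
        for entry in section.findall("e"):
            counter += 1
            if "c" not in entry.attrib:
                entry.set("c", f"{prefix}-{counter}")
    return ET.tostring(root, encoding="unicode")


# Small made-up dictionary: one unlabelled entry, one already labelled.
sample = """<dictionary>
  <section id="main" type="standard">
    <e><p><l><a/>a<b/></l><r>an<b/></r></p><par n="vocals"/></e>
    <e c="keep-me"><p><l>x</l><r>y</r></p></e>
  </section>
</dictionary>"""

labelled = label_entries(sample)

if __name__ == "__main__":
    print(labelled)
```

Descriptive hand-written labels such as indef+vowel are still easier to read in the debug output; generated labels only help to pin down which entry fired.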