Postgenerator

From Apertium
Revision as of 19:39, 1 March 2024 by Popcorndude (→ Postgeneration Using apertium-separable: we don't use literal slashes anymore)

Sometimes you want to merge two tokens in the output, for example to form contractions such as de + el = del.

You can do this using the postgenerator.

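First, add the postgenerator wakeup symbol, <a/>, to the relevant entries in your monolingual dictionary, e.g. apertium-aaa.aaa.dix. In the example below it appears only on the RL (generation) entry, so analysis is unaffected.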

apertium-aaa.aaa.dix:

   <pardef n="/de__pr">
     <e r="LR"><p><l>de</l><r>de<s n="pr"/></r></p></e>
     <e r="RL"><p><l><a/>de</l><r>de<s n="pr"/></r></p></e>
   </pardef>

...

   <e lm="de"><i></i><par n="/de__pr"/></e>

...

You should get entries like:

de:>:de<pr>
~de:<:de<pr>

from lt-expand apertium-aaa.aaa.dix.

Next, write the postgeneration dictionary, apertium-aaa.post-aaa.dix:


<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="test"/>
  </sdefs>
  <section id="main" type="standard">

     <e> <p><l><a/>de<b/>el</l><r>del</r></p></e>
  </section>
</dictionary>

You can compile it like:

$ lt-comp lr apertium-aaa.post-aaa.dix aaa.autopgen.bin
main@standard 7 6

And use it like:

$ echo "~de el" | lt-proc -p aaa.autopgen.bin 
del
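
Further contractions follow the same pattern. As a hedged sketch, assume a Spanish-like language in which a + el contracts to al. The pardef name /a__pr and all three entries below are illustrative and do not come from any real dictionary. The monolingual dictionary would get a second wakeup-marked entry, and the postgeneration dictionary one more rule:

```xml
<!-- apertium-aaa.aaa.dix: hypothetical addition, mirroring the /de__pr pardef -->
<pardef n="/a__pr">
  <e r="LR"><p><l>a</l><r>a<s n="pr"/></r></p></e>
  <e r="RL"><p><l><a/>a</l><r>a<s n="pr"/></r></p></e>
</pardef>

<e lm="a"><i></i><par n="/a__pr"/></e>

<!-- apertium-aaa.post-aaa.dix: hypothetical addition inside the main section -->
<e><p><l><a/>a<b/>el</l><r>al</r></p></e>
```

Under these assumptions, and after recompiling both dictionaries, the input "~a el" to lt-proc -p would be expected to come out as "al".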

In your modes file, run the postgenerator immediately after the generator:

...
      <program name="lt-proc $1">
        <file name="aaa-bbb.autogen.bin"/>
      </program>
      <program name="lt-proc -p">
        <file name="aaa-bbb.autopgen.bin"/>
      </program>
...

Postgeneration Using apertium-separable

With apertium-separable 0.7.0 or later, you can achieve the same result using lsx-proc.

This allows you to write postgeneration rules conditioned on lemmas and tags rather than needing multiple copies of each relevant dictionary entry.

For the de + el → del rule above, we can write the following:

 1 <?xml version="1.0" encoding="UTF-8"?>
 2 <dictionary type="separable">
 3   <alphabet/>
 4   <sdefs>
 5     <sdef n="pr"/>
 6     <sdef n="det"/>
 7   </sdefs>
 8   <section id="main" type="standard">
 9     <e>
10       <i>de<s n="pr"/><f/>d</i>
11       <p><l>e</l><r></r></p>
12       <i><d space="no"/>el<s n="det"/><t/><f/>el</i>
13     </e>
14   </section>
15 </dictionary>

The <f/> on lines 10 and 12 represents a reading separator (/ in stream format).

This will turn ^de<pr>/de$ ^el<det><def><m><sg>/el$ into ^de<pr>/d$^el<det><def><m><sg>/el$.

If this were a rule that applied for any word beginning with an e rather than just the definite article, we could skip the lemma and tags and just write

<e>
  <i>de<s n="pr"/><f/>d</i>
  <p><l>e</l><r></r></p>
  <i><d space="no"/>el</i>
</e>

Here the first word contains an <f/> (a slash), because it lists both the analysis and the surface form; the second word has no <f/>, because it lists only the surface form.
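
The same building blocks can also express a contraction that shortens the second word instead of the first. The following is an untested sketch written by analogy with the de + el rule, assuming a hypothetical a + el → al contraction and the pr and det symbols already declared in the dictionary above. It is not taken from an existing language pair.

```xml
<!-- Hypothetical rule: keep "a" intact, drop the "e" from the surface form of "el". -->
<e>
  <i>a<s n="pr"/><f/>a</i>
  <i><d space="no"/>el<s n="det"/><t/><f/></i>
  <p><l>e</l><r></r></p>
  <i>l</i>
</e>
```

If the rule behaves like the one above, ^a<pr>/a$ ^el<det><def><m><sg>/el$ would be expected to become ^a<pr>/a$^el<det><def><m><sg>/l$.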

The lsx-proc approach is admittedly a bit more complicated to write than the lt-proc way of doing postgeneration, but it has several advantages. Rules can apply to each other's output if lsx-proc is run with -r/--repeat, and case handling can be moved out of the postgenerator into the dedicated module; both may significantly reduce the number of rules. In addition, the morphological dictionaries do not need dedicated entries carrying the wakeup symbol.