Difference between revisions of "Postgenerator"

From Apertium
(lsx-proc -p)

Revision as of 20:59, 22 December 2022

Sometimes two tokens need to be merged in the output, for example in contractions such as de + el = del.

You can do this using the postgenerator.

First, add the postgenerator wakeup symbol (<a/>) to the relevant entries of your monolingual dictionary, e.g. apertium-aaa.aaa.dix.

apertium-aaa.aaa.dix:

   <pardef n="/de__pr">
     <e r="LR"><p><l>de</l><r>de<s n="pr"/></r></p></e>
     <e r="RL"><p><l><a/>de</l><r>de<s n="pr"/></r></p></e>
   </pardef>

...

   <e lm="de"><i></i><par n="/de__pr"/></e>

...

You should get entries like:

de:>:de<pr>
~de:<:de<pr>

from lt-expand apertium-aaa.aaa.dix.

Then create the postgeneration dictionary, apertium-aaa.post-aaa.dix:


<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="test"/>
  </sdefs>
  <section id="main" type="standard">

    <e><p><l><a/>de<b/>el</l><r>del</r></p></e>
  </section>
</dictionary>

You can compile it like:

$ lt-comp lr apertium-aaa.post-aaa.dix aaa.autopgen.bin
main@standard 7 6

And use it like:

$ echo "~de el" | lt-proc -p aaa.autopgen.bin 
del
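
Other contractions can be added in the same way. As a sketch only, assuming a hypothetical preposition a that contracts with el to give al (neither the paradigm name /a__pr nor these entries appear in the example above), the two dictionaries might gain the following fragments:

```xml
<!-- apertium-aaa.aaa.dix, pardefs section:
     hypothetical paradigm for "a", modelled on /de__pr above.
     The RL (generation) entry carries the wakeup symbol <a/>. -->
<pardef n="/a__pr">
  <e r="LR"><p><l>a</l><r>a<s n="pr"/></r></p></e>
  <e r="RL"><p><l><a/>a</l><r>a<s n="pr"/></r></p></e>
</pardef>

<!-- apertium-aaa.aaa.dix, main section: hypothetical lemma entry -->
<e lm="a"><i></i><par n="/a__pr"/></e>

<!-- apertium-aaa.post-aaa.dix, main section:
     hypothetical entry merging "~a el" into "al" -->
<e><p><l><a/>a<b/>el</l><r>al</r></p></e>
```

Both dictionaries would need to be recompiled afterwards. By analogy with the de example, the generator should then emit ~a for the preposition, and lt-proc -p should rewrite ~a el as al.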

In your modes file:

...
      <program name="lt-proc $1">
        <file name="aaa-bbb.autogen.bin"/>
      </program>
      <program name="lt-proc -p">
        <file name="aaa-bbb.autopgen.bin"/>
      </program>
...

Postgeneration Using apertium-separable

If you have at least version 0.7.0 of apertium-separable, you can accomplish the same as above using lsx-proc.

This allows you to write postgeneration rules conditioned on lemmas and tags rather than needing multiple copies of each relevant dictionary entry.

For the de + el → del rule above, we can write the following:

 1 <?xml version="1.0" encoding="UTF-8"?>
 2 <dictionary type="separable">
 3   <alphabet/>
 4   <sdefs>
 5     <sdef n="pr"/>
 6     <sdef n="det"/>
 7   </sdefs>
 8   <section id="main" type="standard">
 9     <e>
10       <i>de<s n="pr"/>/d</i>
11       <p><l>e</l><r></r></p>
12       <i><d space="no"/>el<s n="det"/><t/>/el</i>
13     </e>
14   </section>
15 </dictionary>

Note the literal slashes on lines 10 and 12.

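This will turn ^de<pr>/de$ ^el<det><def><m><sg>/el$ into ^de<pr>/d$^el<det><def><m><sg>/el$. The final e of de is dropped and the space between the two words is removed, so the surface forms read del.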

If this were a rule that applied for any word beginning with an e rather than just the definite article, we could skip the lemma and tags and just write

<e>
  <i>de<s n="pr"/>/d</i>
  <p><l>e</l><r></r></p>
  <i><d space="no"/>el</i>
</e>

Here there is a slash in the first word, because we are listing both the analysis and the surface form, but there is no slash in the second because we are only listing the surface form.
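
For reference, the shorter rule could sit in a complete file along the following lines. This is a minimal sketch that reuses the header of the full example above. The det symbol definition is left out on the assumption that only symbols actually mentioned in a rule need to be declared, so add it back if your rules use it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">
  <alphabet/>
  <sdefs>
    <!-- only pr is referenced by the rule below (assumption: unused symbols may be omitted) -->
    <sdef n="pr"/>
  </sdefs>
  <section id="main" type="standard">
    <e>
      <!-- first word: analysis and surface form, hence the literal slash -->
      <i>de<s n="pr"/>/d</i>
      <!-- drop the final e of the preposition's surface form -->
      <p><l>e</l><r></r></p>
      <!-- second word: surface form only, joined without a space -->
      <i><d space="no"/>el</i>
    </e>
  </section>
</dictionary>
```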

This is, admittedly, a bit more complicated to write than the lt-proc way of doing postgen, but it has several advantages. Rules can apply to each other's output if lsx-proc is run with -r/--repeat, and case handling can be moved out of the postgenerator to the dedicated capitalization restoration module. Both may significantly reduce the number of rules. In addition, the morphological dictionaries do not need dedicated entries with the alarm character (<a/>).