Difference between revisions of "Postgenerator"
Popcorndude (talk | contribs) (→Postgeneration Using apertium-separable: we don't use literal slashes anymore)
Latest revision as of 19:39, 1 March 2024
Sometimes you want to be able to merge two tokens in output, for example for contractions, e.g. de + el = del.
You can do this using the postgenerator.
First, make sure you add the postgenerator wakeup symbol (<a/>, which shows up as ~ in the output) to your monolingual dictionary, e.g. apertium-aaa.aaa.dix:
<pardef n="/de__pr">
  <e r="LR"><p><l>de</l><r>de<s n="pr"/></r></p></e>
  <e r="RL"><p><l><a/>de</l><r>de<s n="pr"/></r></p></e>
</pardef>
...
<e lm="de"><i></i><par n="/de__pr"/></e>
...
Running lt-expand apertium-aaa.aaa.dix should then give entries like:
de:>:de<pr>
~de:<:de<pr>
Next, write the postgeneration dictionary, apertium-aaa.post-aaa.dix:
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="test"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l><a/>de<b/>el</l><r>del</r></p></e>
  </section>
</dictionary>
You can compile it like:
$ lt-comp lr apertium-aaa.post-aaa.dix aaa.autopgen.bin
main@standard 7 6
And use it like:
$ echo "~de el" | lt-proc -p aaa.autopgen.bin
del
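To make the matching step concrete, here is a toy Python model of what the postgeneration dictionary above does to the text. It is only a sketch and not how lt-proc is implemented: the names flatten, load_rules and toy_postgen are invented for illustration, and the sketch assumes that a match has to end at a word boundary and that a wakeup mark which triggers no rule is simply dropped.

```python
import re
import xml.etree.ElementTree as ET

# The same dictionary as apertium-aaa.post-aaa.dix above.
POST_DIX = """<dictionary>
  <alphabet/>
  <sdefs><sdef n="test"/></sdefs>
  <section id="main" type="standard">
    <e><p><l><a/>de<b/>el</l><r>del</r></p></e>
  </section>
</dictionary>"""


def flatten(elem):
    """Render the contents of <l> or <r> as text: <a/> becomes '~', <b/> a space."""
    marks = {"a": "~", "b": " "}
    out = elem.text or ""
    for child in elem:
        out += marks.get(child.tag, "") + (child.tail or "")
    return out


def load_rules(xml_text):
    """Map each left side (with wakeup mark) to its contracted right side."""
    root = ET.fromstring(xml_text)
    return {flatten(p.find("l")): flatten(p.find("r")) for p in root.iter("p")}


def toy_postgen(text, rules):
    """Apply the rules, longest left side first, then drop unused wakeup marks."""
    for left in sorted(rules, key=len, reverse=True):
        # Assumption of this sketch: the match must end at a word boundary.
        pattern = re.escape(left) + r"(?!\w)"
        text = re.sub(pattern, lambda m, right=rules[left]: right, text)
    return text.replace("~", "")


rules = load_rules(POST_DIX)
print(toy_postgen("~de el", rules))
print(toy_postgen("~de la", rules))
```

The first call mirrors the lt-proc -p example above; the second shows the case where no rule applies and only the wakeup mark is removed.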
In your modes file:
...
<program name="lt-proc $1">
  <file name="aaa-bbb.autogen.bin"/>
</program>
<program name="lt-proc -p">
  <file name="aaa-bbb.autopgen.bin"/>
</program>
...
Postgeneration Using apertium-separable
If you have at least version 0.7.0 of apertium-separable, you can accomplish the same as above using lsx-proc.
This allows you to write postgeneration rules conditioned on lemmas and tags rather than needing multiple copies of each relevant dictionary entry.
For the de + el → del rule above, we can write the following:
 1 <?xml version="1.0" encoding="UTF-8"?>
 2 <dictionary type="separable">
 3   <alphabet/>
 4   <sdefs>
 5     <sdef n="pr"/>
 6     <sdef n="det"/>
 7   </sdefs>
 8   <section id="main" type="standard">
 9     <e>
10       <i>de<s n="pr"/><f/>d</i>
11       <p><l>e</l><r></r></p>
12       <i><d space="no"/>el<s n="det"/><t/><f/>el</i>
13     </e>
14   </section>
15 </dictionary>
The <f/> on lines 10 and 12 represents a reading separator (/ in stream format).
This will turn ^de<pr>/de$ ^el<det><def><m><sg>/el$ into ^de<pr>/d$^el<det><def><m><sg>/el$.
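The effect of this rule on the stream can be imitated with a regular expression. The Python sketch below is an illustration only: lsx-proc works from a compiled dictionary rather than from regexes, and the names RULE and apply_rule are hypothetical.

```python
import re

# The preposition "de", a space, then the determiner "el" with any further tags.
RULE = re.compile(r"\^de<pr>/de\$ \^(el<det>(?:<[^<>]+>)*/el)\$")


def apply_rule(stream: str) -> str:
    """Shorten the surface form of "de" to "d" and remove the blank before "el"."""
    return RULE.sub(lambda m: "^de<pr>/d$^" + m.group(1) + "$", stream)


before = "^de<pr>/de$ ^el<det><def><m><sg>/el$"
print(apply_rule(before))
```

Because the pattern is conditioned on the lemma and tags, a stream such as ^de<pr>/de$ ^la<det><def><f><sg>/la$ passes through unchanged.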
If this were a rule that applied for any word beginning with an e rather than just the definite article, we could skip the lemma and tags and just write
<e>
  <i>de<s n="pr"/><f/>d</i>
  <p><l>e</l><r></r></p>
  <i><d space="no"/>el</i>
</e>
Here there is a <f/>, representing a slash, in the first word, because we are listing both the analysis and the surface form, but there is no <f/>/slash in the second because we are only listing the surface form.
This is, admittedly, a bit more complicated to write than the lt-proc way of doing postgen, but it does allow the rules to apply to each other's output if lsx-proc is run with -r/--repeat, and it also allows case handling to be moved out of the postgenerator to the dedicated module, both of which may significantly reduce the number of rules. In addition, the morphological dictionaries do not need to create dedicated entries with the alarm character.