Online Apertium Workshop 2020/Ugly hacks
Latest revision as of 13:37, 2 July 2020
I just realised during this whole process that trimming is behind many quality issues in some Apertium dix, and that these can infect other dix even when trimming is used. Here's a non-exhaustive list of typical things that have frustrated me and that I've piled ugly hacks on. If you have your own, please add them here now. This is just to list the type of things we think we are talking about today...
(As a little background: the dev script I use to extend dixes uses the non-trimmed dix, so I see these more often than most... I don't mean to point fingers at the apertium-eng devs, but I had a decent amount of memorable problems with it while working on the WMT shared task for fin-eng.)
Where to find them
If you are building your dix well, you can find your ugly hacks by fgrepping for c=
...
Nobody has time for that, fgrep for LR, RL, <j and so on...
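If plain fgrep gets tedious, the same search can be sketched in a few lines of Python. This is only a rough illustration, not an existing Apertium tool: the marker list is taken from the text above (c=, LR, RL, <j/>), while the names HACK_PATTERNS and find_hacks are made up for this example.

```python
import re

# Markers mentioned in the text above: comment attributes (c=),
# direction restrictions (r="LR" / r="RL") and join tags (<j/>).
# The dictionary name and keys are hypothetical, chosen for this sketch.
HACK_PATTERNS = {
    "comment": re.compile(r'\bc="'),
    "LR": re.compile(r'\br="LR"'),
    "RL": re.compile(r'\br="RL"'),
    "join": re.compile(r"<j\s*/>"),
}


def find_hacks(lines):
    """Return (line_number, marker_names, stripped_line) for suspicious lines."""
    results = []
    for number, line in enumerate(lines, start=1):
        hits = [name for name, pattern in HACK_PATTERNS.items()
                if pattern.search(line)]
        if hits:
            results.append((number, hits, line.strip()))
    return results


# Small demo on entries quoted elsewhere on this page.
sample = [
    '<e r="RL" c="deu guesser"><p><l>juhlia<s n="vblex"/></l>'
    '<r c="maybe german guesser fail?">anniversarfeieren<s n="vblex"/></r></p></e>',
    '<e><p><l>pohjoinen<s n="n"/></l><r>north<s n="n"/></r></p></e>',
    '<e r="RL"><p><l>tässä<s n="adv"/><j/><b/>maa<s n="n"/><s n="sg"/><s n="ine"/></l>'
    '<r>in<b/>this<b/>country<s n="adv"/></r></p></e>',
]
for number, hits, line in find_hacks(sample):
    print(number, ",".join(hits), line[:60])
```

To run it on a real dix, pass the lines of an opened file to find_hacks instead of the sample list.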
Multiwords
I found these multiwords in eng when debug-adding a Finnish dictionary:
<pre>
<e lm="a holiday">        <i>a<b/>holiday</i><par n="house__n"/></e>
<e lm="foreign language"><i>foreign<b/>language</i><par n="house__n"/></e>
<e lm="very strong wind"><i>very<b/>strong<b/>wind</i><par n="house__n"/></e>
<e lm="shortly after">    <i>shortly<b/>after</i><par n="at__pr"/></e>
<e r="RL" lm="shortly after"><i>shortly<b/>after</i><par n="after__cnjadv"/></e>
<e r="LR" lm="that's why"><p><l>that's<b/>why</l><r>that<b/>is<b/>why</r></p><par n="after__cnjadv"/></e>
</pre>
I can understand that these sometimes match a single word in one language pair, but I would be tempted to call them hacks that are not really ideal in a monodix that others use too...
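One way to list such entries in a monodix is to look for <e> elements that contain a <b/> blank. The following is a minimal sketch using only the Python standard library; the function name multiword_entries and the <section> wrapper around the sample are assumptions for the example, and a real dix would be read from a file.

```python
import xml.etree.ElementTree as ET


def multiword_entries(dix_xml):
    """Return the lm attribute of every <e> that contains a <b/> blank."""
    root = ET.fromstring(dix_xml)
    found = []
    for entry in root.iter("e"):
        # <b/> stands for a space inside <i>, <l> or <r>,
        # so its presence means the entry spans several words.
        if entry.find(".//b") is not None:
            found.append(entry.get("lm"))
    return found


# Sample built from entries quoted above plus one ordinary entry.
sample = """<section>
<e lm="a holiday"><i>a<b/>holiday</i><par n="house__n"/></e>
<e lm="house"><i>house</i><par n="house__n"/></e>
<e r="LR" lm="that's why"><p><l>that's<b/>why</l><r>that<b/>is<b/>why</r></p><par n="after__cnjadv"/></e>
</section>"""

for lemma in multiword_entries(sample):
    print(lemma)
```

Whether a listed entry is really a hack still needs a human decision; the sketch only narrows down where to look.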
Various fin-eng hack solutions:
<pre>
<e r="RL"><p><l>vieras<b/>kieli<s n="n"/></l><r>foreign<b/>language<s n="n"/></r></p></e>
<e r="RL"><p><l>loma<s n="n"/></l><r>a<b/>holiday<s n="n"/></r></p></e>
<e r="RL"><p><l>tänä<b/>syksynä<s n="adv"/></l><r>this<b/>autumn<s n="adv"/></r></p></e>
<e r="RL"><p><l>tässä<s n="adv"/><j/><b/>maa<s n="n"/><s n="sg"/><s n="ine"/></l><r>in<b/>this<b/>country<s n="adv"/></r></p></e>
</pre>
Poses
Non-trimming will reveal tons of rarer parses, odd zero-derivations and whatnot...
This should be easy:
<pre>
echo north | lt-proc eng.automorf.bin
^north/north<adv>/north<n><sg>/north<adj><sint>$
</pre>
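To see at a glance which POSes a word gets, the lt-proc output above can be split into readings. This is a simplified sketch: the function name parse_lexical_unit is made up here, and it deliberately ignores escaped slashes and joined readings, so treat it as an illustration rather than a full parser of the Apertium stream format.

```python
import re


def parse_lexical_unit(unit):
    """Split one ^surface/analysis/...$ unit into (surface, [(lemma, [tags])])."""
    match = re.fullmatch(r"\^(.*)\$", unit.strip())
    if match is None:
        raise ValueError("not a lexical unit: %r" % unit)
    surface, *analyses = match.group(1).split("/")
    readings = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]          # text before the first tag
        tags = re.findall(r"<([^>]+)>", analysis)  # tag names without brackets
        readings.append((lemma, tags))
    return surface, readings


surface, readings = parse_lexical_unit(
    "^north/north<adv>/north<n><sg>/north<adj><sint>$"
)
for lemma, tags in readings:
    print(surface, lemma, tags)
```

For the example above this gives three readings of north, with first tags adv, n and adj.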
What's a canonical hack to catch them all when developing for fin-eng?
<pre>
bidix?:
<e><p><l>pohjoinen<s n="n"/></l><r>north<s n="n"/></r></p></e>
<e r="RL"><p><l>pohjoinen<s n="n"/></l><r>north<s n="adj"/></r></p></e>
<e r="RL"><p><l>pohjoinen<s n="n"/><s n="sg"/><s n="ill"/></l><r>north<s n="adv"/></r></p></e>
bidix?:
<e><p><l>pohjoinen<s n="n"/></l><r>north<s n="n"/></r></p></e>
<e r="RL"><p><l>pohjoinen<s n="adj"/></l><r>north<s n="adj"/></r></p></e>
<e r="RL"><p><l>pohjoinen<s n="adv"/></l><r>north<s n="adv"/></r></p></e>
monodix? (Lexc):
pohjoinen+N # ;
pohjoisellinen+Adj # ; ! +Use/MT
pohjoisesti+Adv # ; ! +Use/MT
something else:
</pre>
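Whichever option is picked, it may help to check which POS readings of the English word the bidix does not cover yet. The following is a rough sketch under several assumptions: the helper names side_to_reading and uncovered_pos are invented for this example, only the first tag of the right side is compared, and the bidix snippet is wrapped in a <section> element so that it parses on its own.

```python
import xml.etree.ElementTree as ET


def side_to_reading(side):
    """Turn an <l> or <r> element into (lemma, [tags]); <b/> becomes a space."""
    lemma = side.text or ""
    tags = []
    for child in side:
        if child.tag == "b":
            lemma += " "
        elif child.tag == "s":
            tags.append(child.get("n"))
        lemma += child.tail or ""
    return lemma, tags


def uncovered_pos(bidix_xml, lemma, pos_tags):
    """Return the POS tags of `lemma` that no <r> side in the bidix has."""
    root = ET.fromstring(bidix_xml)
    covered = set()
    for right in root.iter("r"):
        r_lemma, r_tags = side_to_reading(right)
        if r_lemma == lemma and r_tags:
            covered.add(r_tags[0])  # compare the first tag (the POS) only
    return [pos for pos in pos_tags if pos not in covered]


# Bidix with only the noun entry; the analyser also gives adv and adj.
noun_only = """<section>
<e><p><l>pohjoinen<s n="n"/></l><r>north<s n="n"/></r></p></e>
</section>"""
print(uncovered_pos(noun_only, "north", ["adv", "n", "adj"]))
```

With either of the bidix options above added, the same call returns an empty list, so the check itself does not say which option is better.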
Guessers
When not trimming, guessery things and maybe other derivations also start to pop up:
<pre>
<e r="RL" c="deu guesser"><p><l>juhlia<s n="vblex"/></l><r c="maybe german guesser fail?">anniversarfeieren<s n="vblex"/></r></p></e>
<e r="RL" c="deu guesser"><p><l>kirjoittaa<s n="vblex"/><j/>mielipidekirjoitus<s n="n"/><s n="sg"/><s n="nom"/></l><r c="maybe german guesser fail?">grundsatzpapieren<s n="vblex"/></r></p></e>
</pre>
Here the deu guesser marks .*eier as vblex for .*eieren; this shows up at least for words that would otherwise be OOVs...
(de)Compounding
E.g. in fin-deu there are multiple levels of de-compounding fails:
<pre>
<e r="RL" c="nonSI unit"><p><l>beli<s n="n"/></l><r c="usually compound fail">bel<s n="n"/><s n="nt"/></r></p></e>
<e r="RL"><p><l>lotto<s n="n"/></l><r c="usually compound fail, c.f. suffix -los">los<s n="n"/><s n="nt"/></r></p></e>
</pre>