User:Firespeaker/HFST bug
Latest revision as of 06:33, 27 May 2021
In 2011, a bug in how HFST handles words containing spaces was documented and resolved (http://sourceforge.net/p/hfst/bugs/59/), apparently in r1518 (http://hfst.svn.sourceforge.net/viewvc/hfst?view=revision&revision=1518), but the fix introduced a new bug. This page documents the new, incorrect behaviour. It appears to affect only transducers written in lexc.
A bug report (https://sourceforge.net/p/hfst/bugs/153/) was filed in January 2013 along with a patch for a test case. In March 2013, spectie posted a patch that fixed the bug but introduced an issue with newlines and full stops. As of today, the bug has still not been fixed.
test.lexc
Make sure to include the space in '% ' under Multichar_Symbols.
Multichar_Symbols
% 
LEXICON Root
erke:erke # ;
erke% me:erke% me # ;
medvedev:medvedev # ;
Compiling
$ hfst-lexc test.lexc -o test.hfst
$ hfst-invert test.hfst | hfst-fst2fst -w -o test.hfst.ol
Testing
Some correctly analysed forms
$ echo "erke" | hfst-proc test.hfst.ol
^erke/erke$
$ echo "erke me" | hfst-proc test.hfst.ol
^erke me/erke me$
$ echo "medvedev" | hfst-proc test.hfst.ol
^medvedev/medvedev$
The incorrectly analysed form
$ echo "erke medvedev" | hfst-proc test.hfst.ol
^erke medvedev/*erke medvedev$
Expected output
This form is analysed correctly by a transducer identical to the one above except with the "erke me" form removed:
$ echo "erke medvedev" | hfst-proc test2.hfst.ol
^erke/erke$ ^medvedev/medvedev$
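The source of the comparison transducer is not listed on this page. A minimal sketch of how it might be built, assuming a file name of test2.lexc (hypothetical; only test2.hfst.ol appears above) and reusing the commands from the Compiling section:

```shell
# Write the reduced lexicon: the same as test.lexc, minus the "erke me" entry.
# test2.lexc is an assumed file name. The '% ' line (percent sign followed by
# a space) is kept under Multichar_Symbols, as in the original lexicon.
printf '%s\n' \
  'Multichar_Symbols' \
  '% ' \
  'LEXICON Root' \
  'erke:erke # ;' \
  'medvedev:medvedev # ;' \
  > test2.lexc

# Compile and test only if the HFST tools are installed.
if command -v hfst-lexc >/dev/null 2>&1; then
  hfst-lexc test2.lexc -o test2.hfst
  hfst-invert test2.hfst | hfst-fst2fst -w -o test2.hfst.ol
  echo "erke medvedev" | hfst-proc test2.hfst.ol
fi
```

Using printf with one argument per line keeps the trailing space after the percent sign visible in the script, which is easy to lose in an editor.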
Another test case
This one is meant to be more familiar to English-speakers :)
Multichar_Symbols
% 
LEXICON Root
word:word #;
word% form:word% form #;
formation:formation #;
$ echo "word formation" | hfst-proc test3.hfst.ol
^word formation/*word formation$
$ echo "formation word" | hfst-proc test3.hfst.ol
^formation/formation$ ^word/word$
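In the outputs above, an analysis that begins with * marks a form the analyser could not analyse. A rough sketch of a check for this, where has_unknown is a hypothetical helper name and the test is simply whether '/*' occurs anywhere in the stream:

```shell
# has_unknown: read analyser output on stdin and succeed (exit 0) if any
# analysis starts with '*', i.e. a '*' directly follows a '/' separator.
# This is a simple textual check, not a full parser of the stream format.
has_unknown() {
  grep -q '/\*'
}

# Outputs copied from the test cases on this page.
if echo '^word formation/*word formation$' | has_unknown; then
  echo "bug present: 'word formation' was not split into two words"
fi
if ! echo '^formation/formation$ ^word/word$' | has_unknown; then
  echo "ok: 'formation word' analysed"
fi
```

With the HFST tools installed, the same helper could be fed directly from hfst-proc, e.g. echo "word formation" | hfst-proc test3.hfst.ol | has_unknown.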
Notes
- This doesn't seem to affect transducers written in other formats. E.g., the transducer that results from apertium-eng-kaz.eng.dix outputs the following:
$ echo "right there" | apertium -d . eng-kaz-morph
^right there/right there<adv>$^./.<sent>$
$ echo "right the" | apertium -d . eng-kaz-morph
^right/right<adj>/right<adv>/right<n><sg>$ ^the/the<det><def><sp>$^./.<sent>$
$ echo "right therein" | apertium -d . eng-kaz-morph
^right/right<adj>/right<adv>/right<n><sg>$ ^therein/*therein$^./.<sent>$
Other materials
- spectie explains the bug to firespeaker: http://wiki.apertium.org/wiki/Talk:Ideas_for_Google_Summer_of_Code/Closer_integration_with_HFST
- irc.oftc.net#hfst