User:Firespeaker/HFST bug
Revision as of 15:56, 28 April 2013 by Firespeaker (talk | contribs)
In 2011, a bug in how HFST handles words containing spaces was documented and resolved (apparently in r1518?), but the fix introduced a new bug. This page documents the new [incorrect!] behaviour. It appears to affect only transducers written in lexc.
A bug report was filed in January of 2013 along with a patch for a test case. In March of 2013, spectie posted a patch that fixed the bug but introduced an issue with newlines and full stops. As of today, the bug has still not been fixed.
test.lexc
Make sure to include the space in '% ' under Multichar_Symbols.
Multichar_Symbols
% 

LEXICON Root
erke:erke # ;
erke% me:erke% me # ;
medvedev:medvedev # ;
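The space after '%' is easy to lose in editors that strip trailing whitespace. One way around this is to generate the file with printf, as in the minimal sketch below. The function name write_test_lexc is hypothetical, and the one-entry-per-line layout is an assumption about how the lexicon was originally formatted.

```shell
#!/bin/sh
# Hypothetical helper: write the test lexicon to the path given as $1.
# printf is used so the space after '%' cannot be stripped by an editor.
write_test_lexc() {
    {
        printf 'Multichar_Symbols\n'
        printf '%% \n'                      # escaped space: '%' followed by one space
        printf '\n'
        printf 'LEXICON Root\n'
        printf 'erke:erke # ;\n'
        printf 'erke%% me:erke%% me # ;\n'  # '%%' prints a single literal '%'
        printf 'medvedev:medvedev # ;\n'
    } > "$1"
}

write_test_lexc test.lexc
```

Running `sed -n '2p' test.lexc | od -c` afterwards is a quick way to confirm that the second line really ends in a space.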
Compiling
$ hfst-lexc test.lexc -o test.hfst
$ hfst-invert test.hfst | hfst-fst2fst -w -o test.hfst.ol
Testing
Some correctly analysed forms
$ echo "erke" | hfst-proc test.hfst.ol
^erke/erke$
$ echo "erke me" | hfst-proc test.hfst.ol
^erke me/erke me$
$ echo "medvedev" | hfst-proc test.hfst.ol
^medvedev/medvedev$
The incorrectly analysed form
$ echo "erke medvedev" | hfst-proc test.hfst.ol
^erke medvedev/*erke medvedev$
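In the output above, the asterisk directly after the slash marks a form for which no analysis was found. When checking many inputs in a script, that marker can be detected mechanically. The sketch below assumes nothing beyond the output format shown on this page; the helper name has_unknown is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the analyser output read from
# stdin contains at least one unanalysed unit, i.e. a "/*" sequence.
has_unknown() {
    grep -q '/\*'
}

# Example using the output shown above; no HFST installation is needed.
if printf '%s\n' '^erke medvedev/*erke medvedev$' | has_unknown; then
    echo "unknown form found"
fi
```

With a compiled transducer, the same check could be applied as `echo "erke medvedev" | hfst-proc test.hfst.ol | has_unknown`.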
Expected output
This form is analysed correctly by a transducer identical to the one above except with the "erke me" form removed:
$ echo "erke medvedev" | hfst-proc test2.hfst.ol
^erke/erke$ ^medvedev/medvedev$
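The comparison transducer can be derived from the first lexicon by dropping the "erke me" entry and repeating the commands from the Compiling section. The following is only a sketch: the file names test2.lexc and test2.hfst are assumptions (this page only names test2.hfst.ol), and drop_erke_me is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: copy lexicon $1 to $2 without the "erke me" entry.
drop_erke_me() {
    grep -vF 'erke% me:' "$1" > "$2"
}

# Compile only if the input file and the HFST tools are available.
if [ -f test.lexc ] && command -v hfst-lexc >/dev/null 2>&1; then
    drop_erke_me test.lexc test2.lexc
    hfst-lexc test2.lexc -o test2.hfst
    hfst-invert test2.hfst | hfst-fst2fst -w -o test2.hfst.ol
fi
```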
Another test case
This one is meant to be more familiar to English-speakers :)
Multichar_Symbols
% 

LEXICON Root
word:word #;
word% form:word% form #;
formation:formation #;
$ echo "word formation" | hfst-proc test3.hfst.ol
^word formation/*word formation$
$ echo "formation word" | hfst-proc test3.hfst.ol
^formation/formation$ ^word/word$
Notes
- This doesn't seem to affect transducers written in other formats. E.g., the transducer that results from apertium-eng-kaz.eng.dix outputs the following:
$ echo "right there" | apertium -d . eng-kaz-morph
^right there/right there<adv>$^./.<sent>$
$ echo "right the" | apertium -d . eng-kaz-morph
^right/right<adj>/right<adv>/right<n><sg>$ ^the/the<det><def><sp>$^./.<sent>$
$ echo "right therein" | apertium -d . eng-kaz-morph
^right/right<adj>/right<adv>/right<n><sg>$ ^therein/*therein$^./.<sent>$
Other materials
- spectie explains the bug to firespeaker
- irc.freenode.net#hfst