User:Firespeaker/HFST bug

In 2011, a bug in how HFST handles words containing spaces was documented and resolved (apparently in r1518?), but the fix introduced a new bug. This page documents the new [incorrect!] behaviour. It appears to affect only transducers written in lexc.

A bug report was filed in January of 2013 along with a patch for a test case. In March of 2013, spectie posted a patch that fixed the bug but introduced an issue with newlines and full stops. As of today, the bug has still not been fixed.

test.lexc

Make sure to include the space in '% ' under Multichar_Symbols.

Multichar_Symbols

% 

LEXICON Root

erke:erke # ;
erke% me:erke% me # ;
medvedev:medvedev # ;

Compiling

  $ hfst-lexc test.lexc -o test.hfst
  $ hfst-invert test.hfst | hfst-fst2fst -w -o test.hfst.ol

Testing

Some correctly analysed forms

  • $ echo "erke" | hfst-proc test.hfst.ol
^erke/erke$
  • $ echo "erke me" | hfst-proc test.hfst.ol
^erke me/erke me$
  • $ echo "medvedev" | hfst-proc test.hfst.ol
^medvedev/medvedev$

The incorrectly analysed form

  • $ echo "erke medvedev" | hfst-proc test.hfst.ol
^erke medvedev/*erke medvedev$

Expected output

This form is analysed correctly by a transducer identical to the one above except with the "erke me" form removed:

  • $ echo "erke medvedev" | hfst-proc test2.hfst.ol
^erke/erke$ ^medvedev/medvedev$
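The build steps for the comparison transducer are not shown above. A minimal sketch, assuming it is compiled with the same commands as test.hfst.ol; the file names test2.lexc and test2.hfst are assumptions (only test2.hfst.ol appears on this page), and the compile step is skipped if the HFST tools are not installed:

```shell
#!/bin/sh
# Sketch: write test2.lexc, i.e. the lexicon above without the "erke me" entry.
# printf '%s\n' prints each argument literally on its own line, so the
# trailing space in '% ' under Multichar_Symbols is preserved.
printf '%s\n' \
    'Multichar_Symbols' \
    '' \
    '% ' \
    '' \
    'LEXICON Root' \
    '' \
    'erke:erke # ;' \
    'medvedev:medvedev # ;' > test2.lexc

# Compile and test only if the HFST tools are available (same commands as for test.lexc).
if command -v hfst-lexc >/dev/null 2>&1; then
    hfst-lexc test2.lexc -o test2.hfst
    hfst-invert test2.hfst | hfst-fst2fst -w -o test2.hfst.ol
    echo "erke medvedev" | hfst-proc test2.hfst.ol
else
    echo "HFST tools not found; wrote test2.lexc only" >&2
fi
```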

Another test case

This one is meant to be more familiar to English-speakers :)

Multichar_Symbols

% 

LEXICON Root

word:word #;
word% form:word% form #;
formation:formation #;

Compiled to test3.hfst.ol in the same way as above, this gives:

  • $ echo "word formation" | hfst-proc test3.hfst.ol
^word formation/*word formation$
  • $ echo "formation word" | hfst-proc test3.hfst.ol
^formation/formation$ ^word/word$
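When rechecking these cases after a patch, it can help to flag the failing analyses automatically. The following is a rough sketch, not part of HFST: the helper names has_unknown and report are made up for illustration, and the check relies only on the "/*" marker for unknown forms seen in the outputs above. The two analysis lines are copied from this page rather than produced by running hfst-proc.

```shell
#!/bin/sh
# has_unknown LINE: succeed if the analysis line contains a form marked
# as unknown, i.e. the literal two-character sequence "/*".
has_unknown() {
    case $1 in
        *'/*'*) return 0 ;;
        *) return 1 ;;
    esac
}

# report INPUT ANALYSIS: print a one-line verdict for a test case.
report() {
    if has_unknown "$2"; then
        printf 'FAIL  %s -> %s\n' "$1" "$2"
    else
        printf 'ok    %s -> %s\n' "$1" "$2"
    fi
}

# Analyses copied from the test case above.
report "word formation" '^word formation/*word formation$'
report "formation word" '^formation/formation$ ^word/word$'
```

To check a live transducer instead, the second argument can be taken from hfst-proc, e.g. report "$s" "$(echo "$s" | hfst-proc test3.hfst.ol)".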


Notes

  • This doesn't seem to affect transducers written in other formats. E.g., the transducer that results from apertium-eng-kaz.eng.dix outputs the following:
    • $ echo "right there" | apertium -d . eng-kaz-morph
    • ^right there/right there<adv>$^./.<sent>$
    • $ echo "right the" | apertium -d . eng-kaz-morph
    • ^right/right<adj>/right<adv>/right<n><sg>$ ^the/the<det><def><sp>$^./.<sent>$
    • $ echo "right therein" | apertium -d . eng-kaz-morph
    • ^right/right<adj>/right<adv>/right<n><sg>$ ^therein/*therein$^./.<sent>$

Other materials