User:Firespeaker/HFST bug

From Apertium
Revision as of 06:33, 27 May 2021 by Tino Didriksen (talk | contribs) (Freenode -> OFTC)

In 2011, a bug in how HFST handles words containing spaces was documented and resolved (apparently in r1518?), but the fix introduced a new bug. This page documents the new [incorrect!] behaviour, which appears to affect only transducers written in lexc.

A bug report was filed in January of 2013 along with a patch for a test case. In March of 2013, spectie posted a patch that fixed the bug but introduced an issue with newlines and full stops. As of today, the bug has still not been fixed.

test.lexc

Make sure to include the space in '% ' under Multichar_Symbols.

Multichar_Symbols

% 

LEXICON Root

erke:erke # ;
erke% me:erke% me # ;
medvedev:medvedev # ;

Compiling

  1. $ hfst-lexc test.lexc -o test.hfst
  2. $ hfst-invert test.hfst | hfst-fst2fst -w -o test.hfst.ol
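
The trailing space after `%` is easy to lose when copying the lexc source by hand, so the file can also be generated by a script. The following is a minimal sketch, assuming a POSIX shell. The compile commands are the two listed above, and the `command -v` guard (an addition here, not part of the original report) skips them when the HFST tools are not installed.

```shell
#!/bin/sh
# Write test.lexc line by line with printf so that the space after '%'
# under Multichar_Symbols is kept exactly as required.
printf '%s\n' \
  'Multichar_Symbols' \
  '' \
  '% ' \
  '' \
  'LEXICON Root' \
  '' \
  'erke:erke # ;' \
  'erke% me:erke% me # ;' \
  'medvedev:medvedev # ;' > test.lexc

# Compile only when hfst-lexc is available on the PATH.
if command -v hfst-lexc >/dev/null 2>&1; then
  hfst-lexc test.lexc -o test.hfst
  hfst-invert test.hfst | hfst-fst2fst -w -o test.hfst.ol
fi
```

Running `grep -nx '% ' test.lexc` afterwards is a quick way to confirm that the multichar symbol line still ends in a space.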

Testing

Some correctly analysed forms

  • $ echo "erke" | hfst-proc test.hfst.ol
^erke/erke$
  • $ echo "erke me" | hfst-proc test.hfst.ol
^erke me/erke me$
  • $ echo "medvedev" | hfst-proc test.hfst.ol
^medvedev/medvedev$

The incorrectly analysed form

  • $ echo "erke medvedev" | hfst-proc test.hfst.ol
^erke medvedev/*erke medvedev$

Expected output

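The input should be split into two known words rather than returned as a single unknown (*-marked) form. This is what a transducer identical to the one above, except with the "erke me" entry removed (test2.hfst.ol here), produces: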

  • $ echo "erke medvedev" | hfst-proc test2.hfst.ol
^erke/erke$ ^medvedev/medvedev$

Another test case

This one is meant to be more familiar to English speakers :)

Multichar_Symbols

% 

LEXICON Root

word:word #;
word% form:word% form #;
formation:formation #;

  • $ echo "word formation" | hfst-proc test3.hfst.ol
^word formation/*word formation$
  • $ echo "formation word" | hfst-proc test3.hfst.ol
^formation/formation$ ^word/word$
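
The test cases on this page can be turned into a small regression check by looking for the unknown-word marker, that is, a `*` directly after the `/` in the analyser output. This is a sketch only: `has_unknown` is a hypothetical helper name, the loop runs over the outputs quoted above rather than calling hfst-proc, and escaped characters in the stream (such as `\/`) are not handled.

```shell
#!/bin/sh
# has_unknown ANALYSIS
# Succeeds (exit status 0) if the analyser output contains "/*",
# i.e. at least one form was returned as unknown.
has_unknown() {
  case $1 in
    *'/*'*) return 0 ;;
    *)      return 1 ;;
  esac
}

# The two outputs quoted in this section, used as sample data.
for out in '^word formation/*word formation$' \
           '^formation/formation$ ^word/word$'; do
  if has_unknown "$out"; then
    echo "FAIL $out"
  else
    echo "ok   $out"
  fi
done
```

With the HFST tools installed, the same helper could be applied to live output, for example `has_unknown "$(echo "word formation" | hfst-proc test3.hfst.ol)"`.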


Notes

  • This doesn't seem to affect transducers written in other formats. E.g., the transducer that results from apertium-eng-kaz.eng.dix outputs the following:
    • $ echo "right there" | apertium -d . eng-kaz-morph
    • ^right there/right there<adv>$^./.<sent>$
    • $ echo "right the" | apertium -d . eng-kaz-morph
    • ^right/right<adj>/right<adv>/right<n><sg>$ ^the/the<det><def><sp>$^./.<sent>$
    • $ echo "right therein" | apertium -d . eng-kaz-morph
    • ^right/right<adj>/right<adv>/right<n><sg>$ ^therein/*therein$^./.<sent>$

Other materials