Talk:Ideas for Google Summer of Code/Closer integration with HFST

From Apertium

HFST bug

<spectie> because of the tokenisation problem in hfst-proc
<firespeaker> At revision 39205.
<spectie> that i've told you about a million times :P
<firespeaker> okay, I know you've said there is a problem
<firespeaker> but I fail to understand what it is or how it is so bad
<spectie> http://sourceforge.net/tracker/?func=detail&aid=3383731&group_id=224521&atid=1061990
<firespeaker> fixed?
<firespeaker> really?
<firespeaker> also, I don't see how that applies here
<spectie> no
<spectie> read it !!
<firespeaker> Айгүл and мені are both recognised by the transducer individually
<spectie> "Fixed in SVN along the latter suggestion, better control over it can be
<spectie> implemented later.
<spectie> "
<firespeaker> Fixed in SVN along the latter suggestion, better control over it can be
<firespeaker> implemented later.
<spectie> the bug was "losing a word"
<spectie> now the word is not lost, it is just not analysed
* firespeaker is lost again
<firespeaker> what is the bug exactly?
<firespeaker> what triggers it?
<spectie> ok
<spectie> let me try and explain
<spectie> it happens when you have a multiword with a space, or a clitic with a space
<spectie> what happens is that the transducer reads past the space, but if the whole form is not found, it keeps reading until the next space and treats the whole section as an unknown word
<spectie> for example, if you have "foo% bar" in your lexicon
<spectie> and you have "foo" and "bar" in your lexicon
<spectie> if you try and analyse "foo bare" you will get "^*foo bare$"
<spectie> the bug is about a problem where we were getting "^*foo$" and the second part was lost
<spectie> what we should get is ^foo$ ^*bare$
<spectie> where it backtracks to the last space
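
(Editorial note: the foo/bar example above can be sketched in a few lines of Python. This is only a toy model of the behaviour spectie describes, not hfst-proc's actual code. The names LEXICON, tokenise_current and tokenise_backtracking are hypothetical, the input is assumed to be pre-split on spaces, and analyses are left out so that the output looks like the strings quoted in the log.)

```python
# Toy lexicon: "foo bar" stands in for the "foo% bar" multiword entry.
LEXICON = {"foo bar", "foo", "bar"}


def could_continue(text):
    """True if some lexicon entry starts with `text` (which ends in a space)."""
    return any(entry.startswith(text) for entry in LEXICON)


def mark(span):
    """Format a span as known (^...$) or unknown (^*...$)."""
    return f"^{span}$" if span in LEXICON else f"^*{span}$"


def read_ahead(words, i):
    """Read past spaces while a multiword entry could still match."""
    j = i + 1
    while j < len(words) and could_continue(" ".join(words[i:j]) + " "):
        j += 1
    return j


def tokenise_current(words):
    """Behaviour described in the log: a failed multiword match
    turns everything read so far into one unknown word."""
    out, i = [], 0
    while i < len(words):
        j = read_ahead(words, i)
        out.append(mark(" ".join(words[i:j])))
        i = j
    return out


def tokenise_backtracking(words):
    """Wanted behaviour: on failure, back off to the last space
    at which a known form ended."""
    out, i = [], 0
    while i < len(words):
        j = read_ahead(words, i)
        while j > i + 1 and " ".join(words[i:j]) not in LEXICON:
            j -= 1  # give back one word and try the shorter span
        out.append(mark(" ".join(words[i:j])))
        i = j
    return out


print(tokenise_current("foo bare".split()))       # ['^*foo bare$']
print(tokenise_backtracking("foo bare".split()))  # ['^foo$', '^*bare$']
```

(In this model both functions agree on "foo bar" itself, which comes out as ^foo bar$. They differ only when the multiword match fails after the space has already been read.)
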
<firespeaker> hrm
<spectie> is that clearer ?
<firespeaker> I think so, but I don't see how that applies here
<firespeaker> there is no Айгүл% мені in the lexicon
<spectie> Exactly!!
<spectie> but there is
<spectie> Айгүл% мен
<firespeaker> hrm
<spectie> no
<spectie> t$ echo "Айгүл ме" | hfst-proc kaz-tat.automorf.hfst 
<spectie> ^Айгүл ме/Айгүл<np><ant><f><nom>+ма<qst>$
<spectie>  
<spectie> here
<firespeaker> so should I update hfst?
<spectie> no
<spectie> because the behaviour is the same
<firespeaker> uhm
<spectie> only the original bug has been fixed
<spectie> the tokenisation is still the same tokenisation
<firespeaker> (04:44:24) spectie: $ echo "Айгүл сені іздеп жатыр" | apertium -d . kaz-tat-transfer
<firespeaker> (04:44:24) spectie: ^Гөлнара<np><ant><f><nom>$ ^син<prn><pers><p2><sg><acc>$ ^@ізде<v><tv><gna9>$ ^ят<vaux><pres><p3><sg>$^.<sent>$
<firespeaker> that's not what I'm getting.....
<spectie> i don't know why that is :/
<firespeaker> so this is two different bugs?
<spectie> you could try updating hfst
<spectie>  
<spectie> <firespeaker> $ echo "Айгүл мені іздеп жатыр" | apertium -d . kaz-tat-tagger
<spectie> <firespeaker> ^*Айгүл мені$ ^ізде<v><tv><gna9>$ ^жат<vaux><pres><p3><sg>$^.<sent>$
<spectie>  
<spectie> this is the bug i have been talking about 
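
(Editorial note: one rough way to spot this symptom in tagger output is to look for unknown units, marked with ^*, that contain a space. The sketch below assumes the simple stream format seen in this log and does not handle escaped characters. The function name unknown_spans_with_space is hypothetical.)

```python
import re

# An unknown lexical unit (^*...$) whose surface form contains whitespace.
UNKNOWN_WITH_SPACE = re.compile(r"\^\*([^$^/]*\s[^$^/]*)\$")


def unknown_spans_with_space(stream):
    """Return the surface forms of unknown units that span a space."""
    return UNKNOWN_WITH_SPACE.findall(stream)


tagger_output = (
    "^*Айгүл мені$ ^ізде<v><tv><gna9>$ "
    "^жат<vaux><pres><p3><sg>$^.<sent>$"
)
print(unknown_spans_with_space(tagger_output))  # ['Айгүл мені']
```
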
<firespeaker> what's the one you linked to?
<spectie> that one 
<firespeaker> (04:51:27) spectie: only the original bug has been fixed
<firespeaker> (04:51:32) spectie: the tokenisation is still the same tokenisation
<spectie> :(
<firespeaker> I'm confused again
<spectie> i don't know how to explain it better
<spectie> :(
<spectie> i have 7 minutes
<firespeaker> so the bug has been fixed or not?
<spectie> one bug
<spectie> it was fixed, but the new behaviour was broken
<firespeaker> what's the second bug?
<spectie> the new behaviour has not been fixed
<firespeaker> oh
<firespeaker> I see
<spectie> it's all in the original report
<firespeaker> yes
<firespeaker> I was misunderstanding
<spectie> ah
<firespeaker> so the new behaviour needs to be fixed
<spectie> yes
<firespeaker> seems like a pretty critical bug
<firespeaker> I keep tripping over it accidentally
<firespeaker> bugs should be too small to trip over
<spectie> well i agree
<spectie> but there isn't really anything we can do about it 
<spectie> although if i counted up all the time that i spent on explaining the bug to people
<spectie> i could probably have solved it