Ideas for Google Summer of Code/Morphology with HFST


This page collects information useful for adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse[1].


HFST should also handle Superblanks and the Apertium stream format correctly.
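
To illustrate what handling superblanks involves, here is a minimal sketch that splits an input stream into superblanks and analysable text while reading one character at a time. It is not code from HFST or lttoolbox: the names Chunk and split_superblanks are hypothetical, and the sketch assumes that superblanks are delimited by unescaped square brackets and that a backslash escapes the following character.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One piece of the input: either a superblank (formatting that is passed
// through untouched) or ordinary text that should be analysed.
struct Chunk {
    bool is_superblank;
    std::string text;
};

// Hypothetical helper (not an HFST or lttoolbox function): reads the stream
// one character at a time and separates [superblanks] from other text.
// Assumption: a backslash escapes the next character, so "\[" is plain text.
std::vector<Chunk> split_superblanks(std::istream& in) {
    std::vector<Chunk> chunks;
    std::string current;
    bool in_blank = false;
    char c;
    while (in.get(c)) {
        if (c == '\\') {
            // Keep the backslash and the escaped character together.
            current += c;
            if (in.get(c)) current += c;
        } else if (c == '[' && !in_blank) {
            // A superblank starts: flush any pending text first.
            if (!current.empty()) chunks.push_back({false, current});
            current = "[";
            in_blank = true;
        } else if (c == ']' && in_blank) {
            current += c;
            assert(current.front() == '[');
            chunks.push_back({true, current});
            current.clear();
            in_blank = false;
        } else {
            current += c;
        }
    }
    // Leftover input; an unterminated superblank is still marked as one.
    if (!current.empty()) chunks.push_back({in_blank, current});
    return chunks;
}

// Convenience overload for in-memory strings.
std::vector<Chunk> split_superblanks(const std::string& s) {
    std::istringstream in(s);
    return split_superblanks(in);
}
```

With the input `[<b>]foo bar[</b>]`, this yields three chunks: the superblank `[<b>]`, the text `foo bar`, and the superblank `[</b>]`. Only the text chunk would be handed to the transducer; the superblanks would be copied to the output unchanged.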


lttoolbox/fst_processor.cc has the functions analysis and readAnalysis, which go through a stream character by character.

hfst-tools/src/hfst-lookup.cc, however, uses line_to_keyvector (calling hfst_getline in hfst-commandline.cc) and goes line by line with getline. It seems like a good idea to change HFST to work character by character, at least? - Preceding unsigned comment added by Unhammer (talkcontribs)

Yes, and there should be two modes: one for analysis and one for generation. We don't really need tokenise-as-you-analyse for generation, but it should deal nicely with Apertium input streams and formatting, superblanks, etc. - Preceding unsigned comment added by Francis Tyers (talkcontribs)
I suppose that transforming lookup from a tool that assumes external tokenization into one that allows pluggable input formatters would be a nice approach. Ideally, it would be possible to write an interface that is easily usable in different programs without much porting. The current lookup grew out of the idea of a generic tool that assumes nothing about the format of the input or the intent of the transducer, so extending it to support formats has perhaps complicated the code unnecessarily. In particular, I would guess that implementing tokenize-as-you-analyze would require enough changes to the program logic to warrant a rewrite rather than trying to mend the current version. - Preceding unsigned comment added by TommiPirinen (talkcontribs)
I think the idea we had was to implement an "hfst-proc", which would basically be a rewrite of lookup that just works with the Apertium stream format, e.g. handling analysis and generation in a similar manner to "lt-proc", superblanks, tokenise-as-you-analyse, etc. I think lookup is good at what it is for: being a generic (as you mentioned) interface to HFST transducers. But it would be good to have something nicely integrated for Apertiumland, much like we did for vislcg3. - Francis Tyers 16:39, 13 March 2010 (UTC)
Yes, that seems like a nice idea, and certainly in line with how most of the other command-line tools in the HFST tool set were designed and came about. TommiPirinen 18:48, 13 March 2010 (UTC)
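
The tokenise-as-you-analyse behaviour discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the design of lt-proc or of any planned hfst-proc: a std::map stands in for the transducer, the names Lexicon, has_prefix, is_letter and analyse_stream are hypothetical, letters are detected with ASCII-only isalpha, and a match is accepted only when it ends at a non-letter or at the end of the input. The output uses the ^surface/analysis$ form, with unknown words marked by an asterisk.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Toy stand-in for a transducer: surface form -> analysis string.
using Lexicon = std::map<std::string, std::string>;

// True if some lexicon key starts with `prefix`, i.e. the "transducer"
// still has a live path after consuming these characters.
bool has_prefix(const Lexicon& lex, const std::string& prefix) {
    auto it = lex.lower_bound(prefix);
    return it != lex.end() && it->first.compare(0, prefix.size(), prefix) == 0;
}

// ASCII-only letter test; a real tool would use the dictionary's alphabet.
bool is_letter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Walks the input character by character, always emitting the longest
// lexicon entry that ends at a token boundary (left-to-right longest match).
std::string analyse_stream(const Lexicon& lex, const std::string& input) {
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::string candidate;
        std::size_t best_len = 0;
        for (std::size_t i = pos; i < input.size(); ++i) {
            candidate += input[i];
            if (!has_prefix(lex, candidate)) break;  // no path left
            bool boundary = (i + 1 == input.size()) || !is_letter(input[i + 1]);
            if (boundary && lex.count(candidate)) best_len = candidate.size();
        }
        if (best_len > 0) {
            // Known token, possibly a multiword containing spaces.
            std::string surface = input.substr(pos, best_len);
            out += "^" + surface + "/" + lex.at(surface) + "$";
            pos += best_len;
        } else if (is_letter(input[pos])) {
            // Unknown word: consume the whole run of letters.
            std::size_t end = pos;
            while (end < input.size() && is_letter(input[end])) ++end;
            assert(end > pos);
            std::string word = input.substr(pos, end - pos);
            out += "^" + word + "/*" + word + "$";
            pos = end;
        } else {
            out += input[pos++];  // blanks and punctuation pass through
        }
    }
    return out;
}
```

For example, with the entries "in", "in front of" and "house", the input `in front of house` comes out as `^in front of/in front of<pr>$ ^house/house<n><sg>$` (given the analyses `in front of<pr>` and `house<n><sg>`): the multiword wins over the shorter match "in" because the tokenisation is decided by the lexicon during analysis rather than by an external tokeniser.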

Some interesting places in the HFST code

$ https://hfst.svn.sourceforge.net/svnroot/hfst


hfst/trunk/hfst/hfst-tools/src

Files:

hfst-lookup.cc 

hfst-optimized-lookup.cc

Lines:

1184                KeyVector* kv = HFST::line_to_keyvector(&line, key_table,
                                                         &markup, &unknown);

1191                 kvs  = HFST::lookup_unique(kv, cascade[0],
                                                    key_table, &infinite);

707     lookups = lookup_all(t, kv, &flag_diacritic_set);

591 KeyVector* 
592 line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)

692 KeyVectorSet*
693 lookup_unique(KeyVector* kv, TransducerHandle t,
694               KeyTable* kt, bool* infinity)

hfst2/string/string.cc

Lines:
54      stringUtf8ToKeyVector(const string& s, KeyTable* kt, bool addUnknown)


hfst2/src

hfst2/sfst/hsfst.C:

247 KeyVectorVector * lookup_all(TransducerHandle t,
248                              KeyVector * input_string,
249                              KeySet * skip_symbols) 

258        return find_all_output_strings( pT,input_string, &ks);

288  KeyVectorVector * find_all_output_strings( Transducer * t,
                                             KeyVector * input,
                                             KeySet * skip_symbols) 


295      find_all_continuations(start, input_position,
                             last_input_position,
                             skip_symbols);

149   find_all_continuations(Node * n,
                         KeyVector::iterator input_position,
                         KeyVector::iterator input_end_position,
                         KeySet * skip_symbols,
                         bool preserve_epsilons=false) 




References