Ideas for Google Summer of Code/Morphology with HFST

From Apertium
Revision as of 16:17, 13 March 2010 by TommiPirinen (talk | contribs) (backgrounds of hfst-lookup)

This page collects useful information for adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse.
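To make the term concrete: in tokenise-as-you-analyse the analyser itself decides where each token ends, by taking the longest stretch of input it can recognise, instead of relying on a tokeniser that ran beforehand. Below is a minimal sketch of that idea. It is not lttoolbox or HFST code: the function name `tokenise_longest_match` is hypothetical, and a plain `std::set` of surface forms stands in for a real transducer.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a transducer: a plain set of recognised surface forms.
// (Hypothetical helper, for illustration only.)
std::vector<std::string> tokenise_longest_match(const std::string& input,
                                                const std::set<std::string>& lexicon) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Length of the longest lexicon entry that starts at pos.
        std::size_t best = 0;
        for (std::size_t len = 1; pos + len <= input.size(); ++len) {
            if (lexicon.count(input.substr(pos, len)) > 0) {
                best = len;
            }
        }
        // Nothing recognised here: emit one character on its own.
        if (best == 0) {
            best = 1;
        }
        tokens.push_back(input.substr(pos, best));
        pos += best;
    }
    return tokens;
}
```

With a lexicon containing both "in" and "in spite of", the input "in spite of" comes out as a single token, which an external whitespace tokeniser could not produce. A real implementation would walk the transducer state by state as characters arrive rather than probing substrings.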


lttoolbox/fst_processor.cc has the functions analysis and readAnalysis, which go through a stream character by character.

hfst-tools/src/hfst-lookup.cc, however, uses line_to_keyvector (which calls hfst_getline in hfst-commandline.cc) and so goes line by line with getline. It seems like a good idea to change HFST to work on a character-by-character basis at least?

Yes, and there should be two modes: one for analysis and one for generation. We don't really need tokenise-as-you-analyse for generation, but it should deal nicely with Apertium input streams and formatting, superblanks, etc.
I suppose that transforming lookup from a tool that assumes external tokenisation into one that allows pluggable input formatters would be a nice approach. Ideally, one could write an interface that is easily usable in different programs without much porting. The current lookup grew out of the idea of a generic tool that assumes nothing about the format of the input or the intent of the transducer, so extending it to support formats has perhaps complicated the code unnecessarily. In particular, I would guess that implementing tokenise-as-you-analyse requires enough changes to the program logic to warrant a rewrite instead of trying to mend the current version.
$ https://hfst.svn.sourceforge.net/svnroot/hfst


hfst/trunk/hfst/hfst-tools/src

Files:

hfst-lookup.cc 

hfst-optimized-lookup.cc

Lines:

1184                KeyVector* kv = HFST::line_to_keyvector(&line, key_table,
                                                         &markup, &unknown);

1191                 kvs  = HFST::lookup_unique(kv, cascade[0],
                                                    key_table, &infinite);

707     lookups = lookup_all(t, kv, &flag_diacritic_set);

591 KeyVector* 
592 line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)

692 KeyVectorSet*
693 lookup_unique(KeyVector* kv, TransducerHandle t,
694               KeyTable* kt, bool* infinity)

hfst2/string/string.cc

Lines:
54      stringUtf8ToKeyVector(const string& s, KeyTable* kt, bool addUnknown)


hfst2/src

hfst2/sfst/hsfst.C:

247 KeyVectorVector * lookup_all(TransducerHandle t,
248                              KeyVector * input_string,
249                              KeySet * skip_symbols) 

258        return find_all_output_strings( pT,input_string, &ks);

288  KeyVectorVector * find_all_output_strings( Transducer * t,
                                             KeyVector * input,
                                             KeySet * skip_symbols) 


295      find_all_continuations(start, input_position,
                             last_input_position,
                             skip_symbols);

149   find_all_continuations(Node * n,
                         KeyVector::iterator input_position,
                         KeyVector::iterator input_end_position,
                         KeySet * skip_symbols,
                         bool preserve_epsilons=false)