Ideas for Google Summer of Code/Morphology with HFST


This page collects useful information for adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse.


lttoolbox/fst_processor.cc has the functions analysis and readAnalysis, which go through a stream character by character.

hfst-tools/src/hfst-lookup.cc, however, uses line_to_keyvector (which calls hfst_getline in hfst-commandline.cc) and goes through the input line by line with getline. It seems like a good idea to change HFST so that it at least works character by character?

Yes, and there should be two modes: one for analysis and one for generation. Generation does not really need tokenise-as-you-analyse, but it should still deal nicely with Apertium input streams and formatting (superblanks, etc.).
$ https://hfst.svn.sourceforge.net/svnroot/hfst


hfst/trunk/hfst/hfst-tools/src

Files:

hfst-lookup.cc 

hfst-optimized-lookup.cc

Lines:

1184                KeyVector* kv = HFST::line_to_keyvector(&line, key_table,
                                                         &markup, &unknown);

1191                 kvs  = HFST::lookup_unique(kv, cascade[0],
                                                    key_table, &infinite);

707     lookups = lookup_all(t, kv, &flag_diacritic_set);

591 KeyVector* 
592 line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)

692 KeyVectorSet*
693 lookup_unique(KeyVector* kv, TransducerHandle t,
694               KeyTable* kt, bool* infinity)

hfst2/src

hfst2/sfst/hsfst.C:

247 KeyVectorVector * lookup_all(TransducerHandle t,
248                              KeyVector * input_string,
249                              KeySet * skip_symbols) 

258        return find_all_output_strings( pT,input_string, &ks);

288  KeyVectorVector * find_all_output_strings( Transducer * t,
                                             KeyVector * input,
                                             KeySet * skip_symbols) 


295      find_all_continuations(start, input_position,
                             last_input_position,
                             skip_symbols);

149   find_all_continuations(Node * n,
                         KeyVector::iterator input_position,
                         KeyVector::iterator input_end_position,
                         KeySet * skip_symbols,
                         bool preserve_epsilons=false)
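
The call chain above ends in a function that walks the transducer from a node while advancing through the input. The following toy sketch shows that kind of depth-first search over a hand-built transducer, to illustrate that such a search consumes its input one symbol at a time. It is not the HFST or SFST implementation: Arc, Node, find_continuations_toy and lookup_all_toy are hypothetical names, symbols are plain chars instead of keys, an input of '\0' stands for epsilon, there is no skip-symbol or flag-diacritic handling, and the transducer is assumed to have no epsilon cycles (the sketch would not terminate on one).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy transducer; not the HFST or SFST data structures.
struct Arc {
    char input;           // '\0' means the arc consumes no input (epsilon)
    std::string output;   // text emitted when the arc is taken
    std::size_t target;   // index of the destination node
};

struct Node {
    bool is_final;
    std::vector<Arc> arcs;
};

// Depth-first search: collect the output of every path that consumes
// exactly input[pos..] and ends in a final node.
// Assumption: the transducer has no cycles of epsilon-input arcs.
static void find_continuations_toy(const std::vector<Node>& nodes,
                                   std::size_t state,
                                   const std::string& input,
                                   std::size_t pos,
                                   std::string& path,
                                   std::vector<std::string>& results) {
    if (pos == input.size() && nodes[state].is_final) {
        results.push_back(path);
    }
    for (const Arc& arc : nodes[state].arcs) {
        const bool epsilon = (arc.input == '\0');
        if (!epsilon && (pos == input.size() || input[pos] != arc.input)) {
            continue;  // this arc does not match the next input symbol
        }
        const std::size_t saved = path.size();
        path += arc.output;
        find_continuations_toy(nodes, arc.target, input,
                               epsilon ? pos : pos + 1, path, results);
        path.resize(saved);  // backtrack before trying the next arc
    }
}

// Convenience wrapper: start at node 0 with an empty output path.
std::vector<std::string> lookup_all_toy(const std::vector<Node>& nodes,
                                        const std::string& input) {
    std::vector<std::string> results;
    std::string path;
    if (!nodes.empty()) {
        find_continuations_toy(nodes, 0, input, 0, path, results);
    }
    return results;
}
```

In this sketch the search only ever looks at the single symbol input[pos], so the same loop could in principle be driven by characters pulled from a stream rather than from a vector built out of a whole line.
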