Ideas for Google Summer of Code/Morphology with HFST
This page collects useful information on adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse.
lttoolbox/fst_processor.cc has the functions analysis and readAnalysis; these go through a stream character by character.
hfst-tools/src/hfst-lookup.cc, however, uses line_to_keyvector (calling hfst_getline in hfst-commandline.cc), going line by line with getline. It seems like a good idea to change hfst to work on a char-by-char basis at least?
- Yes, and there should be two modes, one for analysis, and one for generation. We don't really need tokenise-as-you-analyse for generation, but it should deal nicely with Apertium input streams and formatting, superblanks etc.
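As a rough illustration of reading char by char while leaving Apertium formatting alone, here is a minimal sketch. It assumes the usual stream conventions: superblanks are wrapped in square brackets and a backslash escapes the next character. The names Chunk and split_stream are made up for this example and are not part of HFST or lttoolbox.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One piece of an Apertium-style stream: either a superblank ("[...]",
// to be passed through untouched) or ordinary text to be analysed.
// Hypothetical type, for illustration only.
struct Chunk {
    bool is_superblank;
    std::string text;
};

// Read the stream one char at a time instead of line by line.
// A backslash escapes the next char, so "\[" stays literal text
// rather than opening a superblank.
std::vector<Chunk> split_stream(std::istream& in) {
    std::vector<Chunk> chunks;
    std::string buf;
    bool in_blank = false;
    char c;
    while (in.get(c)) {
        if (c == '\\') {
            buf += c;
            if (in.get(c)) buf += c;   // keep the escaped char verbatim
        } else if (!in_blank && c == '[') {
            if (!buf.empty()) chunks.push_back({false, buf});
            buf = "[";
            in_blank = true;
        } else if (in_blank && c == ']') {
            buf += c;
            chunks.push_back({true, buf});
            buf.clear();
            in_blank = false;
        } else {
            buf += c;
        }
    }
    // Flush whatever is left; an unterminated superblank is kept as one.
    if (!buf.empty()) chunks.push_back({in_blank, buf});
    return chunks;
}
```

Only the chunks with is_superblank set to false would be handed to the analyser; the others would be copied to the output unchanged.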
$ https://hfst.svn.sourceforge.net/svnroot/hfst

hfst/trunk/hfst/hfst-tools/src
Files: hfst-lookup.cc hfst-optimized-lookup.cc
Lines:
 1184  KeyVector* kv = HFST::line_to_keyvector(&line, key_table, &markup, &unknown);
 1191  kvs = HFST::lookup_unique(kv, cascade[0], key_table, &infinite);
  707  lookups = lookup_all(t, kv, &flag_diacritic_set);
  591  KeyVector*
  592  line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)
  692  KeyVectorSet*
  693  lookup_unique(KeyVector* kv, TransducerHandle t,
  694                KeyTable* kt, bool* infinity)

hfst2/string/string.cc
Lines:
   54  stringUtf8ToKeyVector(const string& s, KeyTable* kt, bool addUnknown)

hfst2/src
hfst2/sfst/hsfst.C:
  247  KeyVectorVector * lookup_all(TransducerHandle t,
  248                               KeyVector * input_string,
  249                               KeySet * skip_symbols)
  258  return find_all_output_strings( pT,input_string, &ks);
  288  KeyVectorVector * find_all_output_strings( Transducer * t, KeyVector * input, KeySet * skip_symbols)
  295  find_all_continuations(start, input_position, last_input_position, skip_symbols);
  149  find_all_continuations(Node * n, KeyVector::iterator input_position, KeyVector::iterator input_end_position, KeySet * skip_symbols, bool preserve_epsilons=false)
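
The functions listed above all take a complete KeyVector as input. A tokenise-as-you-analyse loop would instead advance one symbol at a time and remember the last final state it passed, so it can emit the longest match. The sketch below shows the idea on a toy trie. It is not HFST or lttoolbox code: Node, add_entry and analyse are hypothetical names, and it ignores details such as flag diacritics or requiring a match to end at a word boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a transducer: a trie whose final nodes carry an analysis.
struct Node {
    std::map<char, std::unique_ptr<Node>> next;
    std::string analysis;   // non-empty marks a final state
};

// Add one surface form and its analysis string to the trie.
void add_entry(Node& root, const std::string& surface, const std::string& analysis) {
    Node* n = &root;
    for (char c : surface) {
        std::unique_ptr<Node>& child = n->next[c];
        if (!child) child.reset(new Node);
        n = child.get();
    }
    n->analysis = analysis;
}

// Left-to-right longest match: walk the trie char by char, remember the
// last final state seen, and fall back to it once no transition is left.
std::vector<std::string> analyse(const Node& root, const std::string& input) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const Node* n = &root;
        std::size_t best_len = 0;
        std::string best;
        for (std::size_t i = pos; i < input.size(); ++i) {
            auto it = n->next.find(input[i]);
            if (it == n->next.end()) break;
            n = it->second.get();
            if (!n->analysis.empty()) {
                best_len = i - pos + 1;
                best = n->analysis;
            }
        }
        if (best_len == 0) {
            // No entry starts here: pass the char through unanalysed.
            out.push_back(std::string(1, input[pos]));
            pos += 1;
        } else {
            out.push_back("^" + input.substr(pos, best_len) + "/" + best + "$");
            pos += best_len;
        }
    }
    return out;
}
```

With entries for "a", "a lot" and "cat", the input "a lot cat" is tokenised with "a lot" as a single unit, while "a cat" falls back to the shorter match "a" after the longer path fails.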