Ideas for Google Summer of Code/Morphology with HFST
		
		
		
		
		
		
This page collects useful information on adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse.
lttoolbox/fst_processor.cc has the functions analysis and readAnalysis, which go through a stream character by character.
hfst-tools/src/hfst-lookup.cc, however, uses line_to_keyvector (calling hfst_getline in hfst-commandline.cc), which goes line by line with getline. It seems like a good idea to change HFST to work on a character-by-character basis, at least?
- Yes, and there should be two modes, one for analysis, and one for generation. We don't really need tokenise-as-you-analyse for generation, but it should deal nicely with Apertium input streams and formatting, superblanks etc.
$ https://hfst.svn.sourceforge.net/svnroot/hfst
hfst/trunk/hfst/hfst-tools/src
Files:
hfst-lookup.cc 
hfst-optimized-lookup.cc
Lines:
1184                KeyVector* kv = HFST::line_to_keyvector(&line, key_table,
                                                         &markup, &unknown);
1191                 kvs  = HFST::lookup_unique(kv, cascade[0],
                                                    key_table, &infinite);
707     lookups = lookup_all(t, kv, &flag_diacritic_set);
591 KeyVector* 
592 line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)
692 KeyVectorSet*
693 lookup_unique(KeyVector* kv, TransducerHandle t,
694               KeyTable* kt, bool* infinity)
hfst2/string/string.cc
Lines:
54      stringUtf8ToKeyVector(const string& s, KeyTable* kt, bool addUnknown)
hfst2/src
hfst2/sfst/hsfst.C:
247 KeyVectorVector * lookup_all(TransducerHandle t,
248                              KeyVector * input_string,
249                              KeySet * skip_symbols) 
258        return find_all_output_strings( pT,input_string, &ks);
288  KeyVectorVector * find_all_output_strings( Transducer * t,
                                             KeyVector * input,
                                             KeySet * skip_symbols) 
295      find_all_continuations(start, input_position,
                             last_input_position,
                             skip_symbols);
149   find_all_continuations(Node * n,
                         KeyVector::iterator input_position,
                         KeyVector::iterator input_end_position,
                         KeySet * skip_symbols,
                         bool preserve_epsilons=false) 
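The call chain above (lookup_all, then find_all_output_strings, then find_all_continuations) suggests a recursive search that consumes the input one position at a time. The following toy C++ sketch shows that general shape only. All types and functions here (`ToyArc`, `ToyNode`, `ToyTransducer`, `toy_find_all_continuations`, `toy_lookup_all`) are made up for illustration; the real HFST/SFST classes and their behaviour differ, and the sketch ignores epsilons and skip symbols.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy types for illustration only; the real Node, KeyVector and KeySet
// classes in HFST/SFST are different.
struct ToyArc { char in; std::string out; int target; };
struct ToyNode { bool is_final; std::vector<ToyArc> arcs; };
using ToyTransducer = std::vector<ToyNode>;   // node 0 is the start state

// Depth-first search: follow every arc whose input symbol matches the
// current input character, accumulating output, and record the output
// when the input is exhausted in a final state. No epsilon handling.
void toy_find_all_continuations(const ToyTransducer& t, int node,
                                std::string::const_iterator pos,
                                std::string::const_iterator end,
                                std::string& output_so_far,
                                std::vector<std::string>& results) {
    if (pos == end) {
        if (t[node].is_final) results.push_back(output_so_far);
        return;
    }
    for (const ToyArc& arc : t[node].arcs) {
        if (arc.in != *pos) continue;
        const std::size_t saved = output_so_far.size();
        output_so_far += arc.out;
        toy_find_all_continuations(t, arc.target, pos + 1, end,
                                   output_so_far, results);
        output_so_far.resize(saved);          // backtrack before trying the next arc
    }
}

// Entry point: collect every output string for the whole input.
std::vector<std::string> toy_lookup_all(const ToyTransducer& t,
                                        const std::string& input) {
    std::vector<std::string> results;
    std::string output;
    toy_find_all_continuations(t, 0, input.begin(), input.end(), output, results);
    return results;
}
```

Note that this search only succeeds when the whole input is consumed. For tokenise-as-you-analyse, the search would presumably need to remember the last input position at which a final state was reached, so that the longest matching prefix can be emitted and analysis can restart from there.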

