Ideas for Google Summer of Code/Morphology with HFST

From Apertium

This page collects information useful for adapting the HFST lookup tools to use lttoolbox-style tokenise-as-you-analyse[1].

Our HFST lookup tool should also handle Superblanks and the Apertium stream format correctly.
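
As a rough illustration of what handling superblanks involves, the sketch below splits deformatted input into superblanks (square-bracketed formatting that must pass through untouched) and ordinary text that would go to the analyser. The names Segment and split_superblanks are hypothetical and are not part of HFST or lttoolbox; the sketch assumes superblanks do not nest and that a backslash escapes the following character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One piece of input: either a superblank (formatting that is passed
// through untouched) or ordinary text that should go to the analyser.
// Hypothetical type, for illustration only.
struct Segment {
    bool is_superblank;
    std::string text;
};

// Split deformatted input into superblanks and text, character by character.
// A backslash escapes the next character, so "\[" does not open a superblank.
std::vector<Segment> split_superblanks(const std::string& input) {
    std::vector<Segment> out;
    std::string current;
    bool in_blank = false;
    for (std::size_t i = 0; i < input.size(); ++i) {
        char c = input[i];
        if (c == '\\' && i + 1 < input.size()) {
            current += c;            // keep the escape itself
            current += input[++i];   // and the escaped character
        } else if (c == '[' && !in_blank) {
            if (!current.empty()) out.push_back({false, current});
            current = "[";
            in_blank = true;
        } else if (c == ']' && in_blank) {
            assert(current.front() == '[');  // a superblank starts with '['
            current += c;
            out.push_back({true, current});
            current.clear();
            in_blank = false;
        } else {
            current += c;
        }
    }
    // Trailing text; an unterminated superblank is still flagged as one.
    if (!current.empty()) out.push_back({in_blank, current});
    return out;
}
```

For example, split_superblanks("[<b>]hello[</b>] world") yields four segments: the superblank [<b>], the text hello, the superblank [</b>], and the text " world". Only the text segments would be handed to the transducer.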

The difference between Standard and Inconditional sections can be handled by simply accepting two FSTs as command-line arguments, the second typically containing all the punctuation.

If you are interested in this as a GSoC project, first look at how lt-proc works (e.g. by reading the first part of the Apertium New Language Pair HOWTO). Ideally we would have an hfst-proc program that is used in much the same way as lt-proc. HFST already has a program, hfst-lookup, but it requires each input word to be on a separate line, whereas lt-proc finds word boundaries based on the <alphabet> section of the dictionary (non-alphabet characters always separate words).
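
The boundary rule described above can be sketched as a left-to-right longest match. This is a toy model only: the Lexicon map is a hypothetical stand-in for a compiled transducer, the alphabet is a plain string of single-byte characters, and the function name analyse is made up for this example. A real implementation would walk transducer states instead of testing substrings. Output uses the Apertium stream format, with unknown words marked by an asterisk.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a compiled transducer: surface form -> analysis.
using Lexicon = std::map<std::string, std::string>;

static bool in_alphabet(char c, const std::string& alphabet) {
    return alphabet.find(c) != std::string::npos;
}

// Tokenise and analyse in one pass: at each position take the longest
// lexicon entry that ends on a word boundary.
std::string analyse(const std::string& in, const Lexicon& lex,
                    const std::string& alphabet) {
    assert(!alphabet.empty());  // without an alphabet there are no words
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t best = 0;
        for (std::size_t len = 1; i + len <= in.size(); ++len) {
            std::size_t end = i + len;
            // A match may not stop in the middle of a run of alphabetic
            // characters.
            bool boundary = end == in.size()
                         || !in_alphabet(in[end], alphabet)
                         || !in_alphabet(in[end - 1], alphabet);
            if (boundary && lex.count(in.substr(i, len))) best = len;
        }
        if (best > 0) {
            std::string surface = in.substr(i, best);
            out += "^" + surface + "/" + lex.at(surface) + "$";
            i += best;
        } else if (in_alphabet(in[i], alphabet)) {
            // Unknown word: consume the whole run of alphabetic characters.
            std::size_t end = i;
            while (end < in.size() && in_alphabet(in[end], alphabet)) ++end;
            std::string word = in.substr(i, end - i);
            out += "^" + word + "/*" + word + "$";
            i = end;
        } else {
            out += in[i++];  // non-alphabetic character: copy through
        }
    }
    return out;
}
```

With the entries "a", "lot" and the multiword "a lot" in the lexicon, the input "a lot" is analysed as the single unit "a lot" because the longest match wins, while "alot" is reported as one unknown word because no entry ends on a boundary inside it. hfst-lookup cannot make this choice, since it expects the tokenisation to have been done beforehand.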

Some notes/discussion

lttoolbox/ has the functions analysis and readAnalysis, which go character by character through a stream.

hfst-tools/src/, however, uses line_to_keyvector (calling hfst_getline), going line by line with getline. It seems like a good idea to change HFST to work on a character-by-character basis at least? --Unhammer

Yes, and there should be two modes: one for analysis and one for generation. We don't really need tokenise-as-you-analyse for generation, but it should deal nicely with Apertium input streams and formatting, superblanks etc. --Preceding unsigned comment added by Francis Tyers
I suppose that transforming lookup from a tool that assumes external tokenization into one that allows pluggable input formatters would be a nice approach. Ideally it would be possible to write an interface that is easily usable in different programs without much porting. The current lookup grew out of the idea of a generic tool that assumes nothing about the format of the input or the purpose of the transducer, so extending it to support formats has perhaps complicated the code unnecessarily. In particular, I would guess that implementing tokenize-as-you-analyze would require enough changes to the program logic to warrant a rewrite instead of trying to mend the current version. --Preceding unsigned comment added by TommiPirinen
I think the idea we had was to implement an "hfst-proc", which would basically be a rewrite of lookup that works only with the Apertium stream format, i.e. handling analysis and generation in a similar manner to "lt-proc", superblanks, tokenise-as-you-analyse etc. I think lookup is good for what it is good for, being a generic (as you mentioned) interface to HFST transducers. But it would be good to have something nicely integrated for Apertiumland, much like we did for vislcg3 - Francis Tyers 16:39, 13 March 2010 (UTC)
Yes, that seems like a nice idea, and certainly in line with how most of the other command-line tools in the HFST tool set were designed and came about. TommiPirinen 18:48, 13 March 2010 (UTC)
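
To make the character-by-character approach from the discussion above concrete, here is a minimal sketch of a reader that pulls words off a stream one character at a time instead of waiting for a complete line. The functions next_word and words_in are hypothetical and do not exist in HFST or lttoolbox, and the alphabet is assumed to consist of single-byte characters.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Read the next run of alphabetic characters from `in`, one character at a
// time. Non-alphabetic characters met before the word are copied to `out`.
// Returns an empty string once the input is exhausted.
std::string next_word(std::istream& in, std::ostream& out,
                      const std::string& alphabet) {
    assert(!alphabet.empty());
    std::string word;
    char c;
    while (in.get(c)) {
        bool alpha = alphabet.find(c) != std::string::npos;
        if (alpha) {
            word += c;
        } else if (word.empty()) {
            out << c;      // blank before the word: pass it through
        } else {
            in.unget();    // boundary reached: leave it for the next call
            break;
        }
    }
    return word;
}

// Convenience wrapper for trying the reader on an in-memory string.
// `blanks` receives everything that was passed through unchanged.
std::vector<std::string> words_in(const std::string& text,
                                  const std::string& alphabet,
                                  std::string& blanks) {
    std::istringstream in(text);
    std::ostringstream out;
    std::vector<std::string> words;
    for (std::string w = next_word(in, out, alphabet); !w.empty();
         w = next_word(in, out, alphabet)) {
        words.push_back(w);
    }
    blanks = out.str();
    return words;
}
```

Because a newline is treated like any other non-alphabetic character, a word is available as soon as its boundary has been read, which is what a tool sitting in the middle of an Apertium pipeline needs.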

Some interesting places in the HFST code





KeyVector* kv = HFST::line_to_keyvector(&line, key_table,
                                        &markup, &unknown);

kvs = HFST::lookup_unique(kv, cascade[0],
                          key_table, &infinite);

lookups = lookup_all(t, kv, &flag_diacritic_set);

KeyVector*
line_to_keyvector(char** s, KeyTable* kt, char** markup, bool* outside_sigma)

KeyVectorSet*
lookup_unique(KeyVector* kv, TransducerHandle t,
              KeyTable* kt, bool* infinity)

stringUtf8ToKeyVector(const string& s, KeyTable* kt, bool addUnknown)



KeyVectorVector * lookup_all(TransducerHandle t,
                             KeyVector * input_string,
                             KeySet * skip_symbols)

return find_all_output_strings( pT,input_string, &ks);

KeyVectorVector * find_all_output_strings( Transducer * t,
                                           KeyVector * input,
                                           KeySet * skip_symbols)

find_all_continuations(start, input_position,

find_all_continuations(Node * n,
                       KeyVector::iterator input_position,
                       KeyVector::iterator input_end_position,
                       KeySet * skip_symbols,
                       bool preserve_epsilons=false)