Difference between revisions of "Ideas for Google Summer of Code/Unsupervised weighting of automata"
TommiPirinen (talk | contribs)
Revision as of 17:15, 29 March 2017
Coding challenge
- Install HFST
- Install lttoolbox
- Define an evaluation metric (talk to your mentor)
- Perform a baseline experiment using a tagged corpus:
  - Select a language
  - Split the corpus into 90% training, 10% testing (or use existing test/train split)
  - Use the Apertium morphological analyser to analyse the test data
  - Rank the analyses produced using the training data
  - Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric
  - Weight the analyser automaton based on the tagged corpus
  - Try surface weights, analysis weights, simple combination of the two
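The baseline steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the tagged corpus is taken to be a list of items, each analysis is a plain string, the candidate analyses have already been obtained from the analyser, and top-1 accuracy stands in for whatever metric you agree on with your mentor. The function names `split_corpus`, `rank_analyses` and `top1_accuracy` are hypothetical.

```python
import random
from collections import Counter


def split_corpus(items, test_fraction=0.1, seed=0):
    """Shuffle and split into (train, test); 10% test by default."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rank_analyses(candidates, analysis_counts):
    """Order candidates by training frequency, most frequent first.

    sorted() is stable, so ties keep the default transducer order.
    """
    return sorted(candidates, key=lambda a: -analysis_counts.get(a, 0))


def top1_accuracy(test_items, rank):
    """test_items: iterable of (gold_analysis, candidate_analyses).

    Placeholder metric (an assumption): share of items whose
    top-ranked candidate equals the gold analysis.
    """
    hits = total = 0
    for gold, candidates in test_items:
        ranked = rank(candidates)
        hits += bool(ranked) and ranked[0] == gold
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Toy training data: gold analyses observed in the tagged corpus.
    counts = Counter(["see<vblex><past>", "see<vblex><past>", "saw<n><sg>"])
    test = [("see<vblex><past>", ["saw<n><sg>", "see<vblex><past>"])]
    print(top1_accuracy(test, lambda c: rank_analyses(c, counts)))
    print(top1_accuracy(test, lambda c: c))  # default transducer order
```

Passing a different `rank` function (the identity for the default order, or a shuffle for the "random" ranking) lets the same metric compare all three rankings.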
Hint
To calculate the weights, count each token's frequency and the corpus size. To prepare these for HFST automata, take -log(freq), where freq is the token count divided by the corpus size, and write the results in the format that hfst-strings2fst reads (see the man pages for details).
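As a rough sketch of the counting step, the following Python turns a token list into -log(freq) weights and prints one line per token. The output layout (string, a tab, then the weight) is an assumption about what hfst-strings2fst accepts; check it against the man page. The function names `surface_weights` and `to_strings2fst_lines` are hypothetical.

```python
import math
from collections import Counter


def surface_weights(tokens):
    """Map each token to -log(count / corpus_size)."""
    counts = Counter(tokens)
    corpus_size = sum(counts.values())
    return {tok: -math.log(n / corpus_size) for tok, n in counts.items()}


def to_strings2fst_lines(weights):
    """Render 'string<TAB>weight' lines (assumed input layout)."""
    return [f"{tok}\t{w:.6f}" for tok, w in sorted(weights.items())]


if __name__ == "__main__":
    corpus = "the cat saw the dog".split()
    for line in to_strings2fst_lines(surface_weights(corpus)):
        print(line)
```

Frequent tokens get small weights and rare ones get large weights, so the lowest-weight path is the most probable one. Tokens containing a colon or a tab would need escaping or filtering before being written out; that is not handled here.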
To add weights to an existing analyser automaton, simple finite-state operations with HFST should be enough: hfst-strings2fst, hfst-compose, hfst-union and so on. You may also need hfst-regexp2fst, hfst-minus or similar to implement something for out-of-vocabulary (OOV) items.
You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ and https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am or see http://opengrm.org/