Ideas for Google Summer of Code/Unsupervised weighting of automata
==Coding challenge==
+ | |||
+ | * Install [[HFST]] |
||
+ | * Install [[lttoolbox]] |
||
+ | * Define an evaluation metric --- talk to your mentor |
||
+ | * Perform a baseline experiment using a tagged corpus: |
||
+ | ** Select a language |
||
+ | ** Split the corpus into 90% training, 10% testing (or use existing test/train split) |
||
+ | ** Use the Apertium morphological analyser to analyse the test data |
||
+ | ** Rank the analyses produced using the training data |
||
+ | ** Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric |
||
+ | ** Weight the analyser automaton based on the tagged corpus |
||
+ | ** Try surface weights, analysis weights, simple combination of the two |
||
+ | |||
==Note==
To calculate the weights, count each token's frequency and the corpus size, convert each relative frequency to a weight with -log(freq/size), and put the results into hfst-strings2fst format (see the man pages for details).
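The counting recipe above can be sketched in a few lines of Python. The tab-separated "string, weight" output is only an approximation of hfst-strings2fst input; check the tool's man page for the exact syntax for attaching weights.

```python
import math
from collections import Counter

# Sketch: turn raw corpus counts into -log(relative frequency) weights,
# so frequent tokens get LOW weight (cheap paths in a weighted automaton).
corpus = "the cat sat on the mat the cat".split()
counts = Counter(corpus)
total = len(corpus)

lines = []
for word, freq in counts.most_common():
    weight = -math.log(freq / total)
    # One entry per line; verify the weight syntax against the
    # hfst-strings2fst man page before feeding this in.
    lines.append(f"{word}\t{weight:.4f}")

print("\n".join(lines))
```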
+ | |||
+ | To add weight to an existing analyser automaton, simple FSA operations with hfst should be enough: hfst-strings2fst, hfst-compose, hfst-union... you may need hfst-regexp2fst, hfst-minus or so to implement something for OOV items. |
||
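Before wiring this up with FSA operations, the intended weighting behaviour, including an OOV fallback, can be prototyped directly. The add-half floor for unseen analyses below is an illustrative assumption, not a prescribed smoothing method.

```python
import math

# Prototype of the logic the FSA operations would implement:
# known analyses get -log(relative frequency); OOV items fall back
# to a floor weight (here: as if they had been seen half a time).
freqs = {"kala+N+Sg": 40, "kala+V": 10}   # toy analysis counts
total = sum(freqs.values())

def weight(analysis):
    if analysis in freqs:
        return -math.log(freqs[analysis] / total)
    # OOV fallback: pretend a count of 0.5 (add-half style floor),
    # guaranteeing a higher weight than any attested analysis.
    return -math.log(0.5 / total)

w_known = weight("kala+N+Sg")   # -log(40/50)
w_oov = weight("talo+N+Sg")     # -log(0.5/50)
```

In the automaton version, the known-analysis part would come from hfst-strings2fst plus hfst-compose, and the OOV part from something like hfst-regexp2fst or hfst-minus carrying the floor weight.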
+ | |||
+ | You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am |
||
+ | |||
+ | or http://opengrm.org/ |
||
+ | |||
+ | |||
+ | [[Category:Ideas for Google Summer of Code|Unsupervised weighting of automata]] |
Latest revision as of 21:13, 18 March 2019