Ideas for Google Summer of Code/Unsupervised weighting of automata

Coding challenge

Install HFST
Install lttoolbox
Define an evaluation metric (talk to your mentor)
Perform a baseline experiment using a tagged corpus:
- Select a language
- Split the corpus into 90% training, 10% testing (or use existing test/train split)
- Use the Apertium morphological analyser to analyse the test data
- Rank the analyses produced using the training data
- Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric
- Weight the analyser automaton based on the tagged corpus
- Try surface weights, analysis weights, and a simple combination of the two
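
The baseline comparison above can be sketched in a few lines of Python. This is only an illustration under assumptions: mean reciprocal rank is used as a stand-in metric (the actual metric should be agreed with your mentor), the function names are hypothetical, and the toy analyses are made up rather than taken from a real Apertium analyser.

```python
import random
from collections import Counter

def train_counts(tagged_tokens):
    """Count how often each analysis string occurs in the training data."""
    return Counter(analysis for _surface, analysis in tagged_tokens)

def rank_by_frequency(candidates, counts):
    """Sort analyses from most to least frequent in training.
    sorted() is stable, so ties keep the analyser's original order."""
    return sorted(candidates, key=lambda a: -counts[a])

def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold analysis (0 if it is missing)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Toy training data: (surface form, gold analysis) pairs (made up).
train = [
    ("saw", "see<vblex><past>"),
    ("saw", "see<vblex><past>"),
    ("saw", "saw<n><sg>"),
    ("books", "book<n><pl>"),
]
# Toy test data: analyses per token, in the order the analyser returned them.
candidates = [
    ["saw<n><sg>", "see<vblex><past>"],
    ["book<vblex><pri><p3><sg>", "book<n><pl>"],
]
gold = ["see<vblex><past>", "book<n><pl>"]

counts = train_counts(train)
rng = random.Random(0)  # fixed seed so the random baseline is repeatable

default_mrr = mean_reciprocal_rank(candidates, gold)
trained_mrr = mean_reciprocal_rank(
    [rank_by_frequency(c, counts) for c in candidates], gold)
random_mrr = mean_reciprocal_rank(
    [rng.sample(c, len(c)) for c in candidates], gold)
```

On real data the candidate lists would come from running the analyser over the test split, and the random baseline is best averaged over several seeds.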

Hint

To calculate the weights, count each token's frequency and the corpus size. To prepare them for HFST automata, take the negative logarithm of the relative frequency, -log(freq), and write the results in the input format expected by hfst-strings2fst (see the man pages for details).
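
As a rough sketch of that calculation, the Python below counts tokens, converts relative frequencies to -log weights, and formats one line per token. The output layout (string, tab, weight) is an assumption about what hfst-strings2fst accepts, so check it against the man page; the function names are hypothetical.

```python
import math
from collections import Counter

def neg_log_weights(tokens):
    """Map each token to -log(count / corpus_size)."""
    counts = Counter(tokens)
    corpus_size = sum(counts.values())
    return {tok: -math.log(n / corpus_size) for tok, n in counts.items()}

def to_strings2fst_lines(weights):
    """Format as 'string<TAB>weight' lines (assumed input layout)."""
    return [f"{tok}\t{w:.4f}" for tok, w in sorted(weights.items())]

# Toy corpus: 'a' occurs twice, 'b' and 'c' once each.
weights = neg_log_weights(["a", "a", "b", "c"])
lines = to_strings2fst_lines(weights)
```

Frequent tokens get small weights and rare tokens get large ones, so the lowest-weight analysis is the most probable one.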

To add weights to an existing analyser automaton, simple FSA operations with HFST should be enough: hfst-strings2fst, hfst-compose, hfst-union, and so on. You may also need hfst-regexp2fst, hfst-minus or similar to handle OOV (out-of-vocabulary) items.
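
To make the intended arithmetic concrete, here is a small pure-Python sketch of what such a weighted combination would compute; it does not call HFST. It assumes tropical weights, where weights add along a path, and it assumes one possible OOV treatment: an unseen item is weighted as if it had been seen once in a corpus one token larger. Both function names are hypothetical.

```python
import math

def oov_weight(corpus_size):
    """Fallback weight for an unseen item: -log(1 / (corpus_size + 1)).
    This is always larger than the weight of any item seen at least once."""
    return -math.log(1.0 / (corpus_size + 1))

def total_weight(surface, analysis, surface_w, analysis_w, fallback):
    """Add the surface weight and the analysis weight,
    using the fallback for anything not seen in training."""
    return surface_w.get(surface, fallback) + analysis_w.get(analysis, fallback)

# Toy weights (made up), as if estimated from a 4-token corpus.
surface_w = {"saw": 0.5}
analysis_w = {"see<vblex><past>": 0.25}
fallback = oov_weight(4)

seen = total_weight("saw", "see<vblex><past>", surface_w, analysis_w, fallback)
unseen = total_weight("saw", "saw<n><sg>", surface_w, analysis_w, fallback)
```

In the automaton itself, the fallback would be a separate weighted path that accepts any string not covered by the corpus-derived weights.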

You can study an example here:
- https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/
- https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am
- or http://opengrm.org/
