Ideas for Google Summer of Code/Unsupervised weighting of automata

Coding challenge

Install HFST
Install lttoolbox
Define an evaluation metric (talk to your mentor)
Perform a baseline experiment using a tagged corpus:
- Select a language
- Split the corpus into 90% training, 10% testing (or use existing test/train split)
- Use the Apertium morphological analyser to analyse the test data
- Rank the analyses produced using the training data
- Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric
- Weight the analyser automaton based on the tagged corpus
- Try surface weights, analysis weights, and a simple combination of the two
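
The baseline steps above can be sketched in Python as follows. This is only an illustration under assumptions not stated on this page: the tagged corpus is assumed to be already loaded as a list of sentences, each a list of (surface, analysis) pairs; the candidate analyses for each test token are assumed to come from the Apertium analyser; and mean reciprocal rank is used as a stand-in metric until you have agreed on one with your mentor. All function names are hypothetical.

```python
import random
from collections import Counter

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentences and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def count_analyses(train):
    """Count how often each (surface, analysis) pair occurs in training."""
    pair_counts = Counter()
    for sentence in train:
        for surface, analysis in sentence:
            pair_counts[(surface, analysis)] += 1
    return pair_counts

def rank(surface, candidates, pair_counts):
    """Order candidates by descending training count.

    sorted() is stable, so unseen or tied candidates keep the
    order in which the analyser produced them.
    """
    return sorted(candidates, key=lambda a: -pair_counts[(surface, a)])

def mean_reciprocal_rank(test_items, ranker):
    """test_items: iterable of (surface, gold_analysis, candidates).

    Assumed metric: average of 1/rank of the gold analysis,
    counting 0 when the gold analysis is not among the candidates.
    """
    total = 0.0
    n = 0
    for surface, gold, candidates in test_items:
        ranked = ranker(surface, candidates)
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
        n += 1
    return total / n if n else 0.0
```

To compare against the default transducer order, pass a ranker that returns the candidates unchanged; for the "random" baseline, pass one that shuffles them.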

Hint

To calculate the weights, count each token's frequency and the corpus size. To prepare the counts for HFST automata, convert each frequency to -log(freq), where freq is the token's count divided by the corpus size, and write the result in the format that hfst-strings2fst reads (see the man pages for details).
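
A minimal Python sketch of this weight calculation is given below. It assumes that hfst-strings2fst accepts one string per line followed by a tab and the weight, and that literal colons in a string must be backslash-escaped; check both assumptions against the man page before relying on them. The function names are hypothetical.

```python
import math
from collections import Counter

def neg_log_weights(tokens):
    """Map each token to -log(count / corpus_size)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}

def escape(s):
    """Escape backslashes and colons (assumed strings2fst convention)."""
    return s.replace("\\", "\\\\").replace(":", "\\:")

def to_strings2fst_lines(weights):
    """Format weights as 'string<TAB>weight' lines, sorted by string."""
    return [f"{escape(tok)}\t{w:.6f}" for tok, w in sorted(weights.items())]
```

Frequent tokens get weights close to zero and rare tokens get larger weights, so the lowest-weight path corresponds to the most frequent item. Tokens that never occur in the corpus get no weight at all here, which is why out-of-vocabulary items need separate handling.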

To add weights to an existing analyser automaton, simple finite-state operations with the HFST tools should be enough: hfst-strings2fst, hfst-compose, hfst-union and so on. You may also need hfst-regexp2fst, hfst-minus or similar to implement something for out-of-vocabulary (OOV) items.

You can study an example here:
- https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/
- https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am

or see http://opengrm.org/
