Ideas for Google Summer of Code/Unsupervised weighting of automata
==Coding challenge==
+ | |||
+ | * Install [[HFST]] |
||
+ | * Install [[lttoolbox]] |
||
+ | * Define an evaluation metric --- talk to your mentor |
||
+ | * Perform a baseline experiment using a tagged corpus: |
||
+ | ** Select a language |
||
+ | ** Split the corpus into 90% training, 10% testing (or use existing test/train split) |
||
+ | ** Use the Apertium morphological analyser to analyse the test data |
||
+ | ** Rank the analyses produced using the training data |
||
+ | ** Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric |
||
+ | ** Weight the analyser automaton based on the tagged corpus |
||
+ | ** Try surface weights, analysis weights, simple combination of the two |
||
+ | |||
==Note==
To calculate the weights, count each token's frequency and the corpus size, convert each relative frequency to a weight with -log(freq/size), and put the results into hfst-strings2fst format (see the man pages for details).
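The counting recipe above can be sketched in a few lines of Python. The tab-separated "string, weight" output is only an approximation of hfst-strings2fst input; check the tool's man page for the exact syntax for attaching weights.

```python
import math
from collections import Counter

# Sketch: turn raw corpus counts into -log(relative frequency) weights,
# so frequent tokens get LOW weight (cheap paths in a weighted automaton).
corpus = "the cat sat on the mat the cat".split()
counts = Counter(corpus)
total = len(corpus)

lines = []
for word, freq in counts.most_common():
    weight = -math.log(freq / total)
    # One entry per line; verify the weight syntax against the
    # hfst-strings2fst man page before feeding this in.
    lines.append(f"{word}\t{weight:.4f}")

print("\n".join(lines))
```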
+ | |||
+ | To add weight to an existing analyser automaton, simple FSA operations with hfst should be enough: hfst-strings2fst, hfst-compose, hfst-union... you may need hfst-regexp2fst, hfst-minus or so to implement something for OOV items. |
||
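Before wiring this up with FSA operations, the intended weighting behaviour, including an OOV fallback, can be prototyped directly. The add-half floor for unseen analyses below is an illustrative assumption, not a prescribed smoothing method.

```python
import math

# Prototype of the logic the FSA operations would implement:
# known analyses get -log(relative frequency); OOV items fall back
# to a floor weight (here: as if they had been seen half a time).
freqs = {"kala+N+Sg": 40, "kala+V": 10}   # toy analysis counts
total = sum(freqs.values())

def weight(analysis):
    if analysis in freqs:
        return -math.log(freqs[analysis] / total)
    # OOV fallback: pretend a count of 0.5 (add-half style floor),
    # guaranteeing a higher weight than any attested analysis.
    return -math.log(0.5 / total)

w_known = weight("kala+N+Sg")   # -log(40/50)
w_oov = weight("talo+N+Sg")     # -log(0.5/50)
```

In the automaton version, the known-analysis part would come from hfst-strings2fst plus hfst-compose, and the OOV part from something like hfst-regexp2fst or hfst-minus carrying the floor weight.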
+ | |||
+ | You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am |
||
+ | |||
+ | or http://opengrm.org/ |
||
+ | |||
+ | |||
+ | [[Category:Ideas for Google Summer of Code|Unsupervised weighting of automata]] |
Latest revision as of 21:13, 18 March 2019