Ideas for Google Summer of Code/Unsupervised weighting of automata

From Apertium


Coding challenge

  • Install HFST
  • Install lttoolbox
  • Define an evaluation metric (talk to your mentor)
  • Perform a baseline experiment using a tagged corpus:
    • Select a language
    • Split the corpus into 90% training, 10% testing (or use existing test/train split)
    • Use the Apertium morphological analyser to analyse the test data
    • Use the training data to rank the analyses produced
    • Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric
    • Weight the analyser automaton based on the tagged corpus
    • Try surface weights, analysis weights, and a simple combination of the two
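
The baseline steps above can be sketched roughly as follows. This is only an illustration, not the required setup: the function names are hypothetical, mean reciprocal rank is just one possible metric (the actual metric should be agreed with your mentor), and the candidate analyses are assumed to come from the morphological analyser in the order it emits them.

```python
import random
from collections import Counter


def split_corpus(items, train_fraction=0.9, seed=0):
    """Shuffle the corpus units and split them into training and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def rank_by_training_counts(analyses, counts):
    """Put the analyses seen most often in training first.

    sorted() is stable, so analyses with equal counts keep the
    analyser's default order.
    """
    return sorted(analyses, key=lambda a: -counts[a])


def random_ranking(analyses, seed=0):
    """A "random" ranking to compare against."""
    shuffled = list(analyses)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold analysis (0 if it is missing)."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, gold):
        if correct in ranked:
            total += 1.0 / (ranked.index(correct) + 1)
    return total / len(gold) if gold else 0.0
```

Here `counts` would be a `Counter` over the gold analyses in the training part. Computing the metric for the default order, the random order and the frequency-based order gives the three numbers to compare.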

Hint

To calculate weights, count each token's frequency and the corpus size. To prepare them for HFST automata, convert each relative frequency to a weight with -log(frequency / corpus size) and write the results in the format expected by hfst-strings2fst (see the man pages for details).
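
A minimal sketch of that counting step, assuming the corpus is already tokenised and that a line of the form string, tab, weight is acceptable input for hfst-strings2fst (check the man page; symbols such as colons in the strings may need escaping). The function names are hypothetical.

```python
import math
from collections import Counter


def surface_weights(tokens):
    """Map each token to -log(relative frequency); rarer tokens get larger weights."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {form: -math.log(count / total) for form, count in counts.items()}


def to_weighted_lines(weights):
    """Format the weights as 'string<TAB>weight' lines (assumed input format)."""
    return [f"{form}\t{weight:.6f}" for form, weight in sorted(weights.items())]


if __name__ == "__main__":
    # Tiny made-up corpus, for illustration only.
    for line in to_weighted_lines(surface_weights(["a", "a", "b", "c"])):
        print(line)
```

The same functions work for analysis weights if the input is a list of analysis strings instead of surface forms.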

To add weights to an existing analyser automaton, simple finite-state operations with HFST should be enough: hfst-strings2fst, hfst-compose, hfst-union, and so on. You may also need hfst-regexp2fst, hfst-minus or similar to handle out-of-vocabulary (OOV) items.
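
To see what these operations should compute before building the automata, the arithmetic can be simulated in plain Python. The sketch below assumes weights are penalties that are added along a path when automata are composed, and that an OOV item receives a fixed fallback penalty. The names and the penalty value are hypothetical, and this does not call HFST itself.

```python
def combined_weight(surface, analysis, surface_w, analysis_w, oov_penalty=10.0):
    """Add the surface weight and the analysis weight.

    Items not seen in training fall back to oov_penalty, which should be
    larger than any weight estimated from the corpus.
    """
    return surface_w.get(surface, oov_penalty) + analysis_w.get(analysis, oov_penalty)


def rank_analyses(surface, analyses, surface_w, analysis_w, oov_penalty=10.0):
    """Order candidate analyses from lowest (best) to highest combined weight."""
    return sorted(
        analyses,
        key=lambda a: combined_weight(surface, a, surface_w, analysis_w, oov_penalty),
    )
```

Comparing the ranking this produces with the ranking from the weighted automaton on a few test words is a cheap sanity check for the pipeline.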

You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ and https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am

or http://opengrm.org/