Ideas for Google Summer of Code/Unsupervised weighting of automata

From Apertium
Jump to navigation Jump to search


Coding challenge

  • Install HFST
  • Install lttoolbox
  • Define an evaluation metric --- talk to your mentor
  • Perform a baseline experiment using a tagged corpus:
    • Select a language
    • Split the corpus into 90% training, 10% testing (or use existing test/train split)
    • Use the Apertium morphological analyser to analyse the test data
    • Rank the analyses produced using the training data
    • Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric
    • Weight the analyser automaton based on the tagged corpus
    • Try surface weights, analysis weights, simple combination of the two

Hinweis

To calculate weights and stuff, just count tokens’ frequencies and corpus size, to prepare that for HFST automata you need to -log(freq) and get the stuff into hfst-strings2fst format (see man pages etc. for details)

To add weight to an existing analyser automaton, simple FSA operations with hfst should be enough: hfst-strings2fst, hfst-compose, hfst-union... you may need hfst-regexp2fst, hfst-minus or so to implement something for OOV items.

You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am

or http://opengrm.org/