Ideas for Google Summer of Code/Unsupervised weighting of automata


Coding challenge

  • Install HFST
  • Install lttoolbox
  • Define an evaluation metric --- talk to your mentor
  • Perform a baseline experiment using a tagged corpus:
    • Select a language
    • Split the corpus into 90% training, 10% testing (or use existing test/train split)
    • Use the Apertium morphological analyser to analyse the test data
    • Rank the analyses produced using the training data
    • Compare this ranking to the default order from the transducer, and to a "random" ranking using your metric (see the sketch after this list)
    • Weight the analyser automaton based on the tagged corpus
    • Try surface weights, analysis weights, and a simple combination of the two
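
A minimal sketch of the baseline experiment (not part of the task itself): it assumes a hypothetical tab-separated tagged corpus (surface form, tab, analysis, one token per line) and stands in for the real analyser with a simple lookup table built from the training half; accuracy@1 and mean reciprocal rank are just one possible choice of metric, so agree on the real one with your mentor. It ranks each token's candidate analyses by training-corpus frequency and compares that ranking to the default order and to a random order.

  import random
  from collections import Counter

  def read_tagged(path):
      """Read a hypothetical tab-separated corpus: one 'surface<TAB>analysis' pair per line."""
      pairs = []
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if line:
                  surface, analysis = line.split("\t", 1)
                  pairs.append((surface, analysis))
      return pairs

  def score(ranked_candidates, gold):
      """Accuracy@1 and reciprocal rank for one token."""
      acc1 = 1.0 if ranked_candidates and ranked_candidates[0] == gold else 0.0
      rr = 1.0 / (ranked_candidates.index(gold) + 1) if gold in ranked_candidates else 0.0
      return acc1, rr

  def evaluate(test, candidates_for, train_freq):
      """Compare frequency ranking, default order and a random order on the test set."""
      totals = {"freq": [0.0, 0.0], "default": [0.0, 0.0], "random": [0.0, 0.0]}
      for surface, gold in test:
          default_order = candidates_for(surface)            # analyser output, default order
          freq_order = sorted(default_order,
                              key=lambda a: -train_freq[a])  # most frequent analysis first
          random_order = random.sample(default_order, len(default_order))
          for name, ranking in (("freq", freq_order),
                                ("default", default_order),
                                ("random", random_order)):
              a, r = score(ranking, gold)
              totals[name][0] += a
              totals[name][1] += r
      n = len(test)
      return {name: (a / n, r / n) for name, (a, r) in totals.items()}

  if __name__ == "__main__":
      corpus = read_tagged("tagged.tsv")                     # hypothetical file name
      split = int(0.9 * len(corpus))                         # 90% training, 10% testing
      train, test = corpus[:split], corpus[split:]
      train_freq = Counter(analysis for _, analysis in train)

      # Stand-in for the analyser: reuse the analyses seen for each surface form in training.
      # In the real experiment this would call the Apertium/HFST morphological analyser.
      by_surface = {}
      for surface, analysis in train:
          by_surface.setdefault(surface, []).append(analysis)
      candidates_for = lambda s: list(dict.fromkeys(by_surface.get(s, ["*unknown*"])))

      for name, (acc1, mrr) in evaluate(test, candidates_for, train_freq).items():
          print(f"{name:8s}  accuracy@1={acc1:.3f}  MRR={mrr:.3f}")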

Note

To calculate the weights, count each token's frequency and the total corpus size, then turn each count into -log(count / corpus size) and put the results into hfst-strings2fst format (see the man pages etc. for details).
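
For illustration, a small sketch of the weight calculation just described: count tokens, turn each count into -log(count / corpus size), and write one "string, tab, weight" line per token for hfst-strings2fst. The file names and the exact weight syntax are assumptions; check the hfst-strings2fst man page for what your HFST version expects.

  import math
  from collections import Counter

  def corpus_weights(tokens):
      """Map each token to -log(relative frequency), the usual tropical-semiring weight."""
      counts = Counter(tokens)
      total = sum(counts.values())
      return {tok: -math.log(c / total) for tok, c in counts.items()}

  if __name__ == "__main__":
      # Hypothetical one-token-per-line training corpus.
      with open("train.tokens", encoding="utf-8") as f:
          tokens = [line.strip() for line in f if line.strip()]

      weights = corpus_weights(tokens)

      # Write 'string<TAB>weight' lines; feed the file to hfst-strings2fst
      # (verify the expected weight syntax against your HFST version's man page).
      with open("weights.strings", "w", encoding="utf-8") as out:
          for tok, w in sorted(weights.items()):
              out.write(f"{tok}\t{w:.6f}\n")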

To add weights to an existing analyser automaton, simple FSA operations with HFST should be enough: hfst-strings2fst, hfst-compose, hfst-union, and so on; you may need hfst-regexp2fst or hfst-minus to implement something for out-of-vocabulary (OOV) items.
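
The same pipeline can also be scripted with the HFST Python bindings instead of the command-line tools; the sketch below is only an outline under that assumption (module availability and exact call signatures may differ between HFST versions, so check the Python API documentation). It builds a weighted acceptor over surface forms and composes it with the analyser, i.e. the surface-weight variant from the coding challenge.

  import math
  from collections import Counter

  import hfst  # HFST Python bindings; availability and exact API are assumptions

  # Build a weighted acceptor over surface forms seen in a (hypothetical) training corpus.
  with open("train.tokens", encoding="utf-8") as f:
      tokens = [line.strip() for line in f if line.strip()]
  counts = Counter(tokens)
  total = sum(counts.values())
  surface_weights = hfst.fst({tok: -math.log(c / total) for tok, c in counts.items()})

  # Load the existing analyser (surface form -> analysis); hypothetical file name.
  instream = hfst.HfstInputStream("analyser.hfst")
  analyser = instream.read()
  instream.close()

  # Both operands must be in the same weighted implementation for composition.
  analyser.convert(hfst.ImplementationType.TROPICAL_OPENFST_TYPE)

  # Surface weighting: weighted surface acceptor composed with the analyser.
  # NB: surface forms missing from the acceptor disappear after composition;
  # OOV items need a fallback, e.g. union with a default-weight acceptor
  # (cf. hfst-union / hfst-minus above).
  surface_weights.compose(analyser)
  surface_weights.minimize()

  outstream = hfst.HfstOutputStream(filename="analyser-weighted.hfst",
                                    type=surface_weights.get_type())
  outstream.write(surface_weights)
  outstream.flush()
  outstream.close()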

You can study an example here: https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/tools/spellcheckers/fstbased/desktop/weighting/ and https://gtsvn.uit.no/langtech/trunk/giella-templates/langs-templates/und/am-shared/tools-spellcheckers-fstbased-desktop_weights-dir-include.am

or http://opengrm.org/