Difference between revisions of "Ideas for Google Summer of Code/Add weights to lttoolbox"

* ftp://ftp.cs.indiana.edu/pub/gasser/eacl09_final.pdf
* http://link.springer.com/article/10.1007%2Fs10590-004-2476-5
* http://www.dwds.de/static/website/publications/text/Geyken_Hanneforth_fsmnlp.pdf

[[Category:Ideas for Google Summer of Code|Add weights to lttoolbox]]

Revision as of 18:40, 12 February 2015

lttoolbox is a set of tools for building finite-state transducers. It would be good to add the possibility of weighting lexemes and analyses.

On this page, weights are defined informally: a larger weight is worse, and two weights can be combined with + to give a larger weight. In theory they could be log probabilities, tropical-semiring weights, or anything else that behaves this way.
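As a rough illustration of the "larger is worse, combine with +" convention, here is a minimal Python sketch. It assumes the tropical-semiring reading (add weights along a path, take the minimum among alternatives) and negative log probabilities as the weights; the function names are hypothetical and not part of lttoolbox.

```python
import math


def neg_log(p):
    """Turn a probability into a penalty: less likely means a larger weight."""
    return -math.log(p)


def extend(w1, w2):
    """Combine weights along one path (tropical 'times' is ordinary +)."""
    return w1 + w2


def choose(w1, w2):
    """Pick between two alternative paths (tropical 'plus' is min)."""
    return min(w1, w2)


# Adding penalties corresponds to multiplying the underlying probabilities.
path = extend(neg_log(0.5), neg_log(0.5))
best = choose(path, neg_log(0.1))
```

With this reading, the best analysis is always the one with the smallest total weight.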

  <section id="main" type="standard">
    <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
  </section>

Output, as a list ordered by weight (lowest first):

^foo/foo<n>/foo<adj>/foo<adv>$
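The ordering above can be reproduced with a small Python sketch. It is not how lttoolbox works internally: the `w` attribute is the proposal from this page, and deriving the tag from the part of the paradigm name after `__` is a simplifying assumption made only for this example (real paradigms define their own tags).

```python
import xml.etree.ElementTree as ET

SECTION = """
<section id="main" type="standard">
  <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
  <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
  <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
</section>
"""


def ranked_analyses(section_xml, surface):
    """Return the analyses of `surface`, lowest weight first."""
    root = ET.fromstring(section_xml.strip())
    scored = []
    for entry in root.iter("e"):
        if entry.findtext("i") != surface:
            continue
        # Assumption: a missing w attribute means weight 0.
        weight = float(entry.get("w", "0"))
        # Assumption: the tag is whatever follows "__" in the paradigm name.
        tag = entry.find("par").get("n").rsplit("__", 1)[-1]
        scored.append((weight, f"{entry.get('lm')}<{tag}>"))
    scored.sort(key=lambda pair: pair[0])  # larger weight is worse
    return "^" + "/".join([surface] + [a for _, a in scored]) + "$"


print(ranked_analyses(SECTION, "foo"))  # ^foo/foo<n>/foo<adj>/foo<adv>$
```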

There are a couple of easy ways to obtain weights:

  • gold-standard tagged corpora and
  • untagged corpora

can each be counted with sort | uniq -c, and the resulting counts composed with the analysis side and the surface side of the monodix automaton, respectively. In addition:

  • rules can assign weights to tag sequences: <sg><ins> += 123, <sg><gen> += 0.1, etc.
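Both ideas can be sketched in Python. This is only an illustration under stated assumptions: counts are turned into weights as negative log relative frequencies (one possible choice, not prescribed by this page), the rule table reuses the two example increments above, and all function names and sample analyses are hypothetical.

```python
import math
import re
from collections import Counter


def corpus_weights(tokens):
    """Count items (like sort | uniq -c) and convert counts to weights.

    Assumption: weight = -log(relative frequency), so rarer is worse.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {item: -math.log(n / total) for item, n in counts.items()}


# Rule increments taken from the example: <sg><ins> += 123, <sg><gen> += 0.1
RULES = {("sg", "ins"): 123.0, ("sg", "gen"): 0.1}


def apply_rules(analysis, weight, rules=RULES):
    """Add each rule's increment if its tag sequence occurs in the analysis."""
    tags = re.findall(r"<([^>]+)>", analysis)
    for pattern, increment in rules.items():
        n = len(pattern)
        windows = (tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
        if pattern in windows:
            weight += increment
    return weight


# A tiny made-up tagged corpus: the noun reading is seen three times as often.
weights = corpus_weights(["foo<n>", "foo<n>", "foo<n>", "foo<adj>"])
```

For an untagged corpus the same counting would be applied to surface forms instead of analyses.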

Notes:

  • HFST already supports weighted automata out of the box.
  • Ideally, weights should be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking could all make reasonable use of them.

Tasks

  • ..

Coding challenge

  • ..

Further reading