Ideas for Google Summer of Code/Add weights to lttoolbox
Revision as of 13:14, 12 February 2015 by Francis Tyers (talk | contribs)
lttoolbox is a set of tools for building finite-state transducers. It would be useful to add support for weighting lexemes and analyses.
On this page, weights are defined intuitively: a larger weight is worse, and two weights can be combined with + to give a larger weight. In theory they could be log probabilities, weights in the tropical semiring, or something similar.
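One possible reading of this, sketched in Python: assume each weight is the negative log of a probability, as in the tropical semiring. Weights along one path are then summed, and among competing paths the smallest weight wins. The function names are hypothetical and are not part of lttoolbox.

```python
import math

def prob_to_weight(p: float) -> float:
    """Negative natural log of a probability: likelier items get smaller weights."""
    return -math.log(p)

def path_weight(weights):
    """Weights along a single path are added together."""
    return sum(weights)

def best_weight(alternatives):
    """Among competing paths, the smallest total weight is preferred."""
    return min(alternatives)

# Adding the weights of two arcs corresponds to multiplying their probabilities.
combined = path_weight([prob_to_weight(0.5), prob_to_weight(0.25)])
assert math.isclose(combined, prob_to_weight(0.125))
```

With this choice, "larger is worse" follows directly: a smaller probability gives a larger weight.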
<section id="main" type="standard">
  <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
  <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
  <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
</section>
Output, ordered by weight:
^foo/foo<n>/foo<adj>/foo<adv>$
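The ordering above can be sketched as a sort on weight followed by formatting in the Apertium stream format. This is a minimal illustration, assuming the analyser returns (analysis, weight) pairs. The helper `format_analyses` is hypothetical and not an lttoolbox function.

```python
def format_analyses(surface, weighted_analyses):
    """Return a lexical unit with analyses ordered from lowest to highest weight."""
    ordered = sorted(weighted_analyses, key=lambda pair: pair[1])
    return "^" + "/".join([surface] + [analysis for analysis, _ in ordered]) + "$"

# Weights taken from the w attributes in the dictionary entries above.
entries = [("foo<adv>", 0.7), ("foo<n>", 0.1), ("foo<adj>", 0.2)]
print(format_analyses("foo", entries))  # prints ^foo/foo<n>/foo<adj>/foo<adv>$
```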
There are a couple of easy ways to obtain weights:
- gold-standard tagged corpora and
- untagged corpora
can be counted with sort | uniq -c and composed onto the analysis and surface sides of the monodix automaton, respectively. In addition:
- rules can assign weights to tag sequences:
<sg><ins> += 123,
<sg><gen> += 0.1, etc.
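Both approaches can be sketched in a few lines of Python. This is an illustration only. It assumes that corpus counts are turned into weights as negative log relative frequencies, and that a rule adds its penalty whenever its tag sequence occurs in an analysis. The names `weights_from_counts`, `apply_rules` and `RULES` are hypothetical.

```python
import math
from collections import Counter

def weights_from_counts(items):
    """Tally items (as sort | uniq -c would) and map each to -log(relative frequency)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: -math.log(n / total) for item, n in counts.items()}

# Hypothetical rule table: tag sequence -> weight to add.
RULES = {"<sg><ins>": 123.0, "<sg><gen>": 0.1}

def apply_rules(analysis, weight, rules=RULES):
    """Add the penalty of every rule whose tag sequence occurs in the analysis."""
    for tags, penalty in rules.items():
        if tags in analysis:
            weight += penalty
    return weight

# A tiny tagged "corpus": the noun reading is three times as frequent.
corpus = ["foo<n>", "foo<n>", "foo<n>", "foo<adj>"]
weights = weights_from_counts(corpus)
assert weights["foo<n>"] < weights["foo<adj>"]
```

For a tagged corpus the items would be analyses, and for an untagged corpus they would be surface forms. Composing the resulting weights onto the automaton is not shown here.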
Notes:
- HFST supports weighted automata out of the box already.
- Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking could all make reasonable use of them.