Difference between revisions of "Ideas for Google Summer of Code/Add weights to lttoolbox"

Line 5:

  <pre>
  <section id="main" type="standard">
- <e lm="foo" w="0.7"><i>foo</i><par n="foo__n"/></e>
+ <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
  <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
- <e lm="foo" w="0.1"><i>foo</i><par n="baz__adv"/></e>
+ <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
  </section>
  </pre>

Revision as of 13:14, 12 February 2015

lttoolbox is a set of tools for building finite-state transducers. It would be good to add the possibility of weighting lexemes and analyses.

On this page, weights are defined intuitively: a larger weight is worse, and two weights can be combined with + to give a larger weight. In theory they could be (negative) log-probabilities, tropical-semiring weights, or something similar.
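As a minimal sketch of one such interpretation, the Python below treats a weight as a negative log-probability, so that adding weights corresponds to multiplying probabilities and the best analysis is the one with the smallest weight. The function names are hypothetical and are not part of lttoolbox.

```python
import math


def weight_from_probability(p: float) -> float:
    """Negative log-probability: likelier events get smaller (better) weights."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


def combine(w1: float, w2: float) -> float:
    """Combine weights along one path; under this reading, + multiplies probabilities."""
    return w1 + w2


def best(*weights: float) -> float:
    """Choose between alternative paths: larger is worse, so take the minimum."""
    return min(weights)
```

Under this reading, combine and best play the roles of the two tropical-semiring operations (+ and min).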

  <section id="main" type="standard">
    <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
  </section>

The analyses would then be output as a list ordered by weight, best (lowest) first:

^foo/foo<n>/foo<adj>/foo<adv>$
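The ordering can be illustrated with a short Python sketch that reads the w attribute proposed above and sorts the analyses by it. This is not how lttoolbox itself works: a real analyser expands the paradigms from their pardef entries, whereas here the suffix of the paradigm name (the part after "__") is assumed to stand in for the tag, and ordered_analyses is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Entries deliberately listed out of weight order.
SECTION = """
<section id="main" type="standard">
  <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
  <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
  <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
</section>
"""


def ordered_analyses(section_xml: str, surface: str) -> str:
    """Return the analyses of `surface`, lowest weight first, in Apertium stream format."""
    root = ET.fromstring(section_xml.strip())
    scored = []
    for entry in root.iter("e"):
        par = entry.find("par")
        if entry.findtext("i") != surface or par is None:
            continue
        weight = float(entry.get("w", "0"))  # assumption: a missing weight counts as 0
        tag = par.get("n").rsplit("__", 1)[-1]  # assumption: name suffix stands in for the tag
        scored.append((weight, f"{entry.get('lm')}<{tag}>"))
    scored.sort(key=lambda pair: pair[0])  # larger is worse, so ascending order
    return "^" + "/".join([surface] + [analysis for _, analysis in scored]) + "$"


print(ordered_analyses(SECTION, "foo"))
```

For the three entries above this prints ^foo/foo<n>/foo<adj>/foo<adv>$, whatever order the entries appear in the dictionary.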

There are a couple of easy ways to obtain weights:

  • gold-standard tagged corpora and
  • untagged corpora

can be run through sort | uniq -c, and the resulting counts composed with the analysis and surface sides of the monodix automaton respectively. In addition:

  • rules can assign weights to tag sequences: <sg><ins> += 123, <sg><gen> += 0.1, etc.
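Both ideas can be sketched in a few lines of Python. The first function does the counting that sort | uniq -c would do and turns relative frequencies into negative log weights, which is only one possible choice of weight; the second adds rule-based penalties for tag sequences, using the two example values above. All names are hypothetical, composing the result with the monodix automaton is not shown, and analyses unseen in the corpus get no weight at all (no smoothing).

```python
import math
from collections import Counter


def weights_from_tagged_corpus(analyses):
    """Count analyses (like `sort | uniq -c`) and map each to -log(relative frequency)."""
    counts = Counter(analyses)
    total = sum(counts.values())
    return {analysis: -math.log(count / total) for analysis, count in counts.items()}


# Illustrative rule table taken from the examples in the text.
TAG_RULES = {"<sg><ins>": 123.0, "<sg><gen>": 0.1}


def apply_tag_rules(analysis, weight, rules=TAG_RULES):
    """Add the penalty of every rule whose tag sequence occurs in the analysis."""
    for tags, penalty in rules.items():
        if tags in analysis:
            weight += penalty
    return weight


corpus = ["foo<n>"] * 7 + ["foo<adj>"] * 2 + ["foo<adv>"]
weights = weights_from_tagged_corpus(corpus)
```

With this toy corpus the most frequent analysis, foo<n>, receives the smallest weight and foo<adv> the largest.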

Notes:

  • HFST already supports weighted automata out of the box.
  • Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking can all make reasonable use of them.


Further reading