Difference between revisions of "Ideas for Google Summer of Code/Add weights to lttoolbox"

Revision as of 13:14, 12 February 2015

lttoolbox is a set of tools for building finite-state transducers. It would be good to add support for weighting lexemes and analyses.

Weights on this page are defined intuitively: a larger weight is worse, and two weights can be combined with + to give a larger weight. In theory they could be negative log probabilities, tropical semiring weights, or anything similar.

  <section id="main" type="standard">
    <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
  </section>

The output is a list of analyses ordered by weight, best first:

^foo/foo<n>/foo<adj>/foo<adv>$
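The ordering above can be sketched in Python as follows. This is only an illustration of the intended behaviour, not how lttoolbox works internally: the function name `ranked_analyses` is hypothetical, a missing `w` attribute is assumed to mean a weight of 0, and the tag is read off the paradigm name (the part after `__`) as a shortcut instead of expanding the paradigm.

```python
import xml.etree.ElementTree as ET

# The weighted section from the example above.
DIX_SECTION = """
<section id="main" type="standard">
  <e lm="foo" w="0.1"><i>foo</i><par n="foo__n"/></e>
  <e lm="foo" w="0.2"><i>foo</i><par n="bar__adj"/></e>
  <e lm="foo" w="0.7"><i>foo</i><par n="baz__adv"/></e>
</section>
"""

def ranked_analyses(section_xml, surface):
    """Return a lexical unit whose analyses are sorted by ascending weight."""
    root = ET.fromstring(section_xml.strip())
    scored = []
    for entry in root.iter("e"):
        if entry.findtext("i") != surface:
            continue
        # Shortcut (assumption): take the tag from the paradigm name, e.g. "foo__n" -> "n".
        tag = entry.find("par").get("n").split("__")[-1]
        # Assumption: an entry without a w attribute has weight 0.
        weight = float(entry.get("w", "0"))
        scored.append((weight, f"{entry.get('lm')}<{tag}>"))
    # Larger is worse, so the smallest weight comes first.
    scored.sort(key=lambda pair: pair[0])
    return "^" + "/".join([surface] + [analysis for _, analysis in scored]) + "$"

print(ranked_analyses(DIX_SECTION, "foo"))
```

Because the sort key is the weight alone, the order of the entries in the dictionary does not affect the order of the output.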

There are a couple of easy ways to obtain weights:

gold-standard tagged corpora and
untagged corpora

can be counted with sort | uniq -c, and the resulting counts composed onto the analysis and surface sides of the monodix automaton, respectively. And:

rules can assign weights to tag sequences: <sg><ins> += 123, <sg><gen> += 0.1, etc.
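A minimal Python sketch of both ideas, under stated assumptions: corpus counts are turned into weights as the negative log of the relative frequency (one possible choice, since the page leaves the weight type open), and rules add a fixed amount whenever their tag sequence occurs in an analysis. The names `weights_from_counts`, `apply_rules` and `RULES`, and the toy corpus, are hypothetical; composing the weights onto the automaton itself is not shown.

```python
import math
from collections import Counter

def weights_from_counts(items):
    """Turn raw frequencies (what sort | uniq -c reports) into weights.

    Assumption: weight = -log(relative frequency), so frequent items get
    small weights and rare items get large ones. No smoothing is applied.
    """
    counts = Counter(items)
    total = sum(counts.values())
    return {item: -math.log(n / total) for item, n in counts.items()}

# Hypothetical rule table: tag sequence -> amount added to the weight.
RULES = {"<sg><ins>": 123.0, "<sg><gen>": 0.1}

def apply_rules(analysis, weight, rules=RULES):
    """Add the penalty of every rule whose tag sequence occurs in the analysis."""
    return weight + sum(p for tags, p in rules.items() if tags in analysis)

# Toy stand-in for the analysis side of a gold-standard tagged corpus.
tagged = ["foo<n><sg><gen>", "foo<n><sg><gen>", "foo<n><sg><gen>", "foo<n><sg><ins>"]
corpus_weights = weights_from_counts(tagged)
final_weights = {a: apply_rules(a, w) for a, w in corpus_weights.items()}

# Print analyses from best (smallest weight) to worst.
for analysis, weight in sorted(final_weights.items(), key=lambda kv: kv[1]):
    print(f"{weight:10.4f}  {analysis}")
```

Since weights combine by addition, the corpus-derived weight and the rule penalties can simply be summed, which matches the "larger is worse" convention used on this page.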

Notes:

HFST supports weighted automata out of the box already.

Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation, and chunking can all make reasonable use of them.


