Ideas for Google Summer of Code/Add weights to lttoolbox

Revision as of 05:29, 17 February 2015

lttoolbox is a set of tools for building finite-state transducers. It would be good to add the possibility of weighting lexemes and analyses.

Weights on this page are defined intuitively: a larger weight is worse, and two weights can be combined with + to give a larger weight. In theory they can be (negative) log probabilities, tropical semiring weights, or something similar.
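
As an illustration of this convention, here is a minimal Python sketch that treats weights as negative log probabilities, so that adding weights along a path corresponds to multiplying probabilities and the smallest weight is the best. The function names are hypothetical and are not part of lttoolbox.

```python
import math

def prob_to_weight(p: float) -> float:
    """Convert a probability to a weight: rarer events get larger (worse) weights."""
    return -math.log(p)

def combine(w1: float, w2: float) -> float:
    """Combine two weights on the same path; adding weights multiplies probabilities."""
    return w1 + w2

def best(weights):
    """Choose between alternatives: the smallest weight wins."""
    return min(weights)

w_common = prob_to_weight(0.9)  # small weight, likely analysis
w_rare = prob_to_weight(0.1)    # large weight, unlikely analysis
```

Under this assumption the + operation behaves as described above: combining two weights never gives a smaller weight, because each weight is non-negative.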

Apertium pipeline vs. weights

The Apertium pipeline is as follows:

  1. morphological analysis
    1. POS tagging
    2. disambiguation
  2. lexical transfer
    1. lexical selection
  3. structural transfer
  4. morphological generation

For most of these tasks, the possible sources of weights are well understood and clear-cut:

  1. WFST morphological analysers (surface unigrams, compound n-grams, morph n-grams, lemma weights, POS tag sequence weights)
    1. N-gram FSTs (OpenGRM etc.)
    2. CG with weights!
  2. IBM model 1, Apertium's current lexical selection module, etc.
  3. Something from SMT, perhaps simple weights?
  4. The generator is the WFST morphological analyser inverted

Weights can, and should, be scaled and combined. The formulas for combining them can be learnt from gold corpora without supervision.
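
One simple way to scale and combine weights is a weighted sum, with one scale factor per module. The sketch below assumes this linear form; the module names, scale factors and candidate weights are made up for illustration, and the page does not prescribe any particular formula.

```python
def combined_weight(weights, scales):
    """Weighted sum of the per-module weights of one candidate analysis."""
    return sum(scales[name] * w for name, w in weights.items())

# Hypothetical per-module weights for two candidate analyses of "foo".
candidates = {
    "foo<n>":   {"analyser": 1.0, "tagger": 3.0},
    "foo<adj>": {"analyser": 2.0, "tagger": 0.5},
}

# Hypothetical scale factors; these are the values that tuning would adjust.
scales = {"analyser": 1.0, "tagger": 0.5}

totals = {c: combined_weight(w, scales) for c, w in candidates.items()}
best = min(totals, key=totals.get)  # smaller is better
```

With these scale factors the tagger's preference outweighs the analyser's, so foo<adj> wins; setting the tagger scale to 0 would make foo<n> win instead.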

Syntax for weights in dix and t?x files

The task should include designing formats for weights in dictionaries.

Writing weights for lemmas:

  <section id="main" type="standard">
    <e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="7"><i>foo</i><par n="baz__adv"/></e>
  </section>

The same attribute could perhaps be used for other parts, such as paradigm entries:

  <pardef n="foo__n">
    <e w="1"><p><l/><r><s n="sg"/></r></p></e>
    <e w="4"><p><l>s</l><r><s n="pl"/></r></p></e>
  </pardef>
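
To show how the two kinds of weight might interact, here is a small Python sketch that reads a dictionary fragment with the proposed w attribute and adds the entry weight to the paradigm weight for each form. It is only a sketch under assumptions: the w attribute is the proposal on this page, adding the two weights is one possible interpretation, and the enclosing dictionary and pardefs elements are added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DIX = """
<dictionary>
  <pardefs>
    <pardef n="foo__n">
      <e w="1"><p><l/><r><s n="sg"/></r></p></e>
      <e w="4"><p><l>s</l><r><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
  </section>
</dictionary>
"""

def expand(dix_xml):
    """Return (surface, analysis, weight) triples for every entry/paradigm pair."""
    root = ET.fromstring(dix_xml)
    pardefs = {p.get("n"): p for p in root.iter("pardef")}
    results = []
    for entry in root.find("section").findall("e"):
        stem = entry.findtext("i", default="")
        entry_weight = float(entry.get("w", "0"))  # a missing weight counts as 0
        pardef = pardefs[entry.find("par").get("n")]
        for pe in pardef.findall("e"):
            suffix = pe.findtext("p/l", default="")
            tags = "".join("<%s>" % s.get("n") for s in pe.findall("p/r/s"))
            weight = entry_weight + float(pe.get("w", "0"))
            results.append((stem + suffix, stem + tags, weight))
    return results

for surface, analysis, weight in expand(DIX):
    print(surface, analysis, weight)
```

Here the singular form gets weight 1 + 1 = 2 and the plural gets 1 + 4 = 5, so the plural reading is the less preferred one.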

Pipeline outputs with weights

Morphological analysis with weights should output an ordered list of analyses (with weights):

^foo/foo<n>/foo<adj>/foo<adv>$
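
The ordering can be sketched in Python as follows, using the lemma weights 1, 2 and 7 from the dictionary example. How the weights themselves should be printed in the stream is left open on this page, so the sketch only orders the analyses and does not print any weights; the function name is hypothetical.

```python
def format_lexical_unit(surface, weighted_analyses):
    """Order analyses by ascending weight (best first) and join them in stream format."""
    ordered = sorted(weighted_analyses, key=lambda pair: pair[1])
    return "^" + "/".join([surface] + [analysis for analysis, _ in ordered]) + "$"

# Analyses in arbitrary order, paired with their weights.
analyses = [("foo<adv>", 7.0), ("foo<n>", 1.0), ("foo<adj>", 2.0)]
print(format_lexical_unit("foo", analyses))
```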

Acquiring weights

There are a couple of easy ways to acquire weights:

  • gold-standard tagged corpora and
  • untagged corpora

can be counted (sort | uniq -c) and the counts composed with the analysis and surface sides of the monodix automaton, respectively. In addition:

  • rules can assign weights to tag sequences: <sg><ins> += 123, <sg><gen> += 0.1, etc.
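
The counting step can be sketched in Python as below. It does the equivalent of sort | uniq -c on a list of tokens and turns the relative frequencies into negative log probabilities, which is one assumed choice of weight; composing the result with the monodix automaton is not shown. The function name and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def weights_from_counts(items):
    """Count occurrences and convert relative frequencies to weights (larger is worse)."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: -math.log(n / total) for item, n in counts.items()}

# Toy "gold-standard tagged corpus": one analysis per token.
tagged = ["foo<n>", "foo<n>", "foo<n>", "foo<adj>"]
weights = weights_from_counts(tagged)
```

The same function could be applied to surface forms from an untagged corpus; analyses or forms that never occur get no weight at all here, so a real tool would need some default or smoothing.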

Notes on weights

  • HFST already supports weighted automata out of the box (OpenFst, pyfst and libhfst-python are also good candidates for prototyping)
  • Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking can all make reasonable use of them
  • For lexical transfer, the weights could basically be IBM model 1 mapped onto lttoolbox
  • Compare how SMT systems handle probabilities

Tasks

  • syntaxes for weights
  • specs for weight annotations in pipeline
  • tools for acquiring / converting weights
  • tuning weight combinations

Coding challenge

Further reading