Ideas for Google Summer of Code/Add weights to lttoolbox

Apertium pipeline vs. weights[edit]

The apertium pipeline is like so:

morphological analysis
1. POS tagging
2. disambiguation
lexical transfer
1. lexical selection
structural transfer
morphological generation

For most of the tasks the weights are well understood and clear cut:

WFST morphological analysers (surface unigrams, compound n-grams, morph n-grams, lemma weights, pos tag sequence weights)
1. N-gram FSTs (OpenGRM etc.)
2. CG with weights!
IBM model 1, apertium's current lexical selection module, etc.
Something from SMT, simple weights for stuffs?
1. there's whole separate gsoc project for this
WFST morph. analyser is generator inversed

Weights can / should be scaled and combined. The formulas for combining them can be learnt from gold corpora unsupervisedly.

Syntaxes for weights in dixes and t?xes and all[edit]

Task should include writing some formats for weights in dictionaries...

Writing weights for lemmas:

  <section id="main" type="standard">
    <e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="7"><i>foo</i><par n="baz__adv"/></e>
  </section>

Same stuff for other parts...?

  <pardef n="foo__n">
    <e w="1"><p><l/><r><s n="sg"/></e>
    <e w="4"><p><l>s</l><r><s n="pl"/></e>
  </pardef>

Pipeline outputs with weights[edit]

Morph. Analysis with weights should output ordered list (with weights):

^foo/foo<n>/foo<adj>/foo<adv>$

Acquiring weights[edit]

There're couple of ways to easily weight:

gold-standard tagged corpora and
untagged corpora

can be sort | uniq -c'd and composed to analysis and surface sides of the monodix automaton respectively. And:

rules can tell weights to stuffs: <sg><ins> += 123, <sg><gen> += 0.1, etc.

Notes with weights[edit]

HFST supports weighted automata out of the box already (and OpenFst and pyfst and libhfst-python are all good candidates for prototyping)

Weights should ideally be supported throughout the pipeline, analysis, disambiguation, lexical transfer/selection, translation, chunking can all use it reasonably

For the lexical transfer the weights could basically be model 1 onto lttoolbox.

compare how SMT's do probabilities

Related features[edit]

Section weights[edit]

A .dix in lttoolbox can have several "sections", each of these is a separate (non-merged) fst in the compiled binary.

We would also like to have weights on whole sections – this is easier to implement than arc-weights. If you have three sections, where a and b have weight 1 and c has weight 3, you would first try to analyse the word with a ∪ b and if it's known, you output that analysis (ignoring c), but if it's still unknown, you analyse with c. This way, we can have e.g. heuristics in section c, that don't pollute the output in the "good" known analyses.

Recursive paradigms[edit]

The lttoolbox FST's do allow cycles, but the .dix format only allows specifying cycles using the <re> syntax, not with paradigms.

We could do this:

on seeing a 'par n="foo"' before seeing 'pardef n="foo"', we create a "stub" node "foo" and store it in a 'map<string,node> undefinedPar'.
on then seeing 'pardef n="foo"', we create that fst and let 'undefinedPar["foo"]' epsilon-arc into it
after reading the whole file, we go through undefinedPar and exit with error if any of the nodes there don't have an outgoing arc at all.
(minimisation strips the epsilon-arcs)

Tasks[edit]

syntaxes for weights

specs for weight annotations in pipeline

tools for acquiring / converting weights

tuning weight combinations

Implement weighted arcs in lttoolbox (C++) and integrate into Apertium.

Implement weighted sections in lttoolbox

Implement recursive paradigms in lttoolbox

Coding challenge[edit]

Add section weights to lttoolbox as described above

Ideas for Google Summer of Code/Add weights to lttoolbox

Contents

Apertium pipeline vs. weights[edit]

Syntaxes for weights in dixes and t?xes and all[edit]

Pipeline outputs with weights[edit]

Acquiring weights[edit]

Notes with weights[edit]

Related features[edit]

Section weights[edit]

Recursive paradigms[edit]

Tasks[edit]

Coding challenge[edit]

Further reading[edit]

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools