Ideas for Google Summer of Code/Extend lttoolbox to have the power of HFST


Some Apertium language pairs use HFST, whereas most use Apertium's own lttoolbox. The reason is that morphologies for languages with features such as the vowel harmony found in Turkic languages are very hard to write in the format that lttoolbox currently supports. Mixing HFST and lttoolbox makes some language pairs harder to develop. A further problem is that non-command-line tools such as Mitzuli, Apy and OmegaT do not support HFST-based language pairs.


Missing features

The following HFST features are absent from Apertium's lttoolbox or are implemented unsatisfactorily:

  • morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä, etc.
  • probabilities and weights, e.g. dogs is likelier to be a noun than a verb; kuin (Finnish) is likelier to be the conjunction 'than' than the instructive plural 'with moons as tools'
  • separated morphotactics, e.g. a prefix depending on a suffix
  • compounding, e.g. with deverbal nouns or other non-trivial cases
  • derivation, e.g. with loops (a monodix, for example, cannot make a forward reference to a pardef, which makes many valid constructions illegal or very cumbersome)
  • robustness: trying to compile any complex morphology with lttoolbox will segfault, e.g. https://github.com/flammie/apertium-fin/blob/master/apertium-fin.fin.dix

The following lttoolbox features are hacky in the HFST versions of the tools:

Morphographemics

Read twolc documentation?
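The kind of rule meant here (n -> m before p, a -> ä after ä) can be illustrated with a small sketch. This is not twolc: two-level rules are parallel constraints compiled into transducers, whereas the code below is a plain sequential rewrite in Python. The morpheme-boundary marker `>` and the function name `apply_rules` are assumptions made for the illustration.

```python
import re

# Hypothetical morpheme-boundary marker; a real lexc/twolc setup may use another symbol.
BOUNDARY = ">"

def apply_rules(form):
    """Apply two toy morphographemic rules to a form like 'stem>suffix'."""
    # Rule 1: n -> m before p, also when a morpheme boundary intervenes.
    form = re.sub(r"n(?=" + re.escape(BOUNDARY) + r"?p)", "m", form)

    # Rule 2 (toy vowel harmony): a -> ä in suffixes when the stem contains ä.
    stem, *suffixes = form.split(BOUNDARY)
    if "ä" in stem:
        suffixes = [s.replace("a", "ä") for s in suffixes]

    # Drop the boundary markers to get the surface form.
    return stem + "".join(suffixes)

print(apply_rules("in>possible"))  # impossible
print(apply_rules("kylä>ssa"))     # kylässä
print(apply_rules("talo>ssa"))     # talossa
```

In lttoolbox today, each of these alternations has to be spelled out by hand in separate paradigms, which is what makes languages with pervasive harmony so laborious to describe.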

Weights (statistics etc.)

On this page, weights are defined intuitively: a larger weight is worse, and two weights can be added with + to give a correspondingly larger weight. In theory they can be log probabilities, tropical-semiring weights, or anything similar.
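As a minimal sketch of that convention, assume the weights are negative log probabilities in the tropical semiring: weights along one path are added, and the best of several competing paths is the one with the smallest weight. The function names `weight`, `extend` and `best` are made up for this illustration.

```python
import math

def weight(prob):
    """Turn a probability into a weight: likelier events get smaller weights."""
    return -math.log(prob)

def extend(w1, w2):
    """Combine two weights along one path (tropical 'times' is ordinary +)."""
    return w1 + w2

def best(*weights):
    """Choose among competing paths (tropical 'plus' is min)."""
    return min(weights)

# Adding weights corresponds to multiplying the underlying probabilities.
path = extend(weight(0.5), weight(0.5))
print(math.isclose(path, weight(0.25)))  # True

# A likely reading beats an unlikely one.
print(best(weight(0.9), weight(0.1)) == weight(0.9))  # True
```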

The Apertium pipeline looks like this:

  1. morphological analysis
    1. POS tagging
    2. disambiguation
  2. lexical transfer
    1. lexical selection
  3. structural transfer
  4. morphological generation

For most of these tasks, the weights are well understood and clear-cut:

  1. WFST morphological analysers (surface unigrams, compound n-grams, morph n-grams, lemma weights, POS tag sequence weights)
    1. N-gram FSTs (OpenGRM etc.)
    2. CG with weights!
  2. IBM Model 1, Apertium's current lexical selection module, etc.
  3. Something from SMT, or simple weights?
    1. there is a whole separate GSoC project for this
  4. The generator is the WFST morphological analyser inverted

Weights can and should be scaled and combined. The formulas for combining them can be learnt automatically from gold corpora.
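One simple way to scale and combine weights is a weighted sum over the pipeline modules. The sketch below assumes exactly that; the module names, the weights and the scale factors are all invented for illustration, and in practice the scale factors would be the part learnt from corpora.

```python
def combine(component_weights, scales):
    """Scaled sum of per-module weights; larger is still worse."""
    return sum(scales[name] * w for name, w in component_weights.items())

def rank(candidates, scales):
    """Order candidate readings from best (smallest combined weight) to worst."""
    return sorted(candidates, key=lambda reading: combine(candidates[reading], scales))

# Hypothetical weights from two modules for two readings of one token.
candidates = {
    "foo<n>": {"analysis": 1.0, "tagger": 0.5},
    "foo<adj>": {"analysis": 2.0, "tagger": 0.2},
}

# Trusting both modules about equally prefers the noun reading ...
print(rank(candidates, {"analysis": 1.0, "tagger": 2.0}))   # ['foo<n>', 'foo<adj>']
# ... while trusting the tagger much more flips the order.
print(rank(candidates, {"analysis": 1.0, "tagger": 10.0}))  # ['foo<adj>', 'foo<n>']
```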

The task should include designing formats for writing weights in dictionaries...

Writing weights for lemmas:

  <section id="main" type="standard">
    <e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="2"><i>foo</i><par n="bar__adj"/></e>
    <e lm="foo" w="7"><i>foo</i><par n="baz__adv"/></e>
  </section>

The same could be done for other parts, such as paradigm entries...?

  <pardef n="foo__n">
    <e w="1"><p><l/><r><s n="sg"/></r></p></e>
    <e w="4"><p><l>s</l><r><s n="pl"/></r></p></e>
  </pardef>

Morphological analysis with weights should output an ordered list (with weights):

^foo/foo<n>~1/foo<adj>~2/foo<adv>~3$
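A rough prototype of such an analyser can be written over the XML directly, without compiling a transducer. The sketch below assumes the proposed `w` attribute and the `~weight` output notation from this page, and it further assumes that a lemma weight and a paradigm-entry weight are combined by addition. The part-of-speech tags inside the pardefs and the `bar__adj` paradigm body are added here only to make the toy dictionary complete.

```python
import xml.etree.ElementTree as ET

# Toy dictionary using the proposed (not yet existing) w attribute.
DIX = """
<dictionary>
  <pardefs>
    <pardef n="foo__n">
      <e w="1"><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e w="4"><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
    <pardef n="bar__adj">
      <e w="1"><p><l/><r><s n="adj"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
    <e lm="foo" w="2"><i>foo</i><par n="bar__adj"/></e>
  </section>
</dictionary>
"""

def analyse(surface, dix_xml=DIX):
    """Return all readings of a surface form, best (smallest weight) first."""
    root = ET.fromstring(dix_xml)
    pardefs = {p.get("n"): p for p in root.iter("pardef")}
    readings = []
    for entry in root.find("section").findall("e"):
        stem = entry.findtext("i")
        for ending in pardefs[entry.find("par").get("n")].findall("e"):
            suffix = ending.findtext("p/l") or ""
            if stem + suffix != surface:
                continue
            tags = "".join("<%s>" % s.get("n") for s in ending.find("p/r").iter("s"))
            # Assumption: lemma weight and paradigm weight are simply added.
            total = float(entry.get("w")) + float(ending.get("w"))
            readings.append((total, stem + tags))
    readings.sort()
    return "^" + surface + "".join("/%s~%g" % (a, w) for w, a in readings) + "$"

print(analyse("foo"))   # ^foo/foo<n><sg>~2/foo<adj>~3$
print(analyse("foos"))  # ^foos/foo<n><pl>~5$
```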

There are a couple of easy ways to obtain weights:

  • gold-standard tagged corpora and
  • untagged corpora

can be counted with sort | uniq -c, and the counts composed onto the analysis side and the surface side of the monodix automaton, respectively. In addition:

  • rules can assign weights to tag sequences: <sg><ins> += 123, <sg><gen> += 0.1, etc.
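Both ideas can be prototyped in a few lines. The sketch below assumes that counts are turned into weights as negative log relative frequencies, and that a rule adds its penalty whenever its tag sequence occurs in a reading. The function names and the tiny corpus are invented for illustration.

```python
import math
from collections import Counter

def weights_from_corpus(tokens):
    """Count tokens (like sort | uniq -c) and turn the counts into weights."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: -math.log(n / total) for tok, n in counts.items()}

def apply_tag_rules(reading, weight, rules):
    """Add a rule's penalty for every rule whose tag sequence occurs in the reading."""
    for tag_sequence, penalty in rules.items():
        if tag_sequence in reading:
            weight += penalty
    return weight

# Tiny hypothetical tagged corpus: 'dogs' is a noun three times and a verb once.
tagged = ["dogs<n><pl>"] * 3 + ["dogs<vblex><pri><p3><sg>"]
w = weights_from_corpus(tagged)
print(w["dogs<n><pl>"] < w["dogs<vblex><pri><p3><sg>"])  # True

# The rule examples from the list above.
rules = {"<sg><ins>": 123, "<sg><gen>": 0.1}
print(round(apply_tag_rules("foo<n><sg><gen>", 1.0, rules), 3))  # 1.1
```

For an untagged corpus the same counting applies to surface forms instead of analyses.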

Note:

  • HFST already supports weighted automata out of the box (and OpenFst, pyfst and libhfst-python are all good candidates for prototyping)
  • Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking can all reasonably use them
  • For lexical transfer, the weights could basically be IBM Model 1 mapped onto lttoolbox
  • Compare how SMT systems handle probabilities


Tasks

  • Extend lttoolbox (perhaps by writing a preprocessor for it) so that it can perform the morphological transformations currently done with HFST.
  • Write a tool that translates the current HFST format into the new lttoolbox format.
  • Proof of concept:
    • Come up with a new format that can express all of the features found in the Kazakh transducer;
    • Implement this format in Apertium;
    • Implement the Kazakh transducer in this format and integrate it in the English--Kazakh pair.

Coding challenge

  • Install Apertium (see Installation)
  • Write an HFST lexc/twolc to monodix++ converter (monodix++ does not exist yet), or write a monodix to lexc+twolc converter, or fix HFST's apertium2fst and related tools.
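As a starting point for the monodix to lexc direction, the sketch below converts only the simplest case: entries of the form `<i>stem</i><par n="..."/>` and paradigm entries with one `<p>` pair whose right side contains only tags. It assumes that tags become multichar symbols escaped with `%` and that an empty left side becomes `0`; everything else (twolc rules, regular expressions, `<re>`, nested `<par>`, pardef names containing characters that lexc does not accept) is ignored. The function name `monodix_to_lexc` is made up.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<dictionary>
  <pardefs>
    <pardef n="foo__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="foo"><i>foo</i><par n="foo__n"/></e>
  </section>
</dictionary>
"""

def monodix_to_lexc(dix_xml):
    """Convert a very restricted monodix subset to lexc source text."""
    root = ET.fromstring(dix_xml)
    tags = []       # tag names in order of first appearance
    lexicons = []   # one lexc LEXICON block per pardef
    for pardef in root.iter("pardef"):
        lines = ["LEXICON " + pardef.get("n")]
        for e in pardef.findall("e"):
            surface = e.findtext("p/l") or "0"   # 0 is the empty string in lexc
            names = [s.get("n") for s in e.find("p/r").iter("s")]
            for name in names:
                if name not in tags:
                    tags.append(name)
            lexical = "".join("%<" + name + "%>" for name in names)
            # lexc pairs are lexical:surface, the reverse of <l>/<r> in monodix.
            lines.append(lexical + ":" + surface + " # ;")
        lexicons.append("\n".join(lines))
    root_lines = ["LEXICON Root"]
    for e in root.find("section").findall("e"):
        root_lines.append(e.findtext("i") + " " + e.find("par").get("n") + " ;")
    header = "Multichar_Symbols\n" + " ".join("%<" + t + "%>" for t in tags)
    return "\n\n".join([header, "\n".join(root_lines)] + lexicons) + "\n"

print(monodix_to_lexc(SAMPLE))
```

The output would still need to be checked by compiling it with hfst-lexc and comparing the resulting analyses against those from lt-comp.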

Frequently asked questions

  • none yet, ask us something! :)

See also

Weights: