Ideas for Google Summer of Code/Extend lttoolbox to have the power of HFST

Some language pairs in Apertium use HFST, while most use Apertium's own lttoolbox. The reason is that writing morphologies for languages with features such as the vowel harmony found in Turkic languages is very hard in the format lttoolbox currently supports. Mixing HFST and lttoolbox makes some language pairs harder to develop. HFST also causes problems for non-command-line tools such as Mitzuli, Apy and OmegaT, which do not support HFST-based language pairs.


Missing features

The following HFST features are absent from Apertium's lttoolbox or unsatisfactorily implemented:

  • morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä etc.
  • probabilities and weights, e.g. dogs is likelier to be a noun than a verb; kuin (fin) is likelier to be the conjunction 'than' than the instructive plural 'with moons as tools'
  • separated morphotactics, e.g. a prefix that depends on a suffix
  • compounding, e.g. with deverbal nouns and other non-trivial cases
  • derivation, e.g. with loops (a monodix, for example, cannot contain a forward reference to a pardef, which makes many valid constructions illegal or very cumbersome)
  • trying to compile any complex morphology with lttoolbox will segfault, e.g. https://github.com/flammie/apertium-fin/blob/master/apertium-fin.fin.dix
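
The first two items in the list above can be illustrated with a minimal Python sketch. It is not lttoolbox or HFST syntax: the rule table, the function names and the weights are all hypothetical, and the "lower weight is better" convention is an assumption made for this example. The sketch only shows the kind of behaviour an extended lttoolbox would need: a context-dependent character change (n -> m before p) and a choice between competing analyses based on weights.

```python
import re

# Hypothetical rule table of (compiled context pattern, replacement).
# This is an illustration only, not a format that lttoolbox or HFST accepts.
RULES = [
    (re.compile(r"n(?=p)"), "m"),  # n -> m before p
]


def apply_rules(form: str) -> str:
    """Apply each morphographemic rule in order to an underlying form."""
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def best_analysis(analyses):
    """Return the analysis with the lowest weight.

    `analyses` is a list of (analysis, weight) pairs. Treating a lower
    weight as more likely is an assumption of this sketch.
    """
    return min(analyses, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    print(apply_rules("inpossible"))
    # The weights below are made up for illustration.
    print(best_analysis([("dog<n><pl>", 0.5), ("dog<vblex><pri><p3><sg>", 2.0)]))
```

A real implementation would compile such rules into the transducer rather than apply them as regular expressions at run time; the sketch only fixes the intended input and output.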

The following lttoolbox features are hacky in the HFST versions of the tools:

Tasks

  • Extend lttoolbox (perhaps writing a preprocessor for it) so that it can be used to do the morphological transformations currently done with HFST.
  • Write a converter that translates the current HFST format to the new lttoolbox format.
  • Proof of concept:
    • Come up with a new format that can express all of the features found in the Kazakh transducer;
    • Implement this format in Apertium;
    • Implement the Kazakh transducer in this format and integrate it into the English--Kazakh pair.

Coding challenge

  • Install Apertium (see Installation)
  • Write an HFST lexc/twolc to dix++ converter?
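
As a starting point for the converter, here is a rough Python sketch that handles only the simplest lexc entries of the form `upper:lower Continuation ;`. Since the dix++ format does not exist yet, it emits ordinary dix `<e>` elements. The function name is hypothetical, and the mapping choices are assumptions of this sketch: `+Tag` symbols become `<s n="tag"/>`, and a continuation lexicon becomes a `<par>` reference. Twolc rules, multichar symbol declarations, comments and everything else in lexc are ignored.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches a flat lexc entry such as "dog+N:dog Plural ;".
ENTRY = re.compile(r"^\s*([^\s:]+):(\S+)\s+(\S+)\s*;\s*$")


def convert_entry(line: str):
    """Convert one flat lexc entry to a dix <e> element, or return None.

    Assumptions of this sketch: the lexc upper side is "lemma+Tag+Tag",
    each tag maps to a lower-cased <s n="..."/>, and a continuation
    lexicon other than "#" maps to a <par n="..."/> reference.
    """
    match = ENTRY.match(line)
    if match is None:
        return None
    upper, lower, continuation = match.groups()
    lemma, *tags = upper.split("+")
    symbols = "".join(f"<s n={quoteattr(tag.lower())}/>" for tag in tags)
    pair = f"<p><l>{escape(lower)}</l><r>{escape(lemma)}{symbols}</r></p>"
    par = "" if continuation == "#" else f"<par n={quoteattr(continuation)}/>"
    return f"<e>{pair}{par}</e>"


if __name__ == "__main__":
    for entry in ["cat+N:cat # ;", "dog+N:dog Plural ;"]:
        print(convert_entry(entry))
```

The lexc upper (lexical) side goes into `<r>` and the lower (surface) side into `<l>`, following the usual monodix layout in which the left side is the surface form.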

Frequently asked questions

  • none yet, ask us something! :)

See also