Ideas for Google Summer of Code/Extend lttoolbox to have the power of HFST
Some language pairs in Apertium use HFST, whereas most use Apertium's own lttoolbox. The reason is that writing morphologies for languages with features such as the vowel harmony found in Turkic languages is very hard in the format lttoolbox currently supports. Mixing HFST and lttoolbox makes some language pairs harder to develop. HFST also causes problems for non-command-line tools such as Mitzuli, Apy and OmegaT, which do not support HFST-based language pairs.
The following HFST features are absent from Apertium's lttoolbox or are implemented unsatisfactorily:
- morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä etc.
- ~~probabilities and weights, e.g. dogs is likelier a noun than a verb, and kuin (fin) is likelier the conjunction 'than' than the instructive plural 'with moons as tools'~~ (done in GSoC 2018, needs more integration)
- separated morphotactics, e.g. a prefix depending on a suffix
- compounding, e.g. with deverbal nouns or other non-trivial constructions
- derivation, e.g. with loops (monodix, for example, cannot make a forward reference to a pardef, which makes many valid constructions illegal or very cumbersome)
- stability: trying to compile any complex morphology with lttoolbox segfaults, e.g. https://github.com/flammie/apertium-fin/blob/master/apertium-fin.fin.dix
The following lttoolbox features are hacky in the HFST versions of the tools:
- tokenise-as-you-analyse: cf. https://github.com/hfst/hfst/issues?q=is%3Aissue+is%3Aopen+label%3Aproc
- compound-only-L tags
For morphographemics, refer to the twolc documentation.
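As a rough illustration of what morphographemic rules do, the Python sketch below applies the two example rules mentioned earlier (n -> m before p, a -> ä after ä) to a concatenation of morphemes. It uses ordered regular-expression rewriting, which is not how twolc works (twolc rules are parallel two-level constraints). The `>` boundary marker, the `RULES` table and `surface_form` are hypothetical names chosen for this example.

```python
import re

BOUNDARY = ">"  # hypothetical morpheme boundary marker

# Hypothetical rule table: (pattern, replacement), tried in order.
RULES = [
    # n -> m before p, optionally across a morpheme boundary
    (re.compile(r"n(?=>?p)"), "m"),
    # a -> ä after ä, with only non-vowel characters in between
    (re.compile(r"(ä[^aeiouyäö]*)a"), r"\1ä"),
]

def surface_form(morphemes):
    """Join morphemes and rewrite until no rule changes the form."""
    form = BOUNDARY.join(morphemes)
    previous = None
    while previous != form:
        previous = form
        for pattern, replacement in RULES:
            form = pattern.sub(replacement, form)
    return form.replace(BOUNDARY, "")

print(surface_form(["in", "possible"]))
print(surface_form(["kylä", "ssa"]))
```

A real solution would compile such rules into the transducer itself rather than post-process strings, but the example shows the kind of context-dependent character change that the lttoolbox format cannot currently express directly.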
Weights (statistics etc.)
Weights are now supported in Apertium, but need further integration.
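To make the idea concrete, here is a minimal Python sketch of weighted disambiguation for the "dogs" example. It assumes tropical-style weights, where a lower weight means a likelier analysis and a weight can be derived as the negative log of a probability. The probabilities and function names are made up for illustration and do not come from any real model or from the lttoolbox API.

```python
import math

def weight_from_probability(p):
    """Convert a probability into a weight; likelier means lower (assumed)."""
    return -math.log(p)

def rank_analyses(analyses):
    """Sort (analysis, weight) pairs from most to least likely."""
    return sorted(analyses, key=lambda item: item[1])

# Hypothetical probabilities for the two readings of "dogs".
dogs = [
    ("dog<vblex><pri><p3><sg>", weight_from_probability(0.1)),
    ("dog<n><pl>", weight_from_probability(0.9)),
]

best, _ = rank_analyses(dogs)[0]
print(best)
```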
An example of separated morphotactics: in German the past participle is formed by adding _ge_ in front of the main verb and something like _t_ or _en_ at the end; in Hungarian the superlative is formed by adding _leg_ in front of the adjective and _...bb_ at the end. In both cases, adding the prefix without the suffix creates a word-form that is not acceptable. HFST solves this with _flag diacritics_. This is a horribly hacky system, but it needs to be supported for legacy reasons.
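The following Python sketch shows how flag diacritics can tie a prefix to a suffix. It handles only three of the HFST operations (`P` to set a feature, `R` to require it, `D` to disallow it) and walks a single path given as a list of symbols. The feature name `PART`, the tiny German paths and the `accept` function are assumptions for illustration, not part of HFST or lttoolbox.

```python
import re

# Matches symbols such as @P.PART.ON@, @R.PART.ON@ or @D.PART@.
FLAG = re.compile(r"@([PRD])\.([A-Z]+)(?:\.([A-Z]+))?@")

def accept(path):
    """Return the surface string if all flags on the path agree, else None."""
    features = {}
    surface = []
    for symbol in path:
        match = FLAG.fullmatch(symbol)
        if match is None:
            surface.append(symbol)  # ordinary symbol, contributes to output
            continue
        op, feature, value = match.groups()
        current = features.get(feature)
        # With no value given, R/D test whether the feature is set at all.
        is_match = current is not None if value is None else current == value
        if op == "P":
            features[feature] = value
        elif op == "R" and not is_match:
            return None
        elif op == "D" and is_match:
            return None
    return "".join(surface)

# ge- sets PART, the participle -t requires it, the finite -e disallows it.
print(accept(["@P.PART.ON@", "ge", "mach", "@R.PART.ON@", "t"]))
print(accept(["@P.PART.ON@", "ge", "mach", "@D.PART.ON@", "e"]))
```

The first path yields the participle, while the second (prefix without the matching suffix) is rejected. Any replacement format would need to express the same dependency, ideally without threading invisible symbols through the transducer.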
Compoundings and derivations
In many languages you can make new words by chaining one form of a word with any form of another word. These compounds have restrictions like Noun+Genitive-Noun or Noun-s-Noun, but the Noun itself can also be a Verb+Participle. Apertium's limited compounding support does not cover many of these cases.
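As a small illustration of the Noun-s-Noun pattern, the Python sketch below splits a word into known nouns joined by an optional linking element. The lexicon, the linker list and `split_compound` are hypothetical and far simpler than a real compounding grammar, which would also need to check the form and part of speech of each component.

```python
# Hypothetical toy lexicon and linking elements (German-like).
NOUNS = {"arbeit", "zeit", "haus", "tür"}
LINKERS = ("", "s")

def split_compound(word):
    """Return every reading of `word` as Noun (linker Noun)*."""
    word = word.lower()
    readings = []
    if word in NOUNS:
        readings.append([word])
    for i in range(1, len(word)):
        head = word[:i]
        if head not in NOUNS:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            # Recurse on what follows the linking element.
            for tail in split_compound(rest[len(linker):]):
                parts = [head] + ([linker] if linker else []) + tail
                readings.append(parts)
    return readings

print(split_compound("Arbeitszeit"))
print(split_compound("Haustür"))
```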
- Extend lttoolbox (perhaps by writing a preprocessor for it) so that it can perform the morphological transformations currently done with HFST.
- Write a tool that translates the current HFST format to the new lttoolbox format.
- Proof of concept:
  - Come up with a new format that can express all of the features found in the Kazakh transducer;
  - Implement this format in Apertium;
  - Implement the Kazakh transducer in this format and integrate it in the English--Kazakh pair.
- Install Apertium (see Installation)
- Write a converter from HFST lexc/twolc to the (not yet existing) monodix++ format, or write a monodix to lexc+twolc converter, or fix HFST's apertium2fst and related tools.
Frequently asked questions
- none yet, ask us something! :)