Difference between revisions of "Ideas for Google Summer of Code/Extend lttoolbox to have the power of HFST"

m (→‎Tasks: Fix typo)
 
Line 10: Line 10:
   
 
* morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä etc.
 
* morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä etc.
* probabilities and weights, e.g. dogs is likelier to be a noun than a verb, kuin (fin) is likelier to be the conjunction 'than' than the instructive plural 'with moons as tools'
+
* ~~probabilities and weights, e.g. dogs is likelier to be a noun than a verb, kuin (fin) is likelier to be the conjunction 'than' than the instructive plural 'with moons as tools'~~ (done in GSoC 2018, needs more integration)
 
* separated morphotactics, e.g. prefix depending on suffix,
 
* separated morphotactics, e.g. prefix depending on suffix,
 
* compounding, e.g. with deverbal nouns or such non-trivial things
 
* compounding, e.g. with deverbal nouns or such non-trivial things
Line 23: Line 23:
 
=== Morphographemics ===
 
=== Morphographemics ===
   
Read twolc documentation?
+
Refer to twolc documentation
   
 
=== Weights (statistics etc.) ===
 
=== Weights (statistics etc.) ===
   
  +
Weights are supported in current apertium
On this page, weights are defined intuitively: larger is worse, and two weights can be added with + to give a larger weight. In theory they can be log probabilities, tropical semiring weights, or anything similar.
 
   
  +
== Separated morphotactics ==
The apertium pipeline is like so:
 
   
  +
For example, in German the past participle is formed by adding _ge_ in front of the main verb and something like _t_ or _en_ at the end; in Hungarian the superlative is formed by adding _leg_ in front of the adjective and _...bb_ at the end. In both cases, adding the prefix without the suffix creates a word form that is not acceptable. In HFST this is solved with _flag diacritics_; this is a horrible, hacky system, but it needs to be supported for legacy reasons.
# morphological analysis
 
## POS tagging
 
## disambiguation
 
# lexical transfer
 
## lexical selection
 
# structural transfer
 
# morphological generation
 
   
  +
== Compoundings and derivations ==
For most of the tasks the weights are well understood and clear cut:
 
 
# WFST morphological analysers (surface unigrams, compound n-grams, morph n-grams, lemma weights, pos tag sequence weights)
 
## N-gram FSTs (OpenGRM etc.)
 
## CG with weights!
 
# IBM model 1, apertium's current lexical selection module, etc.
 
# Something from SMT, simple weights for stuffs?
 
## there's whole separate gsoc project for this
 
# The WFST morphological analyser is the generator inverted
 
 
Weights can / should be scaled and combined. The formulas for combining them can be learnt from gold corpora unsupervisedly.
 
 
Task should include writing some formats for weights in dictionaries...
 
 
Writing weights for lemmas:
 
 
<pre>
 
<section id="main" type="standard">
 
<e lm="foo" w="1"><i>foo</i><par n="foo__n"/></e>
 
<e lm="foo" w="2"><i>foo</i><par n="bar__adj"/></e>
 
<e lm="foo" w="7"><i>foo</i><par n="baz__adv"/></e>
 
</section>
 
</pre>
 
 
Same stuff for other parts...?
 
 
<pre>
 
<pardef n="foo__n">
 
<e w="1"><p><l/><r><s n="sg"/></r></p></e>
 
<e w="4"><p><l>s</l><r><s n="pl"/></r></p></e>
 
</pardef>
 
</pre>
 
 
Morphological analysis with weights should output an ordered list (with weights):
 
 
<pre>
 
^foo/foo<n>~1/foo<adj>~2/foo<adv>~3$
 
</pre>
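The ordering described above can be sketched in a few lines of Python. This is only an illustration: `format_weighted` is a hypothetical helper, not part of lttoolbox, and the `~weight` suffix simply follows the example output shown above. Weights are taken directly from the `w` attributes of the lemma entries, and larger is treated as worse.

```python
def format_weighted(surface, analyses):
    """Render analyses as ^surface/analysis~weight/...$, best (lightest) first."""
    # Larger weight is worse, so sort in ascending order of weight.
    ranked = sorted(analyses, key=lambda pair: pair[1])
    body = "/".join(f"{analysis}~{weight:g}" for analysis, weight in ranked)
    return f"^{surface}/{body}$"


# Hypothetical input: (analysis, weight) pairs in arbitrary order.
analyses = [("foo<adv>", 7), ("foo<n>", 1), ("foo<adj>", 2)]
result = format_weighted("foo", analyses)
print(result)
```

If weights from several sources (lemma, paradigm entry) are to be combined, adding them before sorting matches the "larger is worse, combine with +" convention used on this page.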
 
 
There are a couple of easy ways to obtain weights:
 
 
* gold-standard tagged corpora and
 
* untagged corpora
 
 
can be counted with sort | uniq -c and composed onto the analysis and surface sides of the monodix automaton, respectively. And:
 
 
* rules can assign weights to tag sequences: {{tag|sg><ins}} += 123, {{tag|sg><gen}} += 0.1, etc.
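As a rough sketch of the corpus-counting idea, the following Python turns `sort | uniq -c` style counts into weights. Using the negative log of the relative frequency is an assumption here (one common choice that fits "larger is worse" and addition with +); the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def counts_to_weights(tokens):
    """Map each distinct token to -log(relative frequency)."""
    counts = Counter(tokens)  # equivalent of sort | uniq -c
    total = sum(counts.values())
    # Frequent items get small weights, rare items get large weights.
    return {form: -math.log(n / total) for form, n in counts.items()}


# Toy tagged corpus: "dogs" is seen as a noun three times and as a verb once.
corpus = "dogs<n> dogs<n> dogs<n> dogs<vblex>".split()
weights = counts_to_weights(corpus)
print(weights)
```

The same function applies to an untagged corpus of surface forms; only the side of the automaton the weights are composed onto differs.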
 
 
Note:
 
 
* HFST supports weighted automata out of the box already (and OpenFst and pyfst and libhfst-python are all good candidates for prototyping)
 
 
* Weights should ideally be supported throughout the pipeline: analysis, disambiguation, lexical transfer/selection, translation and chunking can all make reasonable use of them
 
 
* For the lexical transfer the weights could basically be model 1 onto lttoolbox.
 
 
* compare how SMT systems handle probabilities
 
   
  +
For example, in many languages you can make new words by chaining one form of a word with any form of another word. These chains have restrictions such as Noun+Genitive-Noun or Noun-s-Noun, but the Noun can in turn be a Verb+Participle. Apertium's limited compounding support does not cover many of these cases.
   
 
==Tasks==
 
==Tasks==

Latest revision as of 11:45, 27 February 2019

Some language pairs in Apertium use HFST, whereas most use Apertium's own lttoolbox. This is because writing morphologies for languages with features such as the vowel harmony found in Turkic languages is very hard in the format currently supported by lttoolbox. The mixture of HFST and lttoolbox makes some language pairs harder to develop. Another problem is that non-command-line tools such as Mitzuli, Apy and OmegaT fail to support HFST-based language pairs.


Missing features

The following HFST features are absent from Apertium's lttoolbox or unsatisfactorily implemented:

  • morphographemics, i.e. rules governing character changes when affixes are attached, e.g. n -> m before p, a -> ä after ä etc.
  • ~~probabilities and weights, e.g. dogs is likelier to be a noun than a verb, kuin (fin) is likelier to be the conjunction 'than' than the instructive plural 'with moons as tools'~~ (done in GSoC 2018, needs more integration)
  • separated morphotactics, e.g. prefix depending on suffix,
  • compounding, e.g. with deverbal nouns or such non-trivial things
  • derivation, e.g. with loops (a monodix, for example, cannot contain a forward reference to a pardef, which makes many valid constructions illegal or very cumbersome)
  • trying to compile any complex morphology with lttoolbox will segfault, e.g. https://github.com/flammie/apertium-fin/blob/master/apertium-fin.fin.dix

The following features from lttoolbox are hacky in the HFST versions of the tools:

Morphographemics

Refer to the twolc documentation.
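To make the two example rules from the feature list concrete (n -> m before p, a -> ä after ä), here is a minimal Python sketch. It is not twolc: real two-level rules are constraints applied in parallel, while this applies plain sequential rewrites. The function names and test words are assumptions chosen for illustration.

```python
import re


def harmonise(form):
    """Turn every 'a' into 'ä' once an 'ä' has been seen earlier in the word."""
    out, seen_front = [], False
    for ch in form:
        if ch == "ä":
            seen_front = True
        elif ch == "a" and seen_front:
            ch = "ä"
        out.append(ch)
    return "".join(out)


def apply_rules(form):
    """Apply both toy rules to a concatenated stem+affix string."""
    form = re.sub(r"n(?=p)", "m", form)  # n -> m before p
    return harmonise(form)               # a -> ä after ä


print(apply_rules("inpossible"))
print(apply_rules("päässa"))
```

The input is the naive concatenation of morphemes; the rules then repair the character sequence, which is the part of the job that a plain monodix has to spell out entry by entry.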

Weights (statistics etc.)

Weights are supported in current Apertium.

Separated morphotactics

For example, in German the past participle is formed by adding _ge_ in front of the main verb and something like _t_ or _en_ at the end; in Hungarian the superlative is formed by adding _leg_ in front of the adjective and _...bb_ at the end. In both cases, adding the prefix without the suffix creates a word form that is not acceptable. In HFST this is solved with _flag diacritics_; this is a horrible, hacky system, but it needs to be supported for legacy reasons.
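The prefix/suffix dependency can be illustrated with a toy interpreter for a small subset of flag diacritic operators (P = set, R = require, D = disallow). This is a sketch under assumptions, not HFST code: the `accepts` function, the `PTCP` feature name and the example paths are all made up for illustration.

```python
import re

# Matches symbols such as @P.PTCP.ON@, @R.PTCP.ON@ or @D.PTCP@.
FLAG = re.compile(r"@([PRD])\.([A-Z]+)(?:\.([A-Z]+))?@")


def accepts(path):
    """Return the surface string if every flag on the path is satisfied, else None."""
    features, surface = {}, []
    for symbol in path:
        m = FLAG.fullmatch(symbol)
        if m is None:
            surface.append(symbol)  # ordinary morph, contributes to the surface form
            continue
        op, feat, value = m.groups()
        if op == "P":               # set feature to value
            features[feat] = value
        elif op == "R" and features.get(feat) != value:
            return None             # required value is missing
        elif op == "D" and feat in features:
            return None             # feature must not be set at all
    return "".join(surface)


print(accepts(["ge", "@P.PTCP.ON@", "mach", "@R.PTCP.ON@", "t"]))
print(accepts(["ge", "@P.PTCP.ON@", "mach", "@D.PTCP@", "st"]))
```

The participle suffix requires the feature that only the prefix sets, and the finite suffix refuses it, so the prefix and suffix paradigms can stay separate while bad combinations are filtered out at lookup time.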

Compoundings and derivations

For example, in many languages you can make new words by chaining one form of a word with any form of another word. These chains have restrictions such as Noun+Genitive-Noun or Noun-s-Noun, but the Noun can in turn be a Verb+Participle. Apertium's limited compounding support does not cover many of these cases.
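A Noun-s-Noun restriction of the kind mentioned above can be sketched as a recursive segmenter. Everything here is an assumption for illustration: the `segment` function, the tiny lowercase lexicon and the German example word are not taken from any Apertium dictionary, and a real implementation would work on the transducer rather than on strings.

```python
def segment(word, nouns, linkers=("", "s")):
    """Return every split of `word` into known nouns joined by an allowed linker."""
    results = [[word]] if word in nouns else []
    for i in range(1, len(word)):
        head, rest = word[:i], word[i:]
        if head not in nouns:
            continue
        for linker in linkers:
            # The linker must be followed by at least one more character.
            if rest.startswith(linker) and len(rest) > len(linker):
                for tail in segment(rest[len(linker):], nouns, linkers):
                    parts = [head, linker] if linker else [head]
                    results.append(parts + tail)
    return results


nouns = {"arbeit", "zimmer"}
print(segment("arbeitszimmer", nouns))
```

Because the function recurses on the remainder, compounds of more than two parts are covered as well; adding a pattern such as Noun+Genitive-Noun would mean checking tags on each part rather than bare strings.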

Tasks

  • Extend lttoolbox (perhaps writing a preprocessor for it) so that it can be used to do the morphological transformations currently done with HFST.
  • And yes, of course, write something that translates the current HFST format to the new lttoolbox format.
  • Proof of concept:
    • Come up with a new format that can express all of the features found in the Kazakh transducer;
    • Implement this format in Apertium;
    • Implement the Kazakh transducer in this format and integrate it in the English--Kazakh pair.

Coding challenge

  • Install Apertium (see Installation)
  • Write an HFST lexc/twolc to (non-existing) monodix++ converter, or write a monodix to lexc+twolc converter, or fix HFST's apertium2fst and related tools.

Frequently asked questions

  • none yet, ask us something! :)

See also

Weights: