User:Firespeaker/GSoC2014

From Apertium
Revision as of 07:18, 14 December 2014 by StemCounterBot (talk | contribs)

Turkic pairs from nursery to release quality

Current status

Bidixes

Monodixes

  • apertium-kaz - (~94.5% coverage, 36,595 stems) - production (original: 90.8%, 11,402)
  • apertium-kir - (~90.4% coverage, 14,424 stems) - working (original: 86.7%, 13,705)
  • apertium-tur - (~87.3% coverage, 17,221 stems) - working (original: 86.6%, 11,172)
  • apertium-uzb - (~82.9% coverage, 34,470 stems) - development (original: 82.9%, 3,957)
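
Coverage figures like the ones above are usually the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming the analyser output is in Apertium stream format (`^surface/analysis$`, with unknown forms marked by a leading `*`). The function name `coverage` is hypothetical, this is not the script that produced the numbers on this page, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def coverage(stream_text):
    """Return (known, total) token counts for analyser output."""
    known = 0
    total = 0
    for match in LEXICAL_UNIT.finditer(stream_text):
        analyses = [a for a in match.group(2).split("/") if a]
        total += 1
        # Assumption: an unknown form is output as ^form/*form$.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known, total


sample = "^алма/алма<n><nom>$ ^жақсы/жақсы<adj>/жақсы<adv>$ ^qwerty/*qwerty$"
known, total = coverage(sample)
print(f"{100 * known / total:.1f}% coverage ({known}/{total} tokens)")
```

Returning the raw counts rather than a percentage makes it easy to sum results over several corpus files before dividing.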

CG, lrx

  • We should start keeping track of the number of lrx rules
    • better: keep track of per-token ambiguity: tokens( analyser | CG | biltrans | lrx ) / tokens( analyser | CG | biltrans )
  • We could quantify CG progress with per-token ambiguity measures across corpora?
    • tokens( analyser | CG ) / tokens( analyser )
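
The ratios above can be sketched as follows. This reads `tokens( X )` as the mean number of readings per token in the output of pipeline `X`, which is an assumption about the notation rather than something this page defines. The function names are hypothetical, and the input is assumed to be Apertium stream format, where the alternatives for a token follow the first `/`.

```python
import re

# One lexical unit in Apertium stream format: ^first/alt1/alt2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def readings_per_token(stream_text):
    """Mean number of slash-separated readings per lexical unit."""
    counts = []
    for match in LEXICAL_UNIT.finditer(stream_text):
        readings = [r for r in match.group(2).split("/") if r]
        # A unit with no readings still counts as one token with one reading.
        counts.append(max(len(readings), 1))
    return sum(counts) / len(counts) if counts else 0.0


def ambiguity_ratio(before, after):
    """Ambiguity after a disambiguation step, relative to before it."""
    base = readings_per_token(before)
    return readings_per_token(after) / base if base else 0.0


analyser_out = "^a/a<n>/a<v>$ ^b/b<n>/b<adj>$"
cg_out = "^a/a<n>$ ^b/b<n>/b<adj>$"
print(ambiguity_ratio(analyser_out, cg_out))  # lower means more ambiguity removed
```

A ratio of 1.0 would mean the step removed nothing; tracking the value per corpus over time gives a rough progress measure for CG or lrx rules.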

To-do list

morphological transducer work

vanilla transducers:

  • Increase apertium-uzb coverage to >90%
    • expand morphology
    • expand lexicon
  • Clean up apertium-tur, bring coverage to >90%
    • fix some phonology
    • clean up some morphotactics
    • bring in line with apertium-kaz/etc.
  • Clean up apertium-kir, bring coverage to >90%
    • improve morphotactics
    • bring in line with apertium-kaz/etc.

hard forms:

  • Keep lists of difficult-to-classify forms and take a shot at them periodically with a concordancer

trimmed transducers:

  • bring trimmed coverage close to 90% for each transducer

CG and lrx work

especially in need of attention:

  • apertium-uzb
  • apertium-kir

Grammar stuff

  • model basic transfer4 grammar for each language (with remapping rules to the other languages)
    • Get Turkish relative "ki" to Kyrgyz relative clauses working
    • Get transfer working for both directions

Testvoc

  • Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
  • See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)

Plan of attack

  • Get a better corpus for Uzbek
  • Run transducers against corpora and add the most frequently missing stems and any missing morphology
  • Keep a regression test corpus
  • Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
  • Try as much as possible to work on everything in parallel, but have goals defined in series
  • Document tur-uzb better on the wiki
  • Run testvoc on various categories for various translation directions regularly
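
For the WER target in the plan above, word error rate is the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The sketch below is a plain Levenshtein implementation over whitespace-separated tokens. The function name `word_error_rate` is hypothetical, and this is not the evaluation script used for these pairs.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j]: edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.1f}%")
```

Running this over every sentence of a regression corpus and checking the result against the 10% threshold is one way to make the "frequent WER tests" step repeatable.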