User:Firespeaker/GSoC2014

From Apertium

Revision as of 07:51, 11 June 2014 by Firespeaker (talk | contribs) (→‎Monodixes)

Contents

1 Current status
2 To-do list
3 Plan of attack

Turkic pairs from nursery to release quality

Current status

Bidixes

kaz-kir ( stems)
tur-kir ( stems)
tur-uzb ( stems)

Monodixes

apertium-kaz - (~94.5% coverage, 36,595 stems) - production (original: 90.8%, 11,402)
apertium-kir - (~90.4% coverage, 14,424 stems) - working (original: 86.7%, 13,705)
apertium-tur - (~87.3% coverage, 17,221 stems) - working (original: 86.6%, 11,172)
apertium-uzb - (~82.9% coverage, 34,470 stems) - development (original: 82.9%, 3,957)
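The coverage figures above are presumably naive coverage: the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming the analyser output is in the Apertium stream format (`^surface/analysis1/analysis2$`, with unknown forms marked by a leading `*`) and contains no escaped `/`, `^` or `$` characters. The function name and the sample analyses are made up for illustration. The same pass tallies unknown forms, which helps when deciding which stems to add.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis/...$
# Assumption: no escaped "^", "$" or "/" inside the units.
LU = re.compile(r"\^(.*?)\$")

def coverage(stream):
    """Return (naive coverage, Counter of unknown surface forms)."""
    total = 0
    unknown = Counter()
    for lu in LU.findall(stream):
        parts = lu.split("/")
        surface, analyses = parts[0], parts[1:]
        total += 1
        # The analyser marks unknown forms with a leading asterisk.
        if not analyses or analyses[0].startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    return (known / total if total else 0.0), unknown

# Hypothetical analyser output; the tags are illustrative only.
sample = "^a/a<n><nom>$ ^b/b<v><iv>$ ^foo/*foo$ ^foo/*foo$"
cov, unknown = coverage(sample)
print(f"{cov:.1%}", unknown.most_common(1))
```

In practice the input string would be the analyser's output over a whole corpus rather than an inline sample.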

CG, lrx

We should start keeping track of the number of lrx rules
- better: keep track of per-token ambiguity: tokens( analyser | CG | biltrans | lrx ) / tokens( analyser | CG | biltrans )
Could we quantify CG progress with per-token ambiguity measures across corpora?
- tokens( analyser | CG ) / tokens( analyser )
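One possible reading of the ratios above is "average readings per token after a stage, divided by average readings per token before it". The sketch below computes that under the same stream-format assumption (`^surface/reading1/reading2$`, no escaped delimiters). The function names are hypothetical, and for biltrans output the fields after the first would be translations rather than analyses, but they are counted the same way.

```python
import re

# One lexical unit in Apertium stream format: ^first/reading/...$
# Assumption: no escaped "^", "$" or "/" inside the units.
LU = re.compile(r"\^(.*?)\$")

def readings_per_token(stream):
    """Mean number of readings per lexical unit (at least 1 per unit)."""
    counts = [max(len(lu.split("/")) - 1, 1) for lu in LU.findall(stream)]
    return sum(counts) / len(counts) if counts else 0.0

def ambiguity_ratio(after, before):
    """Ambiguity after a stage relative to before it; lower means more disambiguation."""
    base = readings_per_token(before)
    return readings_per_token(after) / base if base else 0.0

# Hypothetical output before and after CG; the tags are illustrative only.
before = "^a/a<n>/a<v>$ ^b/b<n>$"
after = "^a/a<n>$ ^b/b<n>$"
print(readings_per_token(before), ambiguity_ratio(after, before))
```

A ratio of 1.0 would mean the stage removed no readings. Tracking the value over the same corpus across revisions gives a rough progress measure.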

To-do list

morphological transducer work

vanilla transducers:

Increase apertium-uzb coverage to >90%
- expand morphology
- expand lexicon
Clean up apertium-tur, bring coverage to >90%
- fix some phonology
- clean up some morphotactics
- bring in line with apertium-kaz/etc.
Clean up apertium-kir, bring coverage to >90%
- improve morphotactics
- bring in line with apertium-kaz/etc.

hard forms:

Keep lists of difficult-to-classify forms and take a shot at them periodically with a concordancer

trimmed transducers:

Bring trimmed coverage close to 90% for each transducer

CG and lrx work

especially in need of attention:

apertium-uzb
apertium-kir

Grammar stuff

model basic transfer4 grammar for each language (with remapping rules to the other languages)
- Get Turkish relative "ki" to Kyrgyz relative clauses working
- Get transfer working for both directions

Testvoc

Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)

Plan of attack

Get a better corpus for Uzbek
Run transducers against corpora and add the most frequently missing stems and any missing morphology
Keep a regression test corpus
Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
Try as much as possible to work on everything in parallel, but have goals defined in series
Document tur-uzb better on the wiki
Run testvoc regularly on various categories for various translation directions
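The WER target in the plan can be checked with a word-level edit distance between a post-edited reference translation and the raw system output. The following is a minimal sketch, not the project's own evaluation tool: it assumes whitespace tokenisation and defines WER as (substitutions + insertions + deletions) / reference length. The function name and sample strings are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref) if ref else 0.0

# Hypothetical reference and system output.
print(wer("a b c d", "a x c"))
```

Averaging this over a fixed set of test texts after each change shows whether the texts stay under the 10% target.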
