Difference between revisions of "User:Firespeaker/GSoC2014"
Firespeaker (talk | contribs) (Created page with "== Current status == === Pairs === * kaz-kir ({{#lst:Apertium-kaz-kir/stats|kaz-kir-stems}} stems) * tur-kir ({{#lst:Apertium-tur-kir/stats|tur-kir-stems}} stems) * [[...") |
Firespeaker (talk | contribs) |
Revision as of 19:20, 23 January 2014
Current status
Pairs
Transducers
- apertium-kaz - (~94.5% coverage, 36,595 stems) - production
- apertium-kir - (~90.4% coverage, 14,424 stems) - working
- apertium-tur - (~87.3% coverage, 17,221 stems) - working
- apertium-uzb - (~82.9% coverage, 34,470 stems) - development
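The coverage figures above are usually understood as the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming the analyser output is in Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*`) and contains no escaped `/` or `$` characters. The function name, the sample text, and its tags are illustrative only and are not taken from the actual transducers.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: no escaped '/' or '$' inside the unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An unknown word comes out as ^form/*form$, so its analysis starts with '*'.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Made-up sample output; the tags are for illustration only.
sample = "^үй/үй<n><nom>$ ^ххх/*ххх$ ^бар/бар<adj>/бар<v><iv>$ ^./.<sent>$"
print(f"{coverage(sample):.1%}")  # 3 of 4 tokens analysed -> 75.0%
```

Running this over the full analysed corpus for each language would give numbers comparable to the percentages listed above, provided the same tokenisation is used each time.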
CG, lrx
- We should start keeping track of the number of lrx rules.
- We could quantify CG progress with per-token ambiguity measures across corpora?
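One possible per-token ambiguity measure is the mean number of analyses per known token, compared before and after the CG is applied. The sketch below assumes Apertium stream format without escaped `/` or `$`. The function name and the sample strings are hypothetical, and skipping unknown tokens is a design choice rather than an established convention.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def mean_ambiguity(analysed_text):
    """Return the mean number of analyses per analysed (known) token."""
    counts = []
    for match in LEXICAL_UNIT.finditer(analysed_text):
        analyses = match.group(2).split("/")
        # Unknown words (^form/*form$) are skipped so they do not dilute the figure.
        if analyses[0].startswith("*"):
            continue
        counts.append(len(analyses))
    return sum(counts) / len(counts) if counts else 0.0


# Made-up output before and after disambiguation; tags are illustrative.
before_cg = "^бар/бар<adj>/бар<v><iv>$ ^үй/үй<n><nom>$"
after_cg = "^бар/бар<adj>$ ^үй/үй<n><nom>$"
print(mean_ambiguity(before_cg), mean_ambiguity(after_cg))  # 1.5 1.0
```

A value of 1.0 after CG means every known token is fully disambiguated. Tracking the figure per corpus over time would show whether new rules are actually reducing ambiguity.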
To-do list
morphological transducer work
vanilla transducers:
- Increase apertium-uzb coverage to >90%
- expand morphology
- expand lexicon
- Clean up apertium-tur, bring coverage to >90%
- fix some phonology
- clean up some morphotactics
- bring in line with apertium-kaz/etc.
- Clean up apertium-kir, bring coverage to >90%
- improve morphotactics
- bring in line with apertium-kaz/etc.
hard forms:
- Keep lists of difficult-to-classify forms and take a shot at them periodically with a concordancer
trimmed transducers:
- bring trimmed coverage close to 90% for each transducer
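For the lexicon-expansion items above, a frequency list of unanalysed forms shows which stems would raise coverage fastest. This is a minimal sketch, assuming analyser output in Apertium stream format where unknown words appear as `^form/*form$`. The function name and sample are hypothetical, and lowercasing the forms before counting is an assumption that may not suit every language.

```python
import re
from collections import Counter

# Matches only unknown lexical units: ^form/*form$
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*[^$]*\$")


def most_frequent_unknowns(analysed_text, n=10):
    """Return the n most frequent unanalysed surface forms with their counts."""
    counts = Counter(
        match.group(1).lower() for match in UNKNOWN_UNIT.finditer(analysed_text)
    )
    return counts.most_common(n)


# Made-up sample output for illustration.
sample = "^foo/*foo$ ^үй/үй<n><nom>$ ^Foo/*Foo$ ^bar/*bar$"
print(most_frequent_unknowns(sample, n=2))  # [('foo', 2), ('bar', 1)]
```

Forms that stay near the top of the list after several passes are natural candidates for the hard-forms list.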
CG and lrx work
especially in need of attention:
- Apertium-uzb
- Apertium-kir
Grammar stuff
- model basic transfer4 grammar for each language (with remapping rules to the other languages)
- Get Turkish relative "ki" to Kyrgyz relative clauses working
- Get transfer working for both directions
Plan of attack
- Get a better corpus for Uzbek
- Run transducers against corpora and add the most frequently missing stems and any missing morphology
- Keep a regression test corpus
- Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
- Try as much as possible to work on everything in parallel, but have goals defined in series
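The WER target in the plan can be checked with a word-level edit distance between the machine translation and a post-edited reference. The sketch below is one way to compute it, assuming whitespace tokenisation and a single reference per text. The function name and the example sentences are illustrative and do not describe any existing evaluation script.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
# One substitution plus one missing word over six reference words.
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 33.3%
```

Running this over each text in the regression test corpus after every grammar or dix change would show whether the <10% goal holds consistently.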