Difference between revisions of "User:Firespeaker/GSoC2014"

Line 1: Line 1:
{{TOCD}}
== Current status ==
=== Pairs ===
=== Bidixes ===
* [[kaz-kir]] ({{#lst:Apertium-kaz-kir/stats|kaz-kir-stems}} stems)
* [[tur-kir]] ({{#lst:Apertium-tur-kir/stats|tur-kir-stems}} stems)
* [[tur-uzb]] ({{#lst:Apertium-tur-uzb/stats|tur-uzb-stems}} stems)


=== Transducers ===
=== Monodixes ===
* [[apertium-kaz]] - (~{{#lst:Apertium-kaz/stats/average}}% coverage, {{#lst:Apertium-kaz/stats|stems}} stems) - production
* [[apertium-kir]] - (~{{#lst:Apertium-kir/stats/average}}% coverage, {{#lst:Apertium-kir/stats|stems}} stems) - working
Line 45: Line 46:
** Get transfer working for both directions


=== Testvoc ===
* Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
* See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)


== Plan of attack ==
Line 53: Line 57:
* Try as much as possible to work on everything in parallel, but have goals defined in series
* Document tur-uzb better on the wiki
* testvoc various categories for various translation directions regularly

Revision as of 19:26, 23 January 2014

Current status

Bidixes

Monodixes

CG, lrx

  • We should start keeping track of the number of lrx rules.
  • We could quantify CG progress with per-token ambiguity measures across corpora?
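
One way the per-token ambiguity idea might be measured is sketched below. It assumes the analyser (or CG) output is in Apertium stream format, where each token looks like `^surface/reading1/reading2$` and an unknown word has a single reading starting with `*`. The function names and the sample forms (`foo`, `bar`, `qux`) are hypothetical placeholders, and escaped characters such as `\/` inside a token are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# (simplification: backslash-escaped '/', '^' and '$' are not handled)
LU = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def analyses_per_token(stream):
    """Return the number of readings for each known token in the stream."""
    counts = []
    for match in LU.finditer(stream):
        readings = match.group(2).split("/")
        if readings[0].startswith("*"):
            continue  # unknown word: it has no analyses to disambiguate
        counts.append(len(readings))
    return counts


def mean_ambiguity(stream):
    """Average number of readings per known token (1.0 = fully disambiguated)."""
    counts = analyses_per_token(stream)
    return sum(counts) / len(counts) if counts else 0.0


if __name__ == "__main__":
    sample = "^foo/foo<n>/foo<v>$ ^bar/bar<adj>$ ^qux/*qux$"
    print(mean_ambiguity(sample))
```

Running this on the same corpus before and after the CG step, and again after each round of new rules, would give a single figure to track over time. A value that moves toward 1.0 suggests the rules are removing readings.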

To-do list

morphological transducer work

vanilla transducers:

  • Increase apertium-uzb coverage to >90%
    • expand morphology
    • expand lexicon
  • Clean up apertium-tur, bring coverage to >90%
    • fix some phonology
    • clean up some morphotactics
    • bring in line with apertium-kaz/etc.
  • Clean up apertium-kir, bring coverage to >90%
    • improve morphotactics
    • bring in line with apertium-kaz/etc.
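
The coverage targets above can be checked with a naive coverage count. The following is only a sketch: it assumes the corpus has already been passed through the transducer and that the output is in Apertium stream format (`^surface/analyses$`, with unknown words marked by a leading `*`). `coverage_report` is a hypothetical helper name, and the sample tokens are placeholders.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# (simplification: backslash-escaped characters are not handled)
LU = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage_report(stream, top_n=10):
    """Return (coverage in percent, most frequent unknown surface forms)."""
    total = 0
    unknown = Counter()
    for match in LU.finditer(stream):
        total += 1
        surface, readings = match.group(1), match.group(2)
        if readings.startswith("*"):
            # Fold case so sentence-initial forms are counted together
            unknown[surface.lower()] += 1
    known = total - sum(unknown.values())
    coverage = 100.0 * known / total if total else 0.0
    return coverage, unknown.most_common(top_n)


if __name__ == "__main__":
    sample = "^a/a<det>$ ^cat/cat<n>$ ^zzz/*zzz$ ^Zzz/*Zzz$"
    coverage, missing = coverage_report(sample)
    print(f"{coverage:.1f}% coverage; most frequent unknowns: {missing}")
```

The list of most frequent unknown forms doubles as a work queue for lexicon expansion, since adding the top entries raises coverage fastest.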

hard forms:

  • Keep lists of difficult-to-classify forms and take a shot at them periodically with a concordancer

trimmed transducers:

  • bring trimmed coverage close to 90% for each transducer

CG and lrx work

especially in need of attention:

  • Apertium-uzb
  • Apertium-kir

Grammar stuff

  • model basic transfer4 grammar for each language (with remapping rules to the other languages)
    • Get Turkish relative "ki" to Kyrgyz relative clauses working
    • Get transfer working for both directions

Testvoc

  • Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
  • See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)

Plan of attack

  • Get a better corpus for Uzbek
  • Run transducers against corpora and add most frequently missing stems and any morphology
  • Keep a regression test corpus
  • Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
  • Try as much as possible to work on everything in parallel, but have goals defined in series
  • Document tur-uzb better on the wiki
  • testvoc various categories for various translation directions regularly
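
For the <10% WER goal, the measure itself is simple to state: word-level edit distance between the machine output and a post-edited reference, divided by the reference length. The sketch below is a standalone illustration of that definition, not the tool used in the project; the function name `wer` and whitespace tokenisation are assumptions.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # reference word missing
                cur[j - 1] + 1,                        # extra hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a x c"))  # one substitution + one missing word
```

A text meets the goal when `wer(post_edited, machine_output) < 0.10`. Keeping the same test texts in the regression corpus makes successive scores comparable.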