Difference between revisions of "User:Firespeaker/GSoC2014"

Latest revision as of 07:18, 14 December 2014

Current status

Bidixes

kaz-kir (? stems)
tur-kir (7,123 stems)
tur-uzb (3,519 stems)

Monodixes

apertium-kaz - (~94.5% coverage, 36,595 stems) - production (original: 90.8%, 11,402)
apertium-kir - (~90.4% coverage, 14,424 stems) - working (original: 86.7%, 13,705)
apertium-tur - (~87.3% coverage, 17,221 stems) - working (original: 86.6%, 11,172)
apertium-uzb - (~82.9% coverage, 34,470 stems) - development (original: 82.9%, 3,957)
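
The coverage figures above are presumably naive coverage: the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming the analyser output is in the Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*`). The function name `naive_coverage` is hypothetical, and escaped delimiters inside tokens are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped "^", "$" and "/" inside a token are not handled.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed: str) -> float:
    """Return the percentage of tokens that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words appear as ^form/*form$, so "/*" marks a missing analysis.
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)


if __name__ == "__main__":
    sample = "^men/men<prn>$ ^xyz/*xyz$ ^kitap/kitap<n>$ ^bar/bar<adj>$"
    print(f"{naive_coverage(sample):.1f}% coverage")
```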

CG, lrx

We should start keeping track of the number of lrx rules
- better: keep track of per-token ambiguity: tokens( analyser | CG | biltrans | lrx ) / tokens( analyser | CG | biltrans )
We could quantify CG progress with per-token ambiguity measures across corpora?
- tokens( analyser | CG ) / tokens( analyser )
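
One possible reading of the ratios above is "mean number of readings per token after a stage, divided by the same figure before it". The sketch below takes that interpretation, which is an assumption rather than a definition given here. It expects both inputs in the Apertium stream format; the function names `readings_per_token` and `ambiguity_ratio` are hypothetical, and escaped delimiters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def readings_per_token(stream: str) -> float:
    """Mean number of readings per token; unknown tokens count as one reading."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    total = 0
    for unit in units:
        readings = unit.split("/")[1:]  # drop the surface form
        total += max(len(readings), 1)
    return total / len(units)


def ambiguity_ratio(after: str, before: str) -> float:
    """Ratio of per-token ambiguity after a stage to ambiguity before it.

    Values below 1.0 mean the stage (e.g. CG or lrx) removed readings.
    """
    baseline = readings_per_token(before)
    if baseline == 0.0:
        raise ValueError("'before' stream contains no tokens")
    return readings_per_token(after) / baseline


if __name__ == "__main__":
    analyser_out = "^a/a<n>/a<v>$ ^b/b<n>$"
    cg_out = "^a/a<n>$ ^b/b<n>$"
    print(round(ambiguity_ratio(cg_out, analyser_out), 3))
```

Tracking this ratio per corpus over time would show whether new CG or lrx rules are actually reducing ambiguity, independently of how many rules exist.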

To-do list

morphological transducer work

vanilla transducers:

Increase apertium-uzb coverage to >90%
- expand morphology
- expand lexicon
Clean up apertium-tur, bring coverage to >90%
- fix some phonology
- clean up some morphotactics
- bring in line with apertium-kaz/etc.
Clean up apertium-kir, bring coverage to >90%
- improve morphotactics
- bring in line with apertium-kaz/etc.

hard forms:

Keep lists of difficult-to-classify forms and take a shot at them periodically with a concordancer

trimmed transducers:

Bring trimmed coverage close to 90% for each transducer
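
When expanding a lexicon toward the coverage targets above, one common approach is to rank the forms the analyser fails on by frequency and add the most frequent ones first. A minimal sketch, assuming analyser output in the Apertium stream format where unknown words look like `^form/*form$`; the function name `top_unknown_forms` is hypothetical. Forms are counted as written, with no case folding, because default lowercasing does not respect the Turkish dotted/dotless i distinction.

```python
import re
from collections import Counter

# Matches unknown lexical units such as ^form/*form$ and captures the surface form.
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*[^$]*\$")


def top_unknown_forms(analysed: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent surface forms that received no analysis."""
    counts = Counter(match.group(1) for match in UNKNOWN_UNIT.finditer(analysed))
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "^foo/*foo$ ^bar/bar<n>$ ^foo/*foo$ ^baz/*baz$"
    for form, freq in top_unknown_forms(sample):
        print(freq, form)
```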

CG and lrx work

especially in need of attention:

Apertium-uzb
Apertium-kir

Grammar stuff

model basic transfer4 grammar for each language (with remapping rules to the other languages)
- Get Turkish relative "ki" to Kyrgyz relative clauses working
- Get transfer working for both directions

Testvoc

Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)

Plan of attack

Get a better corpus for Uzbek
Run transducers against corpora and add the most frequently missing stems and any missing morphology
Keep a regression test corpus
Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
Try as much as possible to work on everything in parallel, but have goals defined in series
Document tur-uzb better on the wiki
Run testvoc on various categories for various translation directions regularly
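
The <10% WER goal in the list above can be checked with a word-level edit distance between the machine translation and a post-edited reference. The sketch below is a generic implementation using whitespace tokenisation, not the project's own evaluation script; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("a b c d", "a x c")
    print(f"WER: {wer:.1f}%")
```

Running this over the regression test corpus after each grammar or dix change gives a single number to compare against the 10% threshold.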

@@ Line 1: / Line 1: @@
 {{TOCD}}
+'''Turkic pairs from nursery to release quality'''
+* [[User:Firespeaker/GSoC2014/Application draft|Application draft]]
+* [[User:Firespeaker/GSoC2014/Workplan|Workplan]]
+* [[User:Firespeaker/GSoC2014/Progress|Progress]]
+* [[User:Firespeaker/GSoC2014/TODO|To-do list]]
 == Current status ==
 === Bidixes ===
-* [[kaz-kir]] ({{#lst:Apertium-kaz-kir/stats|kaz-kir-stems}} stems)
+* [[kaz-kir]] ({{#lst:Apertium-kaz-kir/stats|kaz-kir_stems}} stems)
-* [[tur-kir]] ({{#lst:Apertium-tur-kir/stats|tur-kir-stems}} stems)
+* [[tur-kir]] ({{#lst:Apertium-tur-kir/stats|tur-kir_stems}} stems)
-* [[tur-uzb]] ({{#lst:Apertium-tur-uzb/stats|tur-uzb-stems}} stems)
+* [[tur-uzb]] ({{#lst:Apertium-tur-uzb/stats|tur-uzb_stems}} stems)
 === Monodixes ===
-* [[apertium-kaz]] - (~{{#lst:Apertium-kaz/stats/average}}% coverage, {{#lst:Apertium-kaz/stats|stems}} stems) - production
+* [[apertium-kaz]] - (~{{#lst:Apertium-kaz/stats/average}}% coverage, {{#lst:Apertium-kaz/stats|stems}} stems) - production (original: 90.8%, 11,402)
-* [[apertium-kir]] - (~{{#lst:Apertium-kir/stats/average}}% coverage, {{#lst:Apertium-kir/stats|stems}} stems) - working
+* [[apertium-kir]] - (~{{#lst:Apertium-kir/stats/average}}% coverage, {{#lst:Apertium-kir/stats|stems}} stems) - working (original: 86.7%, 13,705)
-* [[apertium-tur]] - (~{{#lst:Apertium-tur/stats/average}}% coverage, {{#lst:Apertium-tur/stats|stems}} stems) - working
+* [[apertium-tur]] - (~{{#lst:Apertium-tur/stats/average}}% coverage, {{#lst:Apertium-tur/stats|stems}} stems) - working (original: 86.6%, 11,172)
-* [[apertium-uzb]] - (~{{#lst:Apertium-uzb/stats/average}}% coverage, {{#lst:Apertium-uzb/stats|stems}} stems) - development
+* [[apertium-uzb]] - (~{{#lst:Apertium-uzb/stats/average}}% coverage, {{#lst:Apertium-uzb/stats|stems}} stems) - development (original: 82.9%, 3,957)
 === CG, lrx ===
-* We should start keeping track of number of lrx rules.
+* We should start keeping track of number of lrx rules
+** better: keep track of per-token ambiguity: <code>tokens( analyser | CG | biltrans | lrx ) / tokens( analyser | CG | biltrans )</code>
 * We could quantify CG progress with per-token ambiguity measures across coprora?
+** <code>tokens( analyser | CG ) / tokens( analyser )</code>
 == To-do list ==
