Difference between revisions of "User:Firespeaker/GSoC2014"
Latest revision as of 07:18, 14 December 2014
Turkic pairs from nursery to release quality
- [[User:Firespeaker/GSoC2014/Application draft|Application draft]]
- [[User:Firespeaker/GSoC2014/Workplan|Workplan]]
- [[User:Firespeaker/GSoC2014/Progress|Progress]]
- [[User:Firespeaker/GSoC2014/TODO|To-do list]]
Current status
Bidixes
- [[kaz-kir]] ({{#lst:Apertium-kaz-kir/stats|kaz-kir_stems}} stems)
- [[tur-kir]] ({{#lst:Apertium-tur-kir/stats|tur-kir_stems}} stems)
- [[tur-uzb]] ({{#lst:Apertium-tur-uzb/stats|tur-uzb_stems}} stems)
Monodixes
- apertium-kaz - (~94.5% coverage, 36,595 stems) - production (original: 90.8%, 11,402)
- apertium-kir - (~90.4% coverage, 14,424 stems) - working (original: 86.7%, 13,705)
- apertium-tur - (~87.3% coverage, 17,221 stems) - working (original: 86.6%, 11,172)
- apertium-uzb - (~82.9% coverage, 34,470 stems) - development (original: 82.9%, 3,957)
CG, lrx
- We should start keeping track of the number of lrx rules
- better: keep track of per-token ambiguity:
tokens( analyser | CG | biltrans | lrx ) / tokens( analyser | CG | biltrans )
- We could quantify CG progress with per-token ambiguity measures across corpora?
tokens( analyser | CG ) / tokens( analyser )
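The two ratios above could be computed from pipeline output along the following lines. This is only a sketch under stated assumptions: it reads `tokens( ... )` as the total number of readings left after a given stage, it assumes the output is in the Apertium stream format (`^surface/reading1/reading2$`), and it ignores escaped `/`, `^` and `$` characters. The function names `count_readings` and `ambiguity_ratio` are made up for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# (sketch only: escaped delimiters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def count_readings(stream):
    """Return (number of tokens, total number of readings) in a stream."""
    tokens = 0
    readings = 0
    for unit in LEXICAL_UNIT.findall(stream):
        parts = unit.split("/")
        tokens += 1
        # parts[0] is the surface form; everything after it is a reading
        readings += max(len(parts) - 1, 1)
    return tokens, readings


def ambiguity_ratio(before, after):
    """Readings remaining after a stage, relative to the stage before it."""
    _, readings_before = count_readings(before)
    _, readings_after = count_readings(after)
    return readings_after / readings_before if readings_before else 0.0


if __name__ == "__main__":
    analyser_out = "^бала/бала<n><nom>/бала<n><attr>$ ^үй/үй<n><nom>$"
    cg_out = "^бала/бала<n><nom>$ ^үй/үй<n><nom>$"
    print(ambiguity_ratio(analyser_out, cg_out))
```

A ratio close to 1 would mean the stage removes few readings; dividing the reading count by the token count instead gives the mean number of readings per token for a single stage.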
To-do list
morphological transducer work
vanilla transducers:
- Increase apertium-uzb coverage to >90%
- expand morphology
- expand lexicon
- Clean up apertium-tur, bring coverage to >90%
- fix some phonology
- clean up some morphotactics
- bring in line with apertium-kaz/etc.
- Clean up apertium-kir, bring coverage to >90%
- improve morphotactics
- bring in line with apertium-kaz/etc.
hard forms:
- Keep lists of difficult-to-classify forms and take a shot at them periodically with concordancer
trimmed transducers:
- bring trimmed coverage to approaching 90% for each transducer
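For the coverage targets above, a naive coverage figure can be computed from analyser output as a rough illustration. The sketch assumes Apertium stream format, in which an unknown token appears as `^word/*word$`; the function name `coverage_report` is hypothetical, and this is not necessarily how the wiki's coverage statistics are produced.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage_report(stream, top=10):
    """Return (coverage, most frequent unknown surface forms)."""
    total = 0
    unknown = Counter()
    for unit in LEXICAL_UNIT.findall(stream):
        surface, _, readings = unit.partition("/")
        total += 1
        # The analyser marks unknown words with a leading asterisk
        if readings.startswith("*"):
            unknown[surface.lower()] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return coverage, unknown.most_common(top)


if __name__ == "__main__":
    sample = "^kitob/kitob<n><nom>$ ^foo/*foo$ ^foo/*foo$ ^uy/uy<n><nom>$"
    print(coverage_report(sample))
```

The list of most frequent unknown forms is the natural starting point for deciding which stems to add to the lexicon first.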
CG and lrx work
especially in need of attention:
- Apertium-uzb
- Apertium-kir
Grammar stuff
- model basic transfer4 grammar for each language (with remapping rules to the other languages)
- Get Turkish relative "ki" to Kyrgyz relative clauses working
- Get transfer working for both directions
Testvoc
- Make an effort to get a clean testvoc for kaz-kir (both directions) and tur-kir (mainly tur→kir)
- See if it'll be possible to get a clean testvoc for tur-uzb (take a stab at it a few times)
Plan of attack
- Get better corpus for Uzbek
- Run transducers against corpora and add most frequently missing stems and any morphology
- Keep regression test corpus
- Run frequent WER tests and tweak grammars/dixes so that the texts consistently have <10% WER
- Try as much as possible to work on everything in parallel, but have goals defined in series
- Document tur-uzb better on the wiki
- Run testvoc on various categories for various translation directions regularly
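As an illustration of the WER target in the plan above, word error rate can be sketched as the word-level edit distance between a reference (post-edited) translation and the system output, divided by the reference length. The whitespace tokenisation and the function name `wer` are assumptions made for this example, not a description of any particular existing evaluation script.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far
    # and the first j words of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    # An empty reference is reported as 0.0 in this sketch
    return prev[-1] / len(ref) if ref else 0.0


if __name__ == "__main__":
    print(wer("a b c d", "a x c"))
```

A text would meet the <10% goal when `wer(reference, hypothesis) < 0.10`.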