Difference between revisions of "User:Firespeaker/GSoC2014/Workplan"
Firespeaker (talk | contribs) |
Firespeaker (talk | contribs) |
||
(22 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | == |
+ | == Primary goals == |
+ | * A '''production-ready''' release of '''kaz-kir''' |
||
+ | ** Translates kaz→kir and kir→kaz with consistently <10% WER |
||
+ | ** Trimmed coverage for kaz and kir ≥90% |
||
+ | * A '''production-ready''' release of '''tur-kir''' |
||
+ | ** Translates tur→kir and kir→tur with consistently <20% WER |
||
+ | ** Trimmed coverage for tur and kir ≥85% |
||
+ | * A '''stable release''' of '''uzb-tur''' |
||
+ | ** Translates tur→uzb and uzb→tur with consistently <25% WER |
||
+ | ** Trimmed coverage for tur and uzb ≥80% |
||
+ | * While '''bidix size''' is not built into the goals, the trimmed coverage numbers can be seen as a more relevant proxy for the same basic idea. |
||
− | == |
+ | == Plan == |
=== Schedule === |
=== Schedule === |
||
+ | See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline. |
||
− | Dates need to be verified. |
||
{|class="wikitable" |
{|class="wikitable" |
||
Line 15: | Line 25: | ||
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April |
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April |
||
| |
| |
||
− | * [[apertium-kir]] to 90% coverage |
+ | * [[apertium-kir]] to 90% coverage (with kaz-like transducer) |
− | * [[apertium-tur]] to 90% coverage |
+ | * [[apertium-tur]] to 90% coverage (with kaz-like transducer) |
− | * [[apertium-uzb]] to 90% coverage |
+ | * [[apertium-uzb]] to 90% coverage (with kaz-like transducer) |
* build arsenal of texts with post-edited translations: |
* build arsenal of texts with post-edited translations: |
||
** four 200-word texts in each kaz, kir, tur, uzb |
** four 200-word texts in each kaz, kir, tur, uzb |
||
** four 500-word texts in each kaz, kir, tur, uzb |
** four 500-word texts in each kaz, kir, tur, uzb |
||
+ | | {{Workeval5|3}} |
||
+ | | |
||
+ | * Reworked apertium-tur verb morphology on paper |
||
+ | * Much better disam in apertium-tur |
||
+ | * Have a bunch of texts, not many post-edited |
||
+ | | |
||
+ | * Need to rework transfer rules for apertium-tur |
||
|- |
|- |
||
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May |
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May |
||
| |
| |
||
* tur-uzb bidix to 7000 stems |
* tur-uzb bidix to 7000 stems |
||
− | * make some |
+ | * make some baseline CG for kir, uzb |
* one 200-word kaz-kir text to <10% WER |
* one 200-word kaz-kir text to <10% WER |
||
* one 200-word kir-kaz text to <10% WER |
* one 200-word kir-kaz text to <10% WER |
||
Line 32: | Line 49: | ||
* one 200-word tur-uzb text to <10% WER |
* one 200-word tur-uzb text to <10% WER |
||
* one 200-word uzb-tur text to <10% WER |
* one 200-word uzb-tur text to <10% WER |
||
+ | |{{Workeval5|2}} |
||
+ | | |
||
+ | * Implemented reworking of apertium-tur verb morphology |
||
+ | | |
||
+ | * Around, but lost time due to end of semester and conferences |
||
|- |
|- |
||
! 1 !! 19 - 24 May |
! 1 !! 19 - 24 May |
||
Line 38: | Line 60: | ||
* one 200-word kir-kaz text to <10% WER |
* one 200-word kir-kaz text to <10% WER |
||
* work on kir CG and lrx |
* work on kir CG and lrx |
||
+ | |rowspan="3"| {{Workeval5|3}} |
||
+ | | |
||
+ | * Fixed some apertium-tur phonology to go with morphology rework |
||
+ | | |
||
+ | * Mostly worked on LREC poster |
||
|- |
|- |
||
! 2 !! 25 - 31 May |
! 2 !! 25 - 31 May |
||
Line 43: | Line 70: | ||
* one 200-word kaz-kir text to <10% WER |
* one 200-word kaz-kir text to <10% WER |
||
* one 200-word tur-kir text to <10% WER |
* one 200-word tur-kir text to <10% WER |
||
+ | * work on kaz CG and lrx |
||
* work on tur CG and lrx |
* work on tur CG and lrx |
||
+ | | |
||
+ | | |
||
+ | * At LREC, worked a lot on apertium-uig and apertium-kaz-uig |
||
|- |
|- |
||
! 3 !! 1 - 7 June |
! 3 !! 1 - 7 June |
||
Line 50: | Line 81: | ||
* one 200-word uzb-tur text to <10% WER |
* one 200-word uzb-tur text to <10% WER |
||
* work on uzb CG and lrx |
* work on uzb CG and lrx |
||
+ | * work on tur CG and lrx |
||
+ | | |
||
+ | * Fixed a small handful of transfer issues in apertium-tur-kir |
||
+ | | |
||
+ | * Got apertium-uig in a state for others to work with |
||
+ | * Completed eval for last few weeks |
||
|- |
|- |
||
! 4 !! 8 - 14 June |
! 4 !! 8 - 14 June |
||
Line 57: | Line 94: | ||
* work on kir CG and lrx |
* work on kir CG and lrx |
||
* start testvoc nouns for all pairs |
* start testvoc nouns for all pairs |
||
+ | |{{Workeval5|4}} |
||
+ | | |
||
+ | * Brought coverage of apertium-tur-kir on SETimes corpus up by 2% |
||
+ | * Brought testvoc of apertium-tur-kir on SETimes corpus down from 10.39% to 0.22% |
||
+ | | |
||
|- |
|- |
||
! 5 !! 15 - 21 June |
! 5 !! 15 - 21 June |
||
Line 62: | Line 104: | ||
* one 500-word kaz-kir text to <10% WER |
* one 500-word kaz-kir text to <10% WER |
||
* one 500-word tur-kir text to <10% WER |
* one 500-word tur-kir text to <10% WER |
||
+ | * work on kaz CG and lrx |
||
* work on tur CG and lrx |
* work on tur CG and lrx |
||
* continue testvoc nouns for all pairs |
* continue testvoc nouns for all pairs |
||
+ | |{{Workeval5|3}} |
||
+ | | |
||
+ | * Brought testvoc of apertium-tur-kir on SETimes corpus down from 0.22% to 0.04% |
||
+ | | |
||
+ | * Made and presented poster for Morphology Fest |
||
|- |
|- |
||
! 6 !! 22 - 28 June |
! 6 !! 22 - 28 June |
||
Line 69: | Line 117: | ||
* one 500-word tur-uzb text to <10% WER |
* one 500-word tur-uzb text to <10% WER |
||
* one 500-word uzb-tur text to <10% WER |
* one 500-word uzb-tur text to <10% WER |
||
+ | * work on tur CG and lrx |
||
* work on uzb CG and lrx |
* work on uzb CG and lrx |
||
* continue testvoc nouns for all pairs |
* continue testvoc nouns for all pairs |
||
+ | |{{Workeval5|0}} |
||
+ | | |
||
+ | | |
||
+ | * Moving and getting situated week |
||
+ | * (break for personal reasons) |
||
|- |
|- |
||
+ | !colspan="2" style="text-align: right"|midterm eval<br />29 June |
||
− | ! 7 !! 29 June - 5 July |
||
+ | | |
||
+ | * kaz(-kir) trimmed coverage ≥90% |
||
+ | * kir(-kaz) trimmed coverage ≥90% |
||
+ | * tur(-kir) trimmed coverage ≥90% |
||
+ | * kir(-tur) trimmed coverage ≥90% |
||
+ | * tur(-uzb) trimmed coverage ≥80% |
||
+ | * uzb(-tur) trimmed coverage ≥80% |
||
+ | |{{Workeval5|3}} |
||
+ | | |
||
| |
| |
||
− | * finish testvoc nouns for all pairs |
||
|- |
|- |
||
+ | ! 7 !! 29 June - 5 July |
||
− | !colspan="2" style="text-align: right"|midterm eval<br />July 6 |
||
+ | | |
||
+ | * get texts for kaz-kir translating |
||
|- |
|- |
||
! 8 !! 6 - 12 July |
! 8 !! 6 - 12 July |
||
| |
| |
||
+ | * clean up kir.lexc |
||
− | * testvoc adjs for all pairs |
||
|- |
|- |
||
! 9 !! 13 - 19 July |
! 9 !! 13 - 19 July |
||
| |
| |
||
− | * |
+ | * corpus testvoc for kaz-kir |
|- |
|- |
||
! 10 !! 20 - 26 July |
! 10 !! 20 - 26 July |
||
| |
| |
||
+ | * |
||
− | * testvoc v.iv for all pairs |
||
|- |
|- |
||
! 11 !! 27 July - 2 August |
! 11 !! 27 July - 2 August |
||
| |
| |
||
+ | * |
||
− | * testvoc v.tv categories for all pairs |
||
|- |
|- |
||
− | ! 12 !! 3 - |
+ | ! 12 !! 3 - 10 August |
| |
| |
||
+ | * |
||
− | * testvoc adverbs for all pairs |
||
|- |
|- |
||
+ | !colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August |
||
− | ! 13 !! 10 - 18 August |
||
| |
| |
||
+ | * move pairs to trunk |
||
− | * testvoc misc categories for all pairs |
||
+ | * document stuff better on the wiki |
||
− | |- |
||
+ | * make the pairs live at [http://turkic.apertium.org turkic.apertium.org] |
||
− | !colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />18 August - 24 August |
||
|} |
|} |
||
− | === |
+ | === Getting started === |
+ | * make scripts for: |
||
− | See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline. Important coding dates follow: |
||
+ | ** getting raw numbers for [[User:Firespeaker/GSoC2014/Progress|Progress]] |
||
− | * March 10th - March 21st: application |
||
+ | ** doing regression tests (/learn how to use existing scripts) |
||
− | * April 21st - May 19th: community bonding |
||
+ | ** [[User:Firespeaker/Cleaning up a tail|guesser]] |
||
− | * May 19th: coding begins |
||
+ | * get updated corpora for: |
||
− | * ??: midterm evaluations |
||
+ | ** Uzbek |
||
− | * August 18th?: pencils down |
||
+ | ** Turkish |
||
− | * ??: final evaluation |
||
− | |||
− | === Goals by time === |
||
− | * Community bonding (4+ weeks): |
||
− | ** apertium-kir, apertium-tur, apertium-uzb coverages to 90% |
||
− | ** one 200-word text for each direction to <10% WER |
||
− | ** make some real CG for kir, uzb |
||
− | ** build arsenal of 4 200-word texts and 4 500-word texts translated to all languages |
||
− | ** tur-uzb bidix to 7000 stems |
||
+ | === Recurring === |
||
− | * Coding period (13 weeks) |
||
− | + | * The end of every week: |
|
+ | ** Update [[User:Firespeaker/GSoC2014/Progress|Progress]] |
||
− | *** work on WER (one text per week) |
||
+ | * Constantly: |
||
− | *** beef up CG for each language |
||
+ | ** Add good sentences to regression tests |
||
− | *** lrx, transfer as needed |
||
− | ** |
+ | ** Clean up lexc files |
− | *** |
+ | *** remove duplicate entries |
+ | *** alphabetise sections? |
||
+ | *** add glosses, etc. |
Latest revision as of 18:50, 2 July 2014
Primary goals
- A production-ready release of kaz-kir
  - Translates kaz→kir and kir→kaz with consistently <10% WER
  - Trimmed coverage for kaz and kir ≥90%
- A production-ready release of tur-kir
  - Translates tur→kir and kir→tur with consistently <20% WER
  - Trimmed coverage for tur and kir ≥85%
- A stable release of uzb-tur
  - Translates tur→uzb and uzb→tur with consistently <25% WER
  - Trimmed coverage for tur and uzb ≥80%
- While bidix size is not built into the goals, the trimmed coverage numbers can be seen as a more relevant proxy for the same basic idea.
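The WER targets above can be checked with a small script. The following is a minimal sketch, assuming WER is the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited text. The function name `word_error_rate` is illustrative and is not part of any Apertium tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: `reference` is the post-edited translation and
    `hypothesis` is the raw machine translation output.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a 200-word text meets the <10% target when fewer than 20 word-level edits separate the output from its post-edited version.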
Plan
Schedule
See GSoC 2014 Timeline for complete timeline.
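week | dates | goals | eval | accomplishments | notes |
---|---|---|---|---|---|
post-application period | 22 March - 20 April | apertium-kir, apertium-tur, apertium-uzb each to 90% coverage (with kaz-like transducer); build arsenal of texts with post-edited translations: four 200-word texts and four 500-word texts in each of kaz, kir, tur, uzb | 3/5 | Reworked apertium-tur verb morphology on paper; much better disambiguation in apertium-tur; have a bunch of texts, not many post-edited | Need to rework transfer rules for apertium-tur |
community bonding period | 21 April - 19 May | tur-uzb bidix to 7000 stems; make some baseline CG for kir, uzb; one 200-word kaz-kir text to <10% WER; one 200-word kir-kaz text to <10% WER; …; one 200-word tur-uzb text to <10% WER; one 200-word uzb-tur text to <10% WER | 2/5 | Implemented reworking of apertium-tur verb morphology | Around, but lost time due to end of semester and conferences |
1 | 19 - 24 May | …; one 200-word kir-kaz text to <10% WER; work on kir CG and lrx | 3/5 (weeks 1-3 combined) | Fixed some apertium-tur phonology to go with morphology rework | Mostly worked on LREC poster |
2 | 25 - 31 May | …; one 200-word kaz-kir text to <10% WER; one 200-word tur-kir text to <10% WER; work on kaz CG and lrx; work on tur CG and lrx | (see week 1) | | At LREC, worked a lot on apertium-uig and apertium-kaz-uig |
3 | 1 - 7 June | …; one 200-word uzb-tur text to <10% WER; work on uzb CG and lrx; work on tur CG and lrx | (see week 1) | Fixed a small handful of transfer issues in apertium-tur-kir | Got apertium-uig in a state for others to work with; completed eval for last few weeks |
4 | 8 - 14 June | …; work on kir CG and lrx; start testvoc nouns for all pairs | 4/5 | Brought coverage of apertium-tur-kir on SETimes corpus up by 2%; brought testvoc of apertium-tur-kir on SETimes corpus down from 10.39% to 0.22% | |
5 | 15 - 21 June | …; one 500-word kaz-kir text to <10% WER; one 500-word tur-kir text to <10% WER; work on kaz CG and lrx; work on tur CG and lrx; continue testvoc nouns for all pairs | 3/5 | Brought testvoc of apertium-tur-kir on SETimes corpus down from 0.22% to 0.04% | Made and presented poster for Morphology Fest |
6 | 22 - 28 June | …; one 500-word tur-uzb text to <10% WER; one 500-word uzb-tur text to <10% WER; work on tur CG and lrx; work on uzb CG and lrx; continue testvoc nouns for all pairs | 0/5 | | Moving and getting situated week (break for personal reasons) |
midterm eval | 29 June | kaz(-kir) trimmed coverage ≥90%; kir(-kaz) trimmed coverage ≥90%; tur(-kir) trimmed coverage ≥90%; kir(-tur) trimmed coverage ≥90%; tur(-uzb) trimmed coverage ≥80%; uzb(-tur) trimmed coverage ≥80% | 3/5 | | |
7 | 29 June - 5 July | get texts for kaz-kir translating | | | |
8 | 6 - 12 July | clean up kir.lexc | | | |
9 | 13 - 19 July | corpus testvoc for kaz-kir | | | |
10 | 20 - 26 July | | | | |
11 | 27 July - 2 August | | | | |
12 | 3 - 10 August | | | | |
pencils-down week, final evaluation | 11 August - 18 August | move pairs to trunk; document stuff better on the wiki; make the pairs live at turkic.apertium.org | | | |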
Getting started
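- make scripts for:
  - getting raw numbers for Progress
  - doing regression tests (or learning how to use existing scripts)
  - guesser
- get updated corpora for:
  - Uzbek
  - Turkish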
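One of the raw numbers this plan tracks is coverage. The sketch below shows how it might be computed from analyser output. It assumes the usual Apertium stream format, where each token appears as `^surface/analysis1/analysis2$` and an unknown token as `^surface/*surface$`. The function name `coverage` is hypothetical, and escaped delimiters inside tokens are not handled. Trimmed coverage would presumably be the same calculation run on the output of a trimmed analyser.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis/...$
# Assumption: no escaped '$' occurs inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed: str) -> float:
    """Fraction of lexical units that received at least one real analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed):
        parts = match.group(1).split("/")
        total += 1
        # Unknown forms carry a single pseudo-analysis starting with '*'.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    if total == 0:
        raise ValueError("no lexical units found in input")
    return known / total
```

Reading the analyser output from a file and printing `coverage(text) * 100` would give a percentage comparable to the ≥80% to ≥90% targets in the schedule.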
Recurring
- The end of every week:
  - Update Progress
- Constantly:
  - Add good sentences to regression tests
  - Clean up lexc files
    - remove duplicate entries
    - alphabetise sections?
    - add glosses, etc.
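The "remove duplicate entries" step could be partly automated. The following is a rough sketch, assuming lexc entries sit one per line, end in `;`, and carry any gloss in a trailing `!` comment. The helper `find_duplicate_entries` is hypothetical, and it does not handle escaped `%!` or multi-line entries.

```python
from collections import defaultdict


def find_duplicate_entries(lexc_text: str) -> dict:
    """Map (lexicon, entry) to the line numbers where that entry repeats."""
    seen = defaultdict(list)
    lexicon = None
    for lineno, line in enumerate(lexc_text.splitlines(), start=1):
        # Drop the trailing comment (often a gloss) before comparing.
        entry = line.split("!", 1)[0].strip()
        if not entry:
            continue
        parts = entry.split()
        if parts[0] == "LEXICON" and len(parts) > 1:
            lexicon = parts[1]
            continue
        # Only consider complete entries inside a named lexicon.
        if lexicon is None or not entry.endswith(";"):
            continue
        # Normalise internal whitespace so spacing differences do not hide duplicates.
        seen[(lexicon, " ".join(parts))].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}
```

Entries are compared only within the same LEXICON, since an identical line under a different lexicon may be intentional.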