Difference between revisions of "User:Firespeaker/GSoC2014/Workplan"

 
(21 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Major goals ==
== Primary goals ==
* A '''production-ready''' release of '''kaz-kir'''
** Translates kaz→kir and kir→kaz with consistently <10% WER
** Trimmed coverage for kaz and kir ≥90%
* A '''production-ready''' release of '''tur-kir'''
** Translates tur→kir and kir→tur with consistently <20% WER
** Trimmed coverage for tur and kir ≥85%
* A '''stable release''' of '''uzb-tur'''
** Translates tur→uzb and uzb→tur with consistently <25% WER
** Trimmed coverage for tur and uzb ≥80%
* While '''bidix size''' is not built into the goals, the trimmed coverage numbers can be seen as a more relevant proxy for the same basic idea.


== Schedule ==
== Plan ==
=== Schedule ===
=== Schedule ===
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline.
Dates need to be verified.


{|class="wikitable"
{|class="wikitable"
Line 15: Line 25:
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April
|
|
* [[apertium-kir]] to 90% coverage
* [[apertium-kir]] to 90% coverage (with kaz-like transducer)
* [[apertium-tur]] to 90% coverage
* [[apertium-tur]] to 90% coverage (with kaz-like transducer)
* [[apertium-uzb]] to 90% coverage
* [[apertium-uzb]] to 90% coverage (with kaz-like transducer)
* build arsenal of texts with post-edited translations:
* build arsenal of texts with post-edited translations:
** four 200-word texts in each kaz, kir, tur, uzb
** four 200-word texts in each kaz, kir, tur, uzb
** four 500-word texts in each kaz, kir, tur, uzb
** four 500-word texts in each kaz, kir, tur, uzb
| {{Workeval5|3}}
|
* Reworked apertium-tur verb morphology on paper
* Much better disam in apertium-tur
* Have a bunch of texts, not many post-edited
|
* Need to rework transfer rules for apertium-tur
|-
|-
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May
|
|
* tur-uzb bidix to 7000 stems
* tur-uzb bidix to 7000 stems
* make some real CG for kir, uzb
* make some baseline CG for kir, uzb
* one 200-word kaz-kir text to <10% WER
* one 200-word kaz-kir text to <10% WER
* one 200-word kir-kaz text to <10% WER
* one 200-word kir-kaz text to <10% WER
Line 32: Line 49:
* one 200-word tur-uzb text to <10% WER
* one 200-word tur-uzb text to <10% WER
* one 200-word uzb-tur text to <10% WER
* one 200-word uzb-tur text to <10% WER
|{{Workeval5|2}}
|
* Implemented reworking of apertium-tur verb morphology
|
* Around, but lost time due to end of semester and conferences
|-
|-
! 1 !! 19 - 24 May
! 1 !! 19 - 24 May
Line 38: Line 60:
* one 200-word kir-kaz text to <10% WER
* one 200-word kir-kaz text to <10% WER
* work on kir CG and lrx
* work on kir CG and lrx
|rowspan="3"| {{Workeval5|3}}
|
* Fixed some apertium-tur phonology to go with morphology rework
|
* Mostly worked on LREC poster
|-
|-
! 2 !! 25 - 31 May
! 2 !! 25 - 31 May
Line 43: Line 70:
* one 200-word kaz-kir text to <10% WER
* one 200-word kaz-kir text to <10% WER
* one 200-word tur-kir text to <10% WER
* one 200-word tur-kir text to <10% WER
* work on kaz CG and lrx
* work on tur CG and lrx
* work on tur CG and lrx
|
|
* At LREC, worked a lot on apertium-uig and apertium-kaz-uig
|-
|-
! 3 !! 1 - 7 June
! 3 !! 1 - 7 June
Line 50: Line 81:
* one 200-word uzb-tur text to <10% WER
* one 200-word uzb-tur text to <10% WER
* work on uzb CG and lrx
* work on uzb CG and lrx
* work on tur CG and lrx
|
* Fixed a small handful of transfer issues in apertium-tur-kir
|
* Got apertium-uig in a state for others to work with
* Completed eval for last few weeks
|-
|-
! 4 !! 8 - 14 June
! 4 !! 8 - 14 June
Line 57: Line 94:
* work on kir CG and lrx
* work on kir CG and lrx
* start testvoc nouns for all pairs
* start testvoc nouns for all pairs
|{{Workeval5|4}}
|
* Brought coverage of apertium-tur-kir on SETimes corpus up by 2%
* Brought testvoc of apertium-tur-kir on SETimes corpus down from 10.39% to 0.22%
|
|-
|-
! 5 !! 15 - 21 June
! 5 !! 15 - 21 June
Line 62: Line 104:
* one 500-word kaz-kir text to <10% WER
* one 500-word kaz-kir text to <10% WER
* one 500-word tur-kir text to <10% WER
* one 500-word tur-kir text to <10% WER
* work on kaz CG and lrx
* work on tur CG and lrx
* work on tur CG and lrx
* continue testvoc nouns for all pairs
* continue testvoc nouns for all pairs
|{{Workeval5|3}}
|
* Brought testvoc of apertium-tur-kir on SETimes corpus down from 0.22% to 0.04%
|
* Made and presented poster for Morphology Fest
|-
|-
! 6 !! 22 - 28 June
! 6 !! 22 - 28 June
Line 69: Line 117:
* one 500-word tur-uzb text to <10% WER
* one 500-word tur-uzb text to <10% WER
* one 500-word uzb-tur text to <10% WER
* one 500-word uzb-tur text to <10% WER
* work on tur CG and lrx
* work on uzb CG and lrx
* work on uzb CG and lrx
* continue testvoc nouns for all pairs
* continue testvoc nouns for all pairs
|{{Workeval5|0}}
|
|
* Moving and getting situated week
* (break for personal reasons)
|-
|-
!colspan="2" style="text-align: right"|midterm eval<br />29 June
! 7 !! 29 June - 5 July
|
* kaz(-kir) trimmed coverage ≥90%
* kir(-kaz) trimmed coverage ≥90%
* tur(-kir) trimmed coverage ≥90%
* kir(-tur) trimmed coverage ≥90%
* tur(-uzb) trimmed coverage ≥80%
* uzb(-tur) trimmed coverage ≥80%
|{{Workeval5|3}}
|
|
|
* finish testvoc nouns for all pairs
|-
|-
! 7 !! 29 June - 5 July
!colspan="2" style="text-align: right"|midterm eval<br />July 6
|
* get texts for kaz-kir translating
|-
|-
! 8 !! 6 - 12 July
! 8 !! 6 - 12 July
|
|
* clean up kir.lexc
* testvoc adjs for all pairs
|-
|-
! 9 !! 13 - 19 July
! 9 !! 13 - 19 July
|
|
* testvoc numerals for all pairs
* corpus testvoc for kaz-kir
|-
|-
! 10 !! 20 - 26 July
! 10 !! 20 - 26 July
|
|
*
* testvoc v.iv for all pairs
|-
|-
! 11 !! 27 July - 2 August
! 11 !! 27 July - 2 August
|
|
*
* testvoc v.tv categories for all pairs
|-
|-
! 12 !! 3 - 9 August
! 12 !! 3 - 10 August
|
|
*
* testvoc adverbs for all pairs
|-
|-
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August
! 13 !! 10 - 18 August
|
* testvoc misc categories for all pairs
|-
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />18 August - 24 August
|
|
* move pairs to trunk
* move pairs to trunk
* document stuff better on the wiki
* document stuff better on the wiki
* make live at [http://turkic.apertium.org turkic.apertium.org]
* make the pairs live at [http://turkic.apertium.org turkic.apertium.org]
|}
|}


=== GSoC Timeline ===
=== Getting started ===
* make scripts for:
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline. Important coding dates follow:
** getting raw numbers for [[User:Firespeaker/GSoC2014/Workplan|Progress]]
* March 10th - March 21st: application
** doing regression tests (/learn how to use existing scripts)
* April 21st - May 19th: community bonding
** [[User:Firespeaker/Cleaning up a tail|guesser]]
* May 19th: coding begins
* get updated corpora for:
* ??: midterm evaluations
** Uzbek
* August 18th?: pencils down
** Turkish
* ??: final evaluation

=== Goals by time ===
* Community bonding (4+ weeks):
** apertium-kir, apertium-tur, apertium-uzb coverages to 90%
** one 200-word text for each direction to &lt;10% WER
** make some real CG for kir, uzb
** build arsenal of 4 200-word texts and 4 500-word texts translated to all languages
** tur-uzb bidix to 7000 stems


=== Recurring ===
* Coding period (13 weeks)
** First half (7 weeks):
* The end of every week:
** Update [[User:Firespeaker/GSoC2014/Progress|Progress]]
*** work on WER (one text per week)
* Constantly:
*** beef up CG for each language
** Add good sentences to regression tests
*** lrx, transfer as needed
** Second half (6 weeks):
** Clean up lexc files
*** work on testvoc
*** remove duplicate entries
*** alphabetise sections?
*** add glosses, etc.

Latest revision as of 18:50, 2 July 2014

Primary goals

  • A production-ready release of kaz-kir
    • Translates kaz→kir and kir→kaz with consistently <10% WER
    • Trimmed coverage for kaz and kir ≥90%
  • A production-ready release of tur-kir
    • Translates tur→kir and kir→tur with consistently <20% WER
    • Trimmed coverage for tur and kir ≥85%
  • A stable release of uzb-tur
    • Translates tur→uzb and uzb→tur with consistently <25% WER
    • Trimmed coverage for tur and uzb ≥80%
  • While bidix size is not built into the goals, the trimmed coverage numbers can be seen as a more relevant proxy for the same basic idea.
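
The WER targets above can be checked with a word-level edit distance between a raw translation and its post-edited version. The following is a minimal sketch of that calculation, not the evaluation tool actually used for this project; the function name, the whitespace tokenisation, and the sample strings are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    reference  -- the post-edited (corrected) translation
    hypothesis -- the raw machine translation output
    Tokenisation is plain whitespace splitting (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder text: one substituted word out of ten.
reference = "one two three four five six seven eight nine ten"
hypothesis = "one two three four FIVE six seven eight nine ten"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 10.0%
```

With a strict "<10%" target, the 10-word sample above would just miss the goal; a 200-word text could contain at most 19 word errors.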

Plan

Schedule

See the GSoC 2014 Timeline for the complete timeline.

Columns: week, dates, goals, eval, accomplishments, notes

post-application period, 22 March - 20 April
goals:
  • apertium-kir to 90% coverage (with kaz-like transducer)
  • apertium-tur to 90% coverage (with kaz-like transducer)
  • apertium-uzb to 90% coverage (with kaz-like transducer)
  • build arsenal of texts with post-edited translations:
    • four 200-word texts in each of kaz, kir, tur, uzb
    • four 500-word texts in each of kaz, kir, tur, uzb
eval: 3
accomplishments:
  • Reworked apertium-tur verb morphology on paper
  • Much better disambiguation in apertium-tur
  • Have a bunch of texts, not many post-edited
notes:
  • Need to rework transfer rules for apertium-tur

community bonding period, 21 April - 19 May
goals:
  • tur-uzb bidix to 7000 stems
  • make some baseline CG for kir, uzb
  • one 200-word kaz-kir text to <10% WER
  • one 200-word kir-kaz text to <10% WER
  • one 200-word tur-kir text to <10% WER
  • one 200-word kir-tur text to <10% WER
  • one 200-word tur-uzb text to <10% WER
  • one 200-word uzb-tur text to <10% WER
eval: 2
accomplishments:
  • Implemented reworking of apertium-tur verb morphology
notes:
  • Around, but lost time due to end of semester and conferences

week 1, 19 - 24 May
goals:
  • one 200-word kir-tur text to <10% WER
  • one 200-word kir-kaz text to <10% WER
  • work on kir CG and lrx
eval: 3 (one score for weeks 1-3)
accomplishments:
  • Fixed some apertium-tur phonology to go with morphology rework
notes:
  • Mostly worked on LREC poster

week 2, 25 - 31 May
goals:
  • one 200-word kaz-kir text to <10% WER
  • one 200-word tur-kir text to <10% WER
  • work on kaz CG and lrx
  • work on tur CG and lrx
eval: 3 (one score for weeks 1-3)
notes:
  • At LREC, worked a lot on apertium-uig and apertium-kaz-uig

week 3, 1 - 7 June
goals:
  • one 200-word tur-uzb text to <10% WER
  • one 200-word uzb-tur text to <10% WER
  • work on uzb CG and lrx
  • work on tur CG and lrx
eval: 3 (one score for weeks 1-3)
accomplishments:
  • Fixed a small handful of transfer issues in apertium-tur-kir
notes:
  • Got apertium-uig in a state for others to work with
  • Completed eval for last few weeks

week 4, 8 - 14 June
goals:
  • one 500-word kir-tur text to <10% WER
  • one 500-word kir-kaz text to <10% WER
  • work on kir CG and lrx
  • start testvoc nouns for all pairs
eval: 4
accomplishments:
  • Brought coverage of apertium-tur-kir on SETimes corpus up by 2%
  • Brought testvoc of apertium-tur-kir on SETimes corpus down from 10.39% to 0.22%

week 5, 15 - 21 June
goals:
  • one 500-word kaz-kir text to <10% WER
  • one 500-word tur-kir text to <10% WER
  • work on kaz CG and lrx
  • work on tur CG and lrx
  • continue testvoc nouns for all pairs
eval: 3
accomplishments:
  • Brought testvoc of apertium-tur-kir on SETimes corpus down from 0.22% to 0.04%
notes:
  • Made and presented poster for Morphology Fest

week 6, 22 - 28 June
goals:
  • one 500-word tur-uzb text to <10% WER
  • one 500-word uzb-tur text to <10% WER
  • work on tur CG and lrx
  • work on uzb CG and lrx
  • continue testvoc nouns for all pairs
eval: 0
notes:
  • Moving and getting situated week
  • (break for personal reasons)

midterm eval, 29 June
goals:
  • kaz(-kir) trimmed coverage ≥90%
  • kir(-kaz) trimmed coverage ≥90%
  • tur(-kir) trimmed coverage ≥90%
  • kir(-tur) trimmed coverage ≥90%
  • tur(-uzb) trimmed coverage ≥80%
  • uzb(-tur) trimmed coverage ≥80%
eval: 3

week 7, 29 June - 5 July
goals:
  • get texts for kaz-kir translating

week 8, 6 - 12 July
goals:
  • clean up kir.lexc

week 9, 13 - 19 July
goals:
  • corpus testvoc for kaz-kir

week 10, 20 - 26 July

week 11, 27 July - 2 August

week 12, 3 - 10 August

pencils-down week and final evaluation, 11 August - 18 August
goals:
  • move pairs to trunk
  • document stuff better on the wiki
  • make the pairs live at turkic.apertium.org

Getting started

  • make scripts for:
    • getting raw numbers for Progress
    • doing regression tests (/learn how to use existing scripts)
    • guesser
  • get updated corpora for:
    • Uzbek
    • Turkish
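
One of the raw numbers such a script would report is coverage: the share of tokens for which the analyser returns at least one analysis. The sketch below assumes the input is analyser output in the Apertium stream format, where each token appears as ^surface/analysis/...$ and an unknown token has a single analysis starting with an asterisk. The function name and the sample string (including its tags) are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \^ or \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed: str) -> float:
    """Fraction of lexical units that received at least one real analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found in input")

    known = 0
    for unit in units:
        parts = unit.split("/")
        # Unknown words look like ^foo/*foo$: the first analysis starts with '*'.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Made-up sample: two analysed tokens and two unknown ones.
sample = "^алма/алма<n><nom>$ ^foo/*foo$ ^бар/бар<adj>/бар<v><iv>$ ^zzz/*zzz$"
print(f"{coverage(sample):.1%}")  # 50.0%
```

For trimmed coverage, the same count would be run on the output of the analyser trimmed to the bilingual dictionary of the pair in question; the counting itself does not change.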

Recurring

  • The end of every week:
    • Update Progress
  • Constantly:
    • Add good sentences to regression tests
    • Clean up lexc files
      • remove duplicate entries
      • alphabetise sections?
      • add glosses, etc.
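
For the "remove duplicate entries" step, a rough first pass could list entries that occur more than once within the same lexicon. This is only a sketch under several assumptions: entries sit on one line each and end with ";", comments start with "!", and no entry contains an escaped "%!". The function name and the sample lexicon are hypothetical.

```python
from collections import Counter


def duplicate_entries(lexc_text: str) -> list[tuple[str, str]]:
    """Return (lexicon, entry) pairs that appear more than once.

    Comments (everything after '!') are dropped and whitespace is
    normalised before comparing, so entries that differ only in
    spacing or glosses count as duplicates.
    """
    counts: Counter = Counter()
    lexicon = None
    for line in lexc_text.splitlines():
        body = " ".join(line.split("!", 1)[0].split())
        if body.startswith("LEXICON "):
            lexicon = body.split()[1]
            continue
        # Skip anything before the first LEXICON and any non-entry line.
        if lexicon is None or not body.endswith(";"):
            continue
        counts[(lexicon, body)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical sample lexicon with one duplicated stem.
sample = """\
LEXICON Nouns
алма:алма N1 ; ! "apple"
китеп:китеп N1 ; ! "book"
алма:алма  N1 ;   ! "apple" again
"""
print(duplicate_entries(sample))
```

Keying on the lexicon name avoids flagging continuation-only lines that legitimately repeat across different lexicons.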