Apertium-kaz-kir/Workplan


Major goals

  • Good WER
  • Clean testvoc
  • 12'000 stems in bidix (~1000 stems per week, or ~200 per day)
  • Sort Adjective and Noun stems in kir.lexc into appropriate categories
  • Trimmed coverage approaching 90%
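For reference, WER (word error rate) is conventionally the word-level edit distance between the machine translation and a post-edited reference, divided by the number of words in the reference. The sketch below only illustrates that definition; it is not the evaluation tool used for this pair, and the function name `wer` and the sample sentences are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h from the hypothesis
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences: one substituted word out of four.
    print(f"{wer('a b c d', 'a x c d'):.2%}")
```

Under this definition, a target of WER ~10% on a 500-word evaluation text corresponds to roughly 50 word-level edits during post-editing.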

Schedule

Timeline

See the GSoC 2013 Timeline for the complete timeline. Important coding dates follow:

  • June 17th: coding begins
  • July 29th - August 2nd: midterm evaluations
  • September 16th - September 23rd: pencils down
  • September 27th: final evaluation

Workplan

week | dates | goals | eval: accomplishments | eval: notes
post-application period (3 - 24 May)
Goals:
  1. finish coding challenge with WER ~10%
  2. trimmed coverage 45%
  3. total 250 stems in dix
Accomplishments:
  1. coding challenge: WER ~9%
  2. trimmed coverage: 52%, 48%
  3. stems in dix: 380
Notes:
  • Demonstrated ability to add stems to dix and lexc.
  • A couple of easy lexical selection rules are still not written.
  • Needs to learn more about other aspects of Apertium and evaluation.
Firespeaker 06:45, 20 May 2013 (UTC)
community bonding period (27 May - 16 June)
Goals:
  1. run first testvoc
  2. run coverage scripts
  3. get first frequency lists
  4. write ≥4 lexical selection rules
  5. write ≥2 transfer rules
  6. write ≥3 disambig rules

note: should be in IRC every day

Accomplishments:
  1. ran trimmed coverage script on a corpus
  2. took a look at frequency lists
  3. wrote 4 pairs of lexical selection rules
  4. wrote 4 variants of 1 transfer rule
Notes:
  • demonstrated ability to work with lexical selection rules
  • demonstrated ability to work with transfer rules
  • got only some experience with coverage scripts
  • did not get experience with testvoc
  • did not get experience with disambig rules
  • was not around IRC frequently
  • worked in bursts; never spent a single long stretch of time on the project

Firespeaker 02:28, 2 July 2013 (UTC)

Week 1 (17 - 22 June)
Goals:
  1. total 1500 stems in dix
  2. clean testvoc for <postadv> <ij>
  3. 500-word evaluation, WER ~10%
  4. trimmed coverage 51%
Notes:
  • did not show up --Firespeaker 02:28, 2 July 2013 (UTC)
Week 2 (23 - 29 June)
Goals:
  1. total 2400 stems in dix
  2. clean testvoc for <num> <post>
  3. trimmed coverage 53%
Accomplishments:
  1. stems in dix: 408
Notes:
  • did not show up --Firespeaker 02:28, 2 July 2013 (UTC)
Week 3 (30 June - 6 July)
Goals:
  1. total 3200 stems in dix
  2. clean testvoc for <cnjcoo> <cnjadv> <cnjsub>
  3. trimmed coverage 55%
Accomplishments:
  1. stems in dix: 508
  2. trimmed coverage: 59.5%, 51.5%
Notes:
  • trimmed coverage good
  • too narrow a focus on a single corpus
  • number of stems too low
  • no new WER text
  • no testvoc

Firespeaker 20:43, 8 July 2013 (UTC)

Week 4 (7 - 13 July)
Goals:
  1. total 4000 stems in dix
  2. clean testvoc for <adv>
  3. trimmed coverage 59%
Accomplishments:
  1. stems in dix: 2574
  2. trimmed coverage: 69.3%, 63.8%
  3. azattyq_24455849 WER: 14.78%
  4. completed most of TODO-list
Notes:
  • good progress on adding stems
  • fixed little things as directed
  • good progress on post-editing process
  • didn't make good progress on reducing WER
  • still no testvoc
  • committed once every 3 or 4 days; should be committing every day
  • poor communication with mentors; needs to be around more often

Firespeaker 22:16, 22 July 2013 (UTC)

Week 5 (14 - 20 July)
Goals:
  1. total 4800 stems in dix
  2. clean testvoc for <prn> <det>
  3. trimmed coverage 63%
Week 6 (21 - 27 July)
Goals:
  1. total 5600 stems in dix
  2. clean testvoc for <adj> <adj><advl>
  3. trimmed coverage 68%
Accomplishments:
  1. stems in dix: 5552
  2. trimmed coverage:
  3. azattyq_24455849 WER:
Week 7 (28 July - 3 August)
Goals:
  1. total 6400 stems in dix
  2. trimmed coverage 70%
midterm eval (2 August)
Goals:
  1. total 6500 stems in dix
  2. 500-word evaluation, WER ~10%
  3. trimmed coverage 72%
Week 8 (4 - 10 August)
Goals:
  1. total 7200 stems in dix
  2. clean testvoc for <n> <num><subst> <np> <adj><subst>
  3. trimmed coverage 75%
Week 9 (11 - 17 August)
Goals:
  1. total 8000 stems in dix
  2. trimmed coverage 78%
Week 10 (18 - 24 August)
Goals:
  1. total 8800 stems in dix
  2. trimmed coverage 81%
Week 11 (25 - 31 August)
Goals:
  1. total 9600 stems in dix
  2. clean testvoc for <v>
  3. trimmed coverage 83%
Week 12 (1 - 7 September)
Goals:
  1. total 10400 stems in dix
  2. trimmed coverage 85%
Week 13 (8 - 15 September)
Goals:
  1. total 11200 stems in dix
  2. trimmed coverage 87%
pencils-down week / final evaluation (16 - 23 September)
Goals:
  1. total 12000 stems in dix
  2. 500-word evaluation, WER ~10%
  3. clean testvoc for all categories
  4. trimmed coverage 88%
  5. release 0.1.0 and move to trunk

Tips and Tricks

Adding stems quickly

  • Add top stems from frequency lists of unknown forms
  • Use spectie's dix-entries-to-be-checked script
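
A frequency list of unknown forms can be sketched roughly as follows, assuming the corpus has already been run through the analyser and the output is in the Apertium stream format, where an unknown form carries a leading asterisk (e.g. `^foo/*foo$`). The function name `unknown_frequencies` and the sample words are hypothetical, and the regular expression ignores escaped characters, so treat this as an illustration rather than a replacement for the existing coverage scripts.

```python
import re
from collections import Counter
from typing import Tuple

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]+)/([^$]*)\$")


def unknown_frequencies(analysed_text: str) -> Tuple[Counter, float]:
    """Return (frequency list of unknown forms, coverage as a fraction)."""
    unknown = Counter()
    total = 0
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # The analyser marks a form it cannot analyse with a leading asterisk.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return unknown, coverage


if __name__ == "__main__":
    # Hypothetical analyser output with placeholder words.
    sample = "^foo/*foo$ ^bar/bar<n><nom>$ ^foo/*foo$ ^baz/*baz$"
    freqs, coverage = unknown_frequencies(sample)
    for form, count in freqs.most_common():
        print(count, form)
    print(f"coverage: {coverage:.1%}")
```

Working down the list from the most frequent form gives the largest coverage gain per stem added.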