Difference between revisions of "Apertium-kaz-kir/Workplan"

From Apertium
 
(54 intermediate revisions by the same user not shown)
Line 4: Line 4:
* 12'000 stems in bidix (~1000 stems per week, or ~200 per day)
* 12'000 stems in bidix (~1000 stems per week, or ~200 per day)
* Sort Adjective and Noun stems in kir.lexc into appropriate categories
* Sort Adjective and Noun stems in kir.lexc into appropriate categories
* Trimmed coverage approaching 90%
* [[Apertium-kaz-kir/stats#Over-all_stats|Trimmed coverage]] approaching 90%


== Schedule ==
== Schedule ==
Line 16: Line 16:
=== Workplan ===
=== Workplan ===
{|class="wikitable"
{|class="wikitable"
! week
! week !! dates !! goals !! eval !! notes
! dates
!style="width: 25%"| goals
! eval
!style="width: 25%"| accomplishments
!style="width: 35%"| notes
|-
|-
|colspan="2" align="right"|post-application period<br />3 - 24 May
!colspan="2" style="text-align: right"|post-application period<br />3 - 24 May
|
|
# finish coding challenge with WER ~10%
# finish coding challenge with WER ~10%
# trimmed coverage 45%
# trimmed coverage 45%
# total 250 stems in dix
# total 250 stems in dix
| {{Workeval5|4}}
| 4/5 '''pass'''
|
|
# coding challenge: WER ~9%
# coding challenge: WER ~9%
# trimmed coverage: 52%,48%
# trimmed coverage: 52%,48%
# stems in dix: 380
# stems in dix: 380
|
* Demonstrated ability to add stems to dix and lexc.
* A couple easy lexical selection rules are still not written.
* Needs to learn more about [[User:Firespeaker/Steps_for_writing_a_language_pair#Solve_more_complicated_translation_problems|other aspects of apertium]] and [[User:Firespeaker/Steps_for_writing_a_language_pair#Evaluate_the_pair|evaluation]].
: —[[User:Firespeaker|Firespeaker]] 06:45, 20 May 2013 (UTC)
|-
|-
|colspan="2" align="right"|community bonding period<br />27 May - 16 June
!colspan="2" style="text-align: right"|community bonding period<br />27 May - 16 June
|
|
# run first testvoc
# run first testvoc
Line 37: Line 47:
# write ≥2 transfer rules
# write ≥2 transfer rules
# write ≥3 disambig rules
# write ≥3 disambig rules
note: should be in IRC every day
| {{Workeval5|3}}
|
|
# —
|
# ran trimmed coverage script on a corpus
# took a look at frequency lists
# wrote 4 pairs of lexical selection rules
# wrote 4 variants of 1 transfer rule
# —
|
* demonstrated ability to work with lexical selection rules
* demonstrated ability to work with transfer rules
* got only some experience with coverage scripts
* did not get experience with testvoc
* did not get experience with disambig rules
* '''was not around IRC frequently'''
* worked in bursts, did not spend a single long period of time
—[[User:Firespeaker|Firespeaker]] 02:28, 2 July 2013 (UTC)
|-
|-
! 1
| 1 ||align="right"| 17 - 22 June
!style="text-align: right"| 17 - 22 June
|
|
# total 1500 stems in dix
# total 1500 stems in dix
Line 46: Line 73:
# 500-word evaluation, WER ~10%
# 500-word evaluation, WER ~10%
# trimmed coverage 51%
# trimmed coverage 51%
| {{Workeval5|0}}
|
|
|
|
* did not show up —[[User:Firespeaker|Firespeaker]] 02:28, 2 July 2013 (UTC)
|-
|-
! 2
| 2 ||align="right"| 23 - 29 June
!style="text-align: right"| 23 - 29 June
|
|
# total 2400 stems in dix
# total 2400 stems in dix
# clean testvoc for {{tag|num}} {{tag|post}}
# clean testvoc for {{tag|num}} {{tag|post}}
# trimmed coverage 53%
# trimmed coverage 53%
| {{Workeval5|0}}
|
|
# stems in dix: 408
|
|
* did not show up —[[User:Firespeaker|Firespeaker]] 02:28, 2 July 2013 (UTC)
|-
|-
! 3
| 3 ||align="right"| 30 - 6 July
!style="text-align: right"| 30 - 6 July
|
|
# total 3200 stems in dix
# total 3200 stems in dix
# clean testvoc for {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
# clean testvoc for {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
# trimmed coverage 55%
# trimmed coverage 55%
| {{Workeval5|2}}
|
# stems in dix: 508
# trimmed coverage: 59.5%,51.5%
|
|
* trimmed coverage good
|
* too narrow a focus on a single corpus
* number of stems too low
* no new WER text
* no testvoc
—[[User:Firespeaker|Firespeaker]] 20:43, 8 July 2013 (UTC)
|-
|-
! 4
| 4 ||align="right"| 7 - 13 July
!style="text-align: right"| 7 - 13 July
|
|
# total 4000 stems in dix
# total 4000 stems in dix
# clean testvoc for {{tag|adv}}
# clean testvoc for {{tag|adv}}
# trimmed coverage 59%
# trimmed coverage 59%
|{{Workeval5|2}}
|
|rowspan="2"|
|
# stems in dix: 2574
# trimmed coverage: 69.3%,63.8%
# azattyq_24455849 WER: 14.78%
# completed most of TODO-list
|rowspan="2"|
* good progress on adding stems
* fixed little things as directed
* good progress on post-editing process
* didn't make good progress on reducing WER
* still no testvoc
* committed once every 3 or 4 days; '''should be committing every day'''
* poor communication with mentors; needs to be around more often
—[[User:Firespeaker|Firespeaker]] 22:16, 22 July 2013 (UTC)
|-
|-
! 5
| 5 ||align="right"| 14 -20 July
!style="text-align: right"| 14 - 20 July
|
|
# total 4800 stems in dix
# total 4800 stems in dix
# clean testvoc for {{tag|prn}} {{tag|det}}
# clean testvoc for {{tag|prn}} {{tag|det}}
# trimmed coverage 63%
# trimmed coverage 63%
|{{Workeval5|3}}
|
|
|-
|-
! 6
| 6 ||align="right"| 21 - 27 July
!style="text-align: right"| 21 - 27 July
|
|
# total 5600 stems in dix
# total 5600 stems in dix
# clean testvoc for {{tag|adj}} {{tag|adj}}{{tag|advl}}
# clean testvoc for {{tag|adj}} {{tag|adj}}{{tag|advl}}
# trimmed coverage 68%
# trimmed coverage 68%
|{{Workeval5|3}}
|
|rowspan="3"|
|
# stems in dix: 5552
# trimmed coverage: 72%,67%
# azattyq_24455849 WER: 18.01%
|rowspan="2"|
* good improvement in dix
** should be checking for errors (e.g., extra spaces)
* not much progress with WER text
** simple lrx and t1x should be enough here
* still no indication of progress with testvoc
* better communication and commit frequency, but could still improve
—[[User:Firespeaker|Firespeaker]] 18:21, 1 August 2013 (UTC)
|-
|-
! 7
| 7 ||align="right"| 28 - 3 August
!style="text-align: right"| 28 - 3 August
|
|
# total 6400 stems in dix
# total 6400 stems in dix
# trimmed coverage 70%
# trimmed coverage 70%
|{{Workeval5|2}}
|
|
|-
|-
|colspan="2" align="right"| '''midterm eval<br />2 August'''
!colspan="2" style="text-align: right"| [[Apertium-kaz-kir/TODO#By_midterm|midterm eval]]<br />2 August
|
|
# total 6500 stems in dix
# total 6500 stems in dix
# 500-word evaluation, WER ~10%
# 500-word evaluation, WER ~10%
# trimmed coverage 72%
# trimmed coverage 72%
|{{Workeval5|2}}
|
|
* midterm TODO list goals only partially attained
|
* overall progress has been mediocre
* among the lowest-performing students
* noticeable improvement in the last few weeks
* needs to improve more to pass the final
—[[User:Firespeaker|Firespeaker]] 18:26, 1 August 2013 (UTC)
|-
|-
! 8
| 8 ||align="right"| 4 - 10 August
!style="text-align: right"| 4 - 10 August
|
|
# total 7200 stems in dix
# total 7200 stems in dix
# clean testvoc for {{tag|n}} {{tag|num}}{{tag|subst}} {{tag|np}} {{tag|adj}}{{tag|subst}}
# clean testvoc for {{tag|n}} {{tag|num}}{{tag|subst}} {{tag|np}} {{tag|adj}}{{tag|subst}}
# trimmed coverage 75%
# trimmed coverage 75%
|{{Workeval5|2}}
|
|rowspan="3"|
# stems in dix: 6493
# trimmed coverage: 79.6%,74.1%
|
|
|-
|-
! 9
| 9 ||align="right"| 11 - 17 August
!style="text-align: right"| 11 - 17 August
|
|
# total 8000 stems in dix
# total 8000 stems in dix
# trimmed coverage 78%
# trimmed coverage 78%
|{{Workeval5|2}}
|
|
|
|-
|-
! 10
| 10 ||align="right"| 18 - 24 August
!style="text-align: right"| 18 - 24 August
|
|
# total 8800 stems in dix
# total 8800 stems in dix
# trimmed coverage 81%
# trimmed coverage 81%
|{{Workeval5|3}}
|
|
|
|-
|-
! 11
| 11 ||align="right"| 25 - 31 August
!style="text-align: right"| 25 - 31 August
|
|
# total 9600 stems in dix
# total 9600 stems in dix
# clean testvoc for {{tag|v}}
# clean testvoc for {{tag|v}}
# trimmed coverage 83%
# trimmed coverage 83%
|{{Workeval5|3}}
|
|
# stems in dix: 6730
# trimmed coverage: 82.5%,78.4%
# azattyq_24455849 WER: 6.62%
|
|
|-
|-
! 12
| 12 ||align="right"| 1 - 7 September
!style="text-align: right"| 1 - 7 September
|
|
# total 10400 stems in dix
# total 10400 stems in dix
# trimmed coverage 85%
# trimmed coverage 85%
|{{Workeval5|3}}
|
|
# stems in dix: 7007
# trimmed coverage: 84.2%,79.8%
|
|
* Good [[Turkic_lexicon#Kyrgyz|adjective typology]]
* Decent progress on coverage
* Not around much later in the week
* Still no testvoc...
—[[User:Firespeaker|Firespeaker]] 07:29, 10 September 2013 (UTC)
|-
|-
! 13
| 13 ||align="right"| 8 - 15 September
!style="text-align: right"| 8 - 15 September
|
|
# total 11200 stems in dix
# total 11200 stems in dix
# trimmed coverage 87%
# trimmed coverage 87%
|{{Workeval5|1}}
|
|
# stems in dix: 7454
# trimmed coverage: 85.2%,80.4%
|
|
* Decent increase in coverage
* Still no testvoc
* Still ~600 unsorted ADJ
* Not around much
—[[User:Firespeaker|Firespeaker]] 20:06, 22 September 2013 (UTC)
|-
|-
|colspan="2" align="right"| '''pencils-down week<br />final evaluation<br />16 - 23 September'''
!colspan="2" style="text-align: right"| pencils-down week<br />final evaluation<br />16 - 23 September
|
|
# total 12000 stems in dix
# total 12000 stems in dix
Line 154: Line 258:
# clean testvoc for all categories
# clean testvoc for all categories
# trimmed coverage 88%
# trimmed coverage 88%
# release 0.1.0 and move to trunk
|
|
|
|
# stems in dix: 7546
# trimmed coverage: 85.8%,81.6%
|
* Good coverage
* "Good" WER results
** But lots of # and * errors :(
* No work on testvoc
* Some ADJ sorted; still >500 unsorted
* only 2 sets of LRX rules since early in GSoC
* only 1 transfer rule since early in GSoC
|-
!colspan="2" style="text-align: right"| Final evaluation
|
|
|
|
* Has improved coverage a certain amount
* Has not done anything else
* Mentors have had to nag to get him to work
* Has not been around enough
* Among the lowest-performing students
* Has not improved since midterm
* Last-ditch efforts not at all impressive
|}
|}

== Tips and Tricks ==
=== Adding stems quickly ===
* Add top stems from frequency lists of unknown forms
* Use spectie's dix-entries-to-be-checked script

Latest revision as of 06:42, 23 September 2013

Major goals

  • Good WER
  • Clean testvoc
  • 12'000 stems in bidix (~1000 stems per week, or ~200 per day)
  • Sort Adjective and Noun stems in kir.lexc into appropriate categories
  • Trimmed coverage approaching 90%
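
The pace quoted in the bidix goal follows from simple arithmetic. A minimal sketch, assuming a five-day working week (the page does not state this; it is inferred from ~1000 stems per week versus ~200 per day). The function names are hypothetical.

```python
def weeks_needed(target_stems: int, stems_per_week: int) -> float:
    """Number of weeks required to reach the target at a constant weekly pace."""
    return target_stems / stems_per_week


def stems_per_day(stems_per_week: int, workdays_per_week: int = 5) -> float:
    """Daily pace implied by a weekly pace; five workdays per week is an assumption."""
    return stems_per_week / workdays_per_week


if __name__ == "__main__":
    # Figures taken from the goal: 12'000 stems at ~1000 stems per week.
    print(weeks_needed(12_000, 1_000))
    print(stems_per_day(1_000))
```

At this pace the target takes about twelve weeks, which leaves little slack in a coding period that runs from 17 June to 23 September.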

Schedule

Timeline

See the GSoC 2013 Timeline for the complete timeline. Important coding dates follow:

  • June 17th: coding begins
  • July 29th - August 2nd: midterm evaluations
  • September 16th - September 23rd: pencils down
  • September 27th: final evaluation

Workplan

Columns: week, dates, goals, eval, accomplishments, notes

post-application period (3 - 24 May)
Goals:
  1. finish coding challenge with WER ~10%
  2. trimmed coverage 45%
  3. total 250 stems in dix
Accomplishments:
  1. coding challenge: WER ~9%
  2. trimmed coverage: 52%, 48%
  3. stems in dix: 380
Notes:
  • Demonstrated ability to add stems to dix and lexc.
  • A couple of easy lexical selection rules are still not written.
  • Needs to learn more about other aspects of apertium and evaluation.
Firespeaker 06:45, 20 May 2013 (UTC)

community bonding period (27 May - 16 June)
Goals:
  1. run first testvoc
  2. run coverage scripts
  3. get first frequency lists
  4. write ≥4 lexical selection rules
  5. write ≥2 transfer rules
  6. write ≥3 disambig rules
  note: should be in IRC every day
Accomplishments:
  1. ran trimmed coverage script on a corpus
  2. took a look at frequency lists
  3. wrote 4 pairs of lexical selection rules
  4. wrote 4 variants of 1 transfer rule
Notes:
  • demonstrated ability to work with lexical selection rules
  • demonstrated ability to work with transfer rules
  • got only some experience with coverage scripts
  • did not get experience with testvoc
  • did not get experience with disambig rules
  • was not around IRC frequently
  • worked in bursts, did not spend a single long period of time
Firespeaker 02:28, 2 July 2013 (UTC)

Week 1 (17 - 22 June)
Goals:
  1. total 1500 stems in dix
  2. clean testvoc for <postadv> <ij>
  3. 500-word evaluation, WER ~10%
  4. trimmed coverage 51%
Notes:
  • did not show up
Firespeaker 02:28, 2 July 2013 (UTC)

Week 2 (23 - 29 June)
Goals:
  1. total 2400 stems in dix
  2. clean testvoc for <num> <post>
  3. trimmed coverage 53%
Accomplishments:
  1. stems in dix: 408
Notes:
  • did not show up
Firespeaker 02:28, 2 July 2013 (UTC)

Week 3 (30 June - 6 July)
Goals:
  1. total 3200 stems in dix
  2. clean testvoc for <cnjcoo> <cnjadv> <cnjsub>
  3. trimmed coverage 55%
Accomplishments:
  1. stems in dix: 508
  2. trimmed coverage: 59.5%, 51.5%
Notes:
  • trimmed coverage good
  • too narrow a focus on a single corpus
  • number of stems too low
  • no new WER text
  • no testvoc
Firespeaker 20:43, 8 July 2013 (UTC)

Week 4 (7 - 13 July)
Goals:
  1. total 4000 stems in dix
  2. clean testvoc for <adv>
  3. trimmed coverage 59%
Accomplishments:
  1. stems in dix: 2574
  2. trimmed coverage: 69.3%, 63.8%
  3. azattyq_24455849 WER: 14.78%
  4. completed most of TODO-list
Notes:
  • good progress on adding stems
  • fixed little things as directed
  • good progress on post-editing process
  • didn't make good progress on reducing WER
  • still no testvoc
  • committed once every 3 or 4 days; should be committing every day
  • poor communication with mentors; needs to be around more often
Firespeaker 22:16, 22 July 2013 (UTC)

Week 5 (14 - 20 July)
Goals:
  1. total 4800 stems in dix
  2. clean testvoc for <prn> <det>
  3. trimmed coverage 63%

Week 6 (21 - 27 July)
Goals:
  1. total 5600 stems in dix
  2. clean testvoc for <adj> <adj><advl>
  3. trimmed coverage 68%
Accomplishments:
  1. stems in dix: 5552
  2. trimmed coverage: 72%, 67%
  3. azattyq_24455849 WER: 18.01%
Notes:
  • good improvement in dix
    • should be checking for errors (e.g., extra spaces)
  • not much progress with WER text
    • simple lrx and t1x should be enough here
  • still no indication of progress with testvoc
  • better communication and commit frequency, but could still improve
Firespeaker 18:21, 1 August 2013 (UTC)

Week 7 (28 July - 3 August)
Goals:
  1. total 6400 stems in dix
  2. trimmed coverage 70%

midterm eval (2 August)
Goals:
  1. total 6500 stems in dix
  2. 500-word evaluation, WER ~10%
  3. trimmed coverage 72%
Notes:
  • midterm TODO list goals only partially attained
  • overall progress has been mediocre
  • among the lowest-performing students
  • noticeable improvement in the last few weeks
  • needs to improve more to pass the final
Firespeaker 18:26, 1 August 2013 (UTC)

Week 8 (4 - 10 August)
Goals:
  1. total 7200 stems in dix
  2. clean testvoc for <n> <num><subst> <np> <adj><subst>
  3. trimmed coverage 75%
Accomplishments:
  1. stems in dix: 6493
  2. trimmed coverage: 79.6%, 74.1%

Week 9 (11 - 17 August)
Goals:
  1. total 8000 stems in dix
  2. trimmed coverage 78%

Week 10 (18 - 24 August)
Goals:
  1. total 8800 stems in dix
  2. trimmed coverage 81%

Week 11 (25 - 31 August)
Goals:
  1. total 9600 stems in dix
  2. clean testvoc for <v>
  3. trimmed coverage 83%
Accomplishments:
  1. stems in dix: 6730
  2. trimmed coverage: 82.5%, 78.4%
  3. azattyq_24455849 WER: 6.62%

Week 12 (1 - 7 September)
Goals:
  1. total 10400 stems in dix
  2. trimmed coverage 85%
Accomplishments:
  1. stems in dix: 7007
  2. trimmed coverage: 84.2%, 79.8%
Notes:
  • Good adjective typology
  • Decent progress on coverage
  • Not around much later in the week
  • Still no testvoc...
Firespeaker 07:29, 10 September 2013 (UTC)

Week 13 (8 - 15 September)
Goals:
  1. total 11200 stems in dix
  2. trimmed coverage 87%
Accomplishments:
  1. stems in dix: 7454
  2. trimmed coverage: 85.2%, 80.4%
Notes:
  • Decent increase in coverage
  • Still no testvoc
  • Still ~600 unsorted ADJ
  • Not around much
Firespeaker 20:06, 22 September 2013 (UTC)

pencils-down week, final evaluation (16 - 23 September)
Goals:
  1. total 12000 stems in dix
  2. 500-word evaluation, WER ~10%
  3. clean testvoc for all categories
  4. trimmed coverage 88%
  5. release 0.1.0 and move to trunk
Accomplishments:
  1. stems in dix: 7546
  2. trimmed coverage: 85.8%, 81.6%
Notes:
  • Good coverage
  • "Good" WER results
    • But lots of # and * errors :(
  • No work on testvoc
  • Some ADJ sorted; still >500 unsorted
  • only 2 sets of LRX rules since early in GSoC
  • only 1 transfer rule since early in GSoC

Final evaluation
Notes:
  • Has improved coverage a certain amount
  • Has not done anything else
  • Mentors have had to nag to get him to work
  • Has not been around enough
  • Among the lowest-performing students
  • Has not improved since midterm
  • Last-ditch efforts not at all impressive

Tips and Tricks

Adding stems quickly

  • Add top stems from frequency lists of unknown forms
  • Use spectie's dix-entries-to-be-checked script
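
A frequency list of unknown forms can be built from analyser output. The sketch below assumes the text is in the Apertium stream format, where an unanalysed surface form appears as ^form/*form$. The function names and the sample string are hypothetical and are not part of any Apertium tool or of the script mentioned above.

```python
import re
from collections import Counter

# Matches one lexical unit whose analysis starts with '*', i.e. an unknown form.
# Group 1 captures the surface form between '^' and the first '/'.
UNKNOWN_UNIT = re.compile(r"\^([^/^$]+)/\*[^$]*\$")


def unknown_frequencies(analysed_text: str) -> Counter:
    """Count how often each unknown surface form occurs in analyser output."""
    return Counter(match.group(1) for match in UNKNOWN_UNIT.finditer(analysed_text))


def top_unknown(analysed_text: str, n: int = 10) -> list:
    """Return the n most frequent unknown forms as (form, count) pairs."""
    return unknown_frequencies(analysed_text).most_common(n)


if __name__ == "__main__":
    # Hypothetical sample: one known unit and three unknown units.
    sample = "^алма/алма<n><nom>$ ^фуу/*фуу$ ^бар/*бар$ ^фуу/*фуу$"
    for form, count in top_unknown(sample):
        print(count, form)
```

Working down such a list from the top adds the stems that contribute most to coverage first.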