Difference between revisions of "User:Eden/GSoC progress"

From Apertium
 
(30 intermediate revisions by 2 users not shown)
Line 1: Line 1:
  +
== Community Bonding Period ==
  +
* Find Swahili-Lingala resources
  +
* Update Lingala lexc transducer to lexd
  +
* New lexd transducer for Swahili
  +
* Keep track of coverage for Lin and Swa transducers
  +
* Get familiar with apertium-recursive
  +
* Set up <code>swa-lin</code> pair using apertium-recursive
  +
* Update GSOC progress page
  +
* James and Mary story + Wikipedia article in Swahili and Lingala.
  +
  +
== Goals ==
  +
* By first evaluation: have story about kids or similar text to WER/PER of around 20% (work with all stages of translation, focus on "lowest-hanging fruit" relevant to the text)
  +
* By second evaluation: increase [trimmed] coverage to around 90% (work focused on lexicons, adding from frequency lists)
  +
* By final evaluation: work to get clean testvoc (work focused on transfer, making sure everything is dealt with one way or other)
  +
 
== Status table ==
 
== Status table ==
   
Line 4: Line 19:
 
|-
 
|-
 
!colspan="2"|Week
 
!colspan="2"|Week
!colspan="2"|Stems
+
!colspan="3"|Stems
!colspan="2"|naïve coverage
+
!colspan="3"|naïve coverage
 
!colspan="2"|WER,PER
 
!colspan="2"|WER,PER
 
!colspan="2"|Progress
 
!colspan="2"|Progress
Line 11: Line 26:
 
! №
 
! №
 
! dates
 
! dates
  +
! swa
 
! lin
 
! lin
! lin-eng
+
! swa-lin
  +
! swa
 
! lin
 
! lin
! lin-eng
+
! swa-lin
  +
! swa→lin
! lin→eng
 
  +
! lin→swa
! eng→lin
 
 
!Evaluation
 
!Evaluation
 
!Notes
 
!Notes
 
|-
 
|-
  +
| 0 (community bonding)
| 0
 
| May 20 - May 26
+
| May 4 - May 31
| 727
+
| 86
| 139
+
| 1,444
| 61.95%
+
| 26
  +
|
| 40.86%
 
  +
|
| 86.79%,80.87%
 
  +
|
| 75.27%,63.98%
 
|
+
|
  +
|
  +
|
 
|
 
|
 
|-
 
|-
 
| 1
 
| 1
| May 27 - June 02
+
| June 1 - June 7
| 904
+
| 86
| 139
+
| 1,444
| 62.57%
+
| 26
  +
|
| 40.86%
 
  +
|
| 86.79%,80.87%
 
  +
|
| 75.27%,63.98%
 
  +
|
  +
|
 
|
 
|
 
|
 
|
 
|-
 
|-
 
| 2
 
| 2
| May 03 - June 09
+
| June 8 - June 14
| 1,154
+
| 170
| 1,416
+
| 1,444
| 63.17%
+
| 26
| 53.03%
 
| 87.02%,79.95%
 
| 74.46%,60.22%
 
 
|
 
|
  +
|
  +
|
  +
|
  +
|
 
|
  +
| Number of stems in lin transducer comes from prev. estimates. Manually counted stems in swa transducer
 
|-
 
|-
 
| 3
 
| 3
| June 10 - June 16
+
| June 15 - June 21
| 1,172
+
| 303
| 1,501
+
| 1,444
|
+
| 26
  +
|
| 61.60%
 
  +
|
| 91.57%,79.04%
 
  +
|
| 75.85%,62.90%
 
  +
|
  +
|
 
|
 
|
  +
| work was mainly collecting and finding stems.
| WER for 'lin-eng' went up because of an incomplete rule for verbs that creates unnecessary pronouns. Main work next week will be on rules to dramatically improve WER and PER.
 
 
|-
 
|-
 
| 4
 
| 4
| June 17 - June 23
+
| June 22 - June 28
| 1,178
+
| 6,667
| 1,510
+
| 1,716
| 61.71%
+
| 1,436
| 61.52%
 
| 85.65%,71.98%
 
| 81.72%,68.82%
 
 
|
 
|
 
| 76.5%
 
|
 
|
 
| 94.40%
 
| 107.95%
  +
|
  +
| several duplicates in the swa transducer.
  +
|-
  +
| 5
  +
| June 29 - July 5
  +
|-
  +
| 6
  +
| July 6 - July 12
  +
|-
  +
| 7
  +
| July 13 - July 19
  +
|-
  +
| 8
  +
| July 20 - July 26
  +
|-
  +
| 9
  +
| July 27 - Aug 2
  +
|-
  +
| 10
  +
| Aug 3 - Aug 9
  +
|-
  +
| 11
  +
| Aug 10 - Aug 16
  +
|-
  +
| 12
  +
| Aug 17 - Aug 23
 
|-
 
|-
 
|}
 
|}
  +
  +
== Work ==
  +
* June 8 - June 14
  +
- verb, noun, adjective morphotactics in swa transducer
  +
* June 15 - June 21
  +
- add missing verb TAM (continuative, reciprocal, causative)<br/>
  +
- more subsections in 'Verb Morphotatics'<br/>
  +
- add stems in swa transducer <br/>
  +
- start writing transfer rules <br/>
   
 
== Notes ==
 
== Notes ==

Latest revision as of 15:06, 27 June 2020

Community Bonding Period

  • Find Swahili-Lingala resources
  • Update Lingala lexc transducer to lexd
  • New lexd transducer for Swahili
  • Keep track of coverage for Lin and Swa transducers
  • Get familiar with apertium-recursive
  • Set up swa-lin pair using apertium-recursive
  • Update GSOC progress page
  • James and Mary story + Wikipedia article in Swahili and Lingala.

Goals

  • By first evaluation: have story about kids or similar text to WER/PER of around 20% (work with all stages of translation, focus on "lowest-hanging fruit" relevant to the text)
  • By second evaluation: increase [trimmed] coverage to around 90% (work focused on lexicons, adding from frequency lists)
  • By final evaluation: work to get clean testvoc (work focused on transfer, making sure everything is dealt with one way or other)

Status table

№ | Dates | Stems: swa | Stems: lin | Stems: swa-lin | Naïve coverage: swa | Naïve coverage: lin | Naïve coverage: swa-lin | WER,PER: swa→lin | WER,PER: lin→swa | Evaluation | Notes
0 (community bonding) | May 4 - May 31 | 86 | 1,444 | 26 | | | | | | |
1 | June 1 - June 7 | 86 | 1,444 | 26 | | | | | | |
2 | June 8 - June 14 | 170 | 1,444 | 26 | | | | | | | Number of stems in lin transducer comes from previous estimates. Stems in swa transducer were counted manually.
3 | June 15 - June 21 | 303 | 1,444 | 26 | | | | | | | Work was mainly collecting and finding stems.
4 | June 22 - June 28 | 6,667 | 1,716 | 1,436 | | 76.5% | | 94.40% | 107.95% | | Several duplicates in the swa transducer.
5 | June 29 - July 5
6 | July 6 - July 12
7 | July 13 - July 19
8 | July 20 - July 26
9 | July 27 - Aug 2
10 | Aug 3 - Aug 9
11 | Aug 10 - Aug 16
12 | Aug 17 - Aug 23
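
The naïve coverage columns count the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation, assuming the analyser output is already in Apertium stream format (`^surface/analysis$`, with unknown words written as `^surface/*surface$`). The function name `naive_coverage` and the sample input are hypothetical, and running the analyser itself is left out:

```shell
# naive_coverage: read Apertium stream-format analyses on stdin and
# print the percentage of tokens that have at least one analysis.
naive_coverage() {
  grep -o '\^[^$]*\$' | awk '
    { total++ }               # grep -o emits one lexical unit per line
    /\/\*/ { unknown++ }      # unknown words look like ^word/*word$
    END {
      if (total == 0) { print "no tokens" > "/dev/stderr"; exit 1 }
      printf "%.2f%%\n", 100 * (total - unknown) / total
    }'
}

# Hypothetical sample: two analysed tokens and two unknown ones.
printf '^mtoto/mtoto<n>$ ^xyz/*xyz$ ^na/na<cnjcoo>$ ^abc/*abc$\n' | naive_coverage
# prints 50.00%
```

This counts every token equally, so a frequent unknown word lowers the figure more than a rare one.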

Work

  • June 8 - June 14
      - verb, noun, adjective morphotactics in swa transducer
  • June 15 - June 21
      - add missing verb TAM (continuative, reciprocal, causative)
      - more subsections in 'Verb Morphotatics'
      - add stems in swa transducer
      - start writing transfer rules

Notes

  • To count stems in lexc, try:
 grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
  • To count stems in the bidix, try this:
 grep "<p" apertium-eng-lin.eng-lin.dix  | wc -l
  • To get WER and PER use apertium-eval-translator-line
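
As a rough illustration of what the WER figure measures (not a replacement for apertium-eval-translator-line, whose options are not shown here), here is a minimal sketch of word error rate: word-level edit distance divided by the number of reference words, summed over lines. The function name `wer` and the file names are hypothetical. Values above 100% are possible when the test output needs more edits than the reference has words.

```shell
# wer REF_FILE TEST_FILE: compare the two files line by line and print
# total word-level edit distance as a percentage of reference words.
wer() {
  paste "$1" "$2" | awk -F '\t' '
    {
      n = split($1, r, " ")          # reference words
      m = split($2, h, " ")          # test (hypothesis) words
      for (j = 0; j <= m; j++) d[0, j] = j
      for (i = 1; i <= n; i++) {
        d[i, 0] = i
        for (j = 1; j <= m; j++) {
          c = ((r[i] "") == (h[j] "")) ? 0 : 1   # substitution cost
          best = d[i-1, j-1] + c
          if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
          if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
          d[i, j] = best
        }
      }
      errors += d[n, m]
      words += n
    }
    END {
      if (words == 0) exit 1
      printf "%.2f%%\n", 100 * errors / words
    }'
}

# Hypothetical sample: one missing word out of six reference words.
tmp=$(mktemp -d)
printf 'the cat sat on the mat\n' > "$tmp/ref.txt"
printf 'the cat sat on mat\n' > "$tmp/test.txt"
wer "$tmp/ref.txt" "$tmp/test.txt"   # prints 16.67%
rm -r "$tmp"
```

The sketch assumes one sentence per line, the same number of lines in both files, and no tab characters inside a line.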