Difference between revisions of "User:Eden/GSoC progress"
m (→Status table) |
|||
(18 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
+ | == Community Bonding Period == |
||
+ | * Find Swahili-Lingala resources |
||
+ | * Update Lingala lexc transducer to lexd |
||
+ | * New lexd transducer for Swahili |
||
+ | * Keep track of coverage for Lin and Swa transducers |
||
+ | * Get familiar with apertium-recursive |
||
+ | * Set up <code>swa-lin</code> pair using apertium-recursive |
||
+ | * Update GSOC progress page |
||
+ | * James and Mary story + Wikipedia article in Swahili and Lingala. |
||
+ | |||
+ | == Goals == |
||
+ | * By first evaluation: have story about kids or similar text to WER/PER of around 20% (work with all stages of translation, focus on "lowest-hanging fruit" relevant to the text) |
||
+ | * By second evaluation: increase [trimmed] coverage to around 90% (work focused on lexicons, adding from frequency lists) |
||
+ | * By final evaluation: work to get clean testvoc (work focused on transfer, making sure everything is dealt with one way or other) |
||
+ | |||
== Status table == |
== Status table == |
||
Line 4: | Line 19: | ||
|- |
|- |
||
!colspan="2"|Week |
!colspan="2"|Week |
||
− | !colspan=" |
+ | !colspan="3"|Stems |
− | !colspan=" |
+ | !colspan="3"|naïve coverage |
!colspan="2"|WER,PER |
!colspan="2"|WER,PER |
||
!colspan="2"|Progress |
!colspan="2"|Progress |
||
Line 11: | Line 26: | ||
! № |
! № |
||
! dates |
! dates |
||
+ | ! swa |
||
! lin |
! lin |
||
− | ! |
+ | ! swa-lin |
+ | ! swa |
||
! lin |
! lin |
||
− | ! |
+ | ! swa-lin |
+ | ! swa→lin |
||
− | ! lin→eng |
||
+ | ! lin→swa |
||
− | ! eng→lin |
||
!Evaluation |
!Evaluation |
||
!Notes |
!Notes |
||
|- |
|- |
||
+ | | 0 (community bonding) |
||
− | | 0 |
||
− | | May |
+ | | May 4 - May 31 |
− | | |
+ | | 86 |
− | | |
+ | | 1,444 |
− | | |
+ | | 26 |
+ | | |
||
− | | 40.86% |
||
+ | | |
||
− | | 86.79%,80.87% |
||
+ | | |
||
− | | 75.27%,63.98% |
||
− | | |
+ | | |
+ | | |
||
+ | | |
||
| |
| |
||
|- |
|- |
||
| 1 |
| 1 |
||
− | | |
+ | | June 1 - June 7 |
− | | |
+ | | 86 |
− | | |
+ | | 1,444 |
− | | |
+ | | 26 |
+ | | |
||
− | | 40.86% |
||
+ | | |
||
− | | 86.79%,80.87% |
||
+ | | |
||
− | | 75.27%,63.98% |
||
+ | | |
||
+ | | |
||
| |
| |
||
| |
| |
||
|- |
|- |
||
| 2 |
| 2 |
||
− | | May |
+ | June 8 - June 14 |
− | | |
+ | | 170 |
− | | 1, |
+ | | 1,444 |
− | | |
+ | | 26 |
+ | | |
||
− | | 53.03% |
||
− | | 87.02%,79.95% |
||
− | | 74.46%,60.22% |
||
| |
| |
||
− | |- |
||
− | | 3 |
||
− | | June 10 - June 16 |
||
− | | 1,172 |
||
− | | 1,501 |
||
− | | |
||
− | | 61.60% |
||
− | | 91.57%,79.04% |
||
− | | 75.85%,62.90% |
||
| |
| |
||
− | | WER for 'lin-eng' went up because of an incomplete rule for verbs that creates unnecessary pronouns. Main work next week will be on rules to dramatically improve WER and PER. |
||
− | |- |
||
− | | 4 |
||
− | | June 17 - June 23 |
||
− | | 1,200 |
||
− | | 1,540 |
||
− | | 69.70% |
||
− | | 62.70% |
||
− | | 79.27%,64.24% |
||
− | | 84.41%,72.58% |
||
| |
| |
||
− | | |
||
− | |- |
||
− | | 5 |
||
− | | June 24 - June 30 |
||
− | | 1,200 |
||
− | | 1,556 |
||
− | | 70.21% |
||
− | | 61.90% |
||
− | | 77.68%,67.88% |
||
− | | 85.48%,73.92% |
||
| |
| |
||
| |
| |
||
+ | | Number of stems in lin transducer comes from prev. estimates. Manually counted stems in swa transducer |
||
|- |
|- |
||
− | | |
+ | | 3 |
− | | |
+ | | June 15 - June 21 |
+ | | 303 |
||
+ | | 1,444 |
||
+ | | 26 |
||
| |
| |
||
| |
| |
||
Line 93: | Line 87: | ||
| |
| |
||
| |
| |
||
+ | | work was mainly collecting and finding stems. |
||
+ | |- |
||
+ | | 4 |
||
+ | | June 22 - June 28 |
||
+ | | 6,667 |
||
+ | | 1,716 |
||
+ | | 1,436 |
||
| |
| |
||
+ | | 76.5% |
||
+ | | |
||
+ | | 94.40% |
||
+ | | 107.95% |
||
| |
| |
||
+ | | several duplicates in the swa transducer. |
||
+ | |- |
||
+ | | 5 |
||
+ | | June 29 - July 5 |
||
+ | |- |
||
+ | | 6 |
||
+ | | July 6 - July 12 |
||
|- |
|- |
||
| 7 |
| 7 |
||
− | | July |
+ | | July 13 - July 19 |
− | |1,236 |
||
− | |1,577 |
||
− | |69.35% |
||
− | |60.47% |
||
− | |60.59%,46.47% |
||
− | |72.61%,58.68% |
||
− | | |
||
− | |Work was done on lexical selection and rules about determinants. Current lexical selection works well with the text currently in use, which is a more rigid and literary Lingala. Further tests will be run on texts from the Wikipedia corpus to generalize lexical rules. |
||
|- |
|- |
||
| 8 |
| 8 |
||
− | | July |
+ | | July 20 - July 26 |
− | | |
+ | |- |
− | | |
+ | | 9 |
+ | | July 27 - Aug 2 |
||
− | | |
||
− | | |
+ | |- |
+ | | 10 |
||
− | |59.68%,47.84% |
||
+ | Aug 3 - Aug 9 |
||
− | |57.98%,45.21% |
||
− | | |
+ | |- |
+ | | 11 |
||
− | | WER went down in both directions by approximately 2% after I added accents, and missing ɔ́ ɔ ɛ́ ɛ. Next focus will be on negation and trying to find a bigger corpus(>1000 words). |
||
+ | | Aug 10 - Aug 16 |
||
− | |||
+ | |- |
||
+ | | 12 |
||
+ | | Aug 17 - Aug 23 |
||
+ | |- |
||
|} |
|} |
||
+ | |||
+ | == Work == |
||
+ | * June 8 - June 14 |
||
+ | - verb, noun, adjective morphotactics in swa transducer |
||
+ | * June 15 - June 21 |
||
+ | - add missing verb TAM (continuative, reciprocal, causative)<br/> |
||
+ | - more subsections in 'Verb Morphotactics'<br/> |
||
+ | - add stems in swa transducer <br/> |
||
+ | - start writing transfer rules <br/> |
||
== Notes == |
== Notes == |
Latest revision as of 15:06, 27 June 2020
Community Bonding Period
- Find Swahili-Lingala resources
- Update Lingala lexc transducer to lexd
- New lexd transducer for Swahili
- Keep track of coverage for Lin and Swa transducers
- Get familiar with apertium-recursive
- Set up swa-lin pair using apertium-recursive
- Update GSOC progress page
- James and Mary story + Wikipedia article in Swahili and Lingala.
Goals
- By first evaluation: have story about kids or similar text to WER/PER of around 20% (work with all stages of translation, focus on "lowest-hanging fruit" relevant to the text)
- By second evaluation: increase [trimmed] coverage to around 90% (work focused on lexicons, adding from frequency lists)
- By final evaluation: work to get clean testvoc (work focused on transfer, making sure everything is dealt with one way or other)
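The WER/PER target in the first goal can be made concrete with a small sketch. It assumes whitespace tokenisation and the common textbook definitions of word error rate (word-level edit distance over reference length) and position-independent error rate (bag-of-words mismatch over reference length). The function names are illustrative, and the numbers that apertium-eval-translator-line reports may differ in detail.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate in percent (word order ignored)."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)
```

Because the denominator is the reference length, WER can exceed 100% when the translation is longer than the reference or shares very few words with it.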
Status table
№ | dates | Stems: swa | Stems: lin | Stems: swa-lin | naïve coverage: swa | naïve coverage: lin | naïve coverage: swa-lin | WER,PER: swa→lin | WER,PER: lin→swa | Evaluation | Notes
---|---|---|---|---|---|---|---|---|---|---|---
0 (community bonding) | May 4 - May 31 | 86 | 1,444 | 26 | | | | | | |
1 | June 1 - June 7 | 86 | 1,444 | 26 | | | | | | |
2 | June 8 - June 14 | 170 | 1,444 | 26 | | | | | | | The lin stem count comes from previous estimates; swa stems were counted manually.
3 | June 15 - June 21 | 303 | 1,444 | 26 | | | | | | | Work was mainly collecting and finding stems.
4 | June 22 - June 28 | 6,667 | 1,716 | 1,436 | | 76.5% | | 94.40% | 107.95% | | Several duplicates in the swa transducer.
5 | June 29 - July 5 | | | | | | | | | |
6 | July 6 - July 12 | | | | | | | | | |
7 | July 13 - July 19 | | | | | | | | | |
8 | July 20 - July 26 | | | | | | | | | |
9 | July 27 - Aug 2 | | | | | | | | | |
10 | Aug 3 - Aug 9 | | | | | | | | | |
11 | Aug 10 - Aug 16 | | | | | | | | | |
12 | Aug 17 - Aug 23 | | | | | | | | | |
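Naïve coverage in the table is the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming analyser output in the Apertium stream format, where each token is written as ^surface/analysis$ and an unknown token as ^surface/*surface$. The function name and the sample analyses are illustrative, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text):
    """Percentage of lexical units whose first analysis is not marked unknown (*)."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/", 1)
        # A unit counts as known when it has an analysis not starting with '*'.
        if len(parts) == 2 and not parts[1].startswith("*"):
            known += 1
    return 100.0 * known / len(units)


# Illustrative analyser output (tags are made up for the example).
sample = "^mtoto/mtoto<n>$ ^xyz/*xyz$ ^na/na<cnjcoo>/na<pr>$ ^abc/*abc$"
print(naive_coverage(sample))
```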
Work
- June 8 - June 14
  - verb, noun, adjective morphotactics in swa transducer
- June 15 - June 21
  - add missing verb TAM (continuative, reciprocal, causative)
  - more subsections in 'Verb Morphotactics'
  - add stems in swa transducer
  - start writing transfer rules
Notes
- To count stems in lexc, try:
  grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
- To count stems in the bidix, try this:
  grep "<p" apertium-eng-lin.eng-lin.dix | wc -l
- To get WER and PER use apertium-eval-translator-line
- Coverage above is on the 2019-05-20 Wikipedia dump.
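The two grep counts above can also be reproduced in Python, which may be convenient when the numbers are logged every week. This is only a sketch that mirrors the same line-based heuristics: the function names and sample entries are illustrative, and Python's \w is Unicode-aware, so counts could differ slightly from grep on some inputs.

```python
import re

ENTRY = re.compile(r":\w+.*;")   # same pattern as the first grep
TAGGED = re.compile(r"[<>]")     # lines filtered out by grep -v "[<>]"


def count_lexc_stems(lines):
    """Count lines that look like lexc stem entries and contain no tag symbols."""
    return sum(1 for line in lines
               if ENTRY.search(line) and not TAGGED.search(line))


def count_bidix_entries(lines):
    """Count lines containing "<p", as grep "<p" | wc -l does."""
    return sum(1 for line in lines if "<p" in line)


# Illustrative input; the entries and continuation classes are made up.
lexc_lines = [
    "LEXICON Nouns",
    "mwana:mwan N1 ;",
    "%<n%>:0 # ;",
    "moto:mot N2 ; ! fire",
    "! comment only",
]
print(count_lexc_stems(lexc_lines))
```

To run it on a real file, pass an open file handle, for example count_lexc_stems(open("apertium-lin.lin.lexc", encoding="utf-8")).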