User:Eden/GSoC progress
Status table
№ | Dates | Stems (lin) | Stems (lin-eng) | Naïve coverage (lin) | Naïve coverage (lin-eng) | WER, PER (lin→eng) | WER, PER (eng→lin) | Evaluation | Notes |
---|---|---|---|---|---|---|---|---|---|
0 | May 20 - May 26 | 727 | 139 | 61.95% | 40.86% | 86.79%,80.87% | 75.27%,63.98% | ||
1 | May 27 - June 02 | 904 | 139 | 62.57% | 40.86% | 86.79%,80.87% | 75.27%,63.98% | ||
2 | June 03 - June 09 | 1,154 | 1,416 | 63.17% | 53.03% | 87.02%,79.95% | 74.46%,60.22% | ||
3 | June 10 - June 16 | 1,172 | 1,501 | 61.60% | 91.57%,79.04% | 75.85%,62.90% | WER for lin→eng went up because an incomplete verb rule generates unnecessary pronouns. Next week's main work will be on rules to bring WER and PER down. | ||
4 | June 17 - June 23 | 1,200 | 1,540 | 69.70% | 62.70% | 79.27%,64.24% | 84.41%,72.58% | ||
5 | June 24 - June 30 | 1,200 | 1,556 | 70.21% | 61.90% | 77.68%,67.88% | 85.48%,73.92% | ||
6 | July 1 - July 7 | ||||||||
7 | July 8 - July 14 | 1,236 | 1,577 | 69.35% | 60.47% | 60.59%,46.47% | 72.61%,58.68% | Work was done on lexical selection and on rules for determiners. The current lexical selection rules work well on the text currently in use, which is a more rigid and literary Lingala. Further tests will be run on texts from the Wikipedia corpus to generalize the lexical selection rules. | |
8 | July 15 - July 21 | 1,280 | 1,580 | 72.81% | 68.62% | 52.62%,42.82% | 59.04%,46.28% | WER went down in both directions by approximately 2% after I added accents and the missing characters ɔ́, ɔ, ɛ́, ɛ. Next focus will be on negation and on finding a bigger corpus (>1,000 words). | |
9 | July 22 - July 28 | 1,320 | 1,600 | 73.24% | 68.92% | 50.02%,41.55% | 52.81%,40.09% | ||
10 | July 29 - Aug 04 | Work was mainly on lexical selection rules. The first half of the Bible translation (~1,100 words) is understandable. | |||||||
11 | Aug 5 - Aug 11 | 1,341 | 1,661 | 75.35% | 69.33% | 48.97%,39.18% | 53.99%,41.49% | Lexical selection rules for 'na' and 'ya'. WER in eng-lin went up because I commented out some words in the bidix. | |
12 | Aug 12 - Aug 18 | 76.21% | Added missing morphology for determiners and adjectives. |
Notes
- To count stems in lexc, try:
grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
- To count stems in the bidix, try this:
grep "<p" apertium-eng-lin.eng-lin.dix | wc -l
- To get WER and PER, use apertium-eval-translator-line.
- Coverage figures above were measured on the 2019-05-20 Wikipedia dump.
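The notes give commands for stem counts and for WER/PER, but not for coverage. A minimal sketch of one way to compute naïve coverage, assuming the usual Apertium convention that each token is a `^surface/analysis$` lexical unit in the stream format and that unknown words get an analysis starting with `*`. The function name `naive_coverage` is hypothetical, and this is not necessarily the script used to produce the figures in the table:

```shell
# naive_coverage (hypothetical helper): reads analyser output in the
# Apertium stream format on stdin and prints the share of tokens that
# received at least one analysis.
naive_coverage() {
  # Put each ^...$ lexical unit on its own line, then count the units
  # and those whose analysis starts with "*" (unknown words).
  grep -o '\^[^$]*\$' | awk '
    { total++ }
    /\/\*/ { unknown++ }
    END { if (total) printf "%.2f%%\n", 100 * (total - unknown) / total }'
}

# Example with two tokens, one known and one unknown:
printf '^mwana/mwana<n>$ ^xyz/*xyz$\n' | naive_coverage
```

In practice the input would come from the morphological analyser run over the corpus, for example something like `apertium -d . lin-morph < corpus.txt | naive_coverage`, where the mode name `lin-morph` and the file `corpus.txt` are assumptions. The measure is "naïve" because a token counts as covered as soon as it gets any analysis, whether or not that analysis is the right one.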