Difference between revisions of "User:Eden/GSoC progress"

From Apertium
(22 intermediate revisions by 2 users not shown)
Line 26: Line 26:
 
| 61.95%
 
| 61.95%
 
| 40.86%
 
| 40.86%
  +
| 86.79%,80.87%
|
 
  +
| 75.27%,63.98%
|
 
 
|
 
|
 
|
 
|
Line 33: Line 33:
 
| 1
 
| 1
 
| May 27 - June 02
 
| May 27 - June 02
  +
| 904
  +
| 139
  +
| 62.57%
  +
| 40.86%
  +
| 86.79%,80.87%
  +
| 75.27%,63.98%
 
|
 
|
 
|
  +
|-
  +
| 2
  +
| June 03 - June 09
  +
| 1,154
  +
| 1,416
  +
| 63.17%
  +
| 53.03%
  +
| 87.02%,79.95%
  +
| 74.46%,60.22%
 
|
  +
|-
  +
| 3
  +
| June 10 - June 16
  +
| 1,172
  +
| 1,501
  +
|
  +
| 61.60%
  +
| 91.57%,79.04%
  +
| 75.85%,62.90%
  +
|
  +
| WER for 'lin-eng' went up because of an incomplete rule for verbs that creates unnecessary pronouns. Main work next week will be on rules to dramatically improve WER and PER.
  +
|-
  +
| 4
  +
| June 17 - June 23
  +
| 1,200
  +
| 1,540
  +
| 69.70%
  +
| 62.70%
  +
| 79.27%,64.24%
  +
| 84.41%,72.58%
  +
|
  +
|
  +
|-
  +
| 5
  +
| June 24 - June 30
  +
| 1,200
  +
| 1,556
  +
| 70.21%
  +
| 61.90%
  +
| 77.68%,67.88%
  +
| 85.48%,73.92%
  +
|
  +
|
  +
|-
  +
| 6
  +
| July 1 - July 7
  +
|
  +
|
  +
|
  +
|
  +
|
  +
|
  +
|
  +
|
  +
|-
  +
| 7
  +
| July 8 - July 14
  +
|1,236
  +
|1,577
  +
|69.35%
  +
|60.47%
  +
|60.59%,46.47%
  +
|72.61%,58.68%
  +
|
  +
|Work was done on lexical selection and on rules for determiners. The current lexical selection works well on the text currently in use, which is written in a more rigid, literary Lingala. Further tests will be run on texts from the Wikipedia corpus to generalize the lexical rules.
  +
|-
  +
| 8
  +
| July 15 - July 21
  +
|
  +
|
  +
|
  +
|
  +
|59.68%,47.84%
  +
|57.98%,45.21%
  +
|
  +
| WER went down in both directions by approximately 2% after I added accented characters and the missing letters ɔ́, ɔ, ɛ́, ɛ. The next focus will be on negation and on finding a bigger corpus (>1000 words).
  +
 
|}
 
|}
   
Line 39: Line 123:
 
* To count stems in <code>lexc</code>, try:
 
* To count stems in <code>lexc</code>, try:
 
grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
 
grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
  +
  +
* To count stems in the bidix, try this:
  +
grep "<p" apertium-eng-lin.eng-lin.dix | wc -l
  +
  +
* To get WER and PER, use <code>apertium-eval-translator-line</code>.
  +
  +
* Coverage above is on [https://dumps.wikimedia.org/lnwiki/20190520/ 2019-05-20 Wikipedia dump].

Revision as of 05:49, 18 July 2019

Status table

{| class="wikitable"
! colspan="2" | Week
! colspan="2" | Stems
! colspan="2" | naïve coverage
! colspan="2" | WER,PER
! colspan="2" | Progress
|-
!
! dates
! lin
! lin-eng
! lin
! lin-eng
! lin→eng
! eng→lin
! Evaluation
! Notes
|-
| 0 || May 20 - May 26 || 727 || 139 || 61.95% || 40.86% || 86.79%,80.87% || 75.27%,63.98% || ||
|-
| 1 || May 27 - June 02 || 904 || 139 || 62.57% || 40.86% || 86.79%,80.87% || 75.27%,63.98% || ||
|-
| 2 || June 03 - June 09 || 1,154 || 1,416 || 63.17% || 53.03% || 87.02%,79.95% || 74.46%,60.22% || ||
|-
| 3 || June 10 - June 16 || 1,172 || 1,501 || || 61.60% || 91.57%,79.04% || 75.85%,62.90% || || WER for 'lin-eng' went up because of an incomplete rule for verbs that creates unnecessary pronouns. Main work next week will be on rules to dramatically improve WER and PER.
|-
| 4 || June 17 - June 23 || 1,200 || 1,540 || 69.70% || 62.70% || 79.27%,64.24% || 84.41%,72.58% || ||
|-
| 5 || June 24 - June 30 || 1,200 || 1,556 || 70.21% || 61.90% || 77.68%,67.88% || 85.48%,73.92% || ||
|-
| 6 || July 1 - July 7 || || || || || || || ||
|-
| 7 || July 8 - July 14 || 1,236 || 1,577 || 69.35% || 60.47% || 60.59%,46.47% || 72.61%,58.68% || || Work was done on lexical selection and on rules for determiners. The current lexical selection works well on the text currently in use, which is written in a more rigid, literary Lingala. Further tests will be run on texts from the Wikipedia corpus to generalize the lexical rules.
|-
| 8 || July 15 - July 21 || || || || || 59.68%,47.84% || 57.98%,45.21% || || WER went down in both directions by approximately 2% after I added accented characters and the missing letters ɔ́, ɔ, ɛ́, ɛ. The next focus will be on negation and on finding a bigger corpus (>1000 words).
|}

Notes

  • To count stems in lexc, try:
 grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
  • To count stems in the bidix, try this:
 grep "<p" apertium-eng-lin.eng-lin.dix  | wc -l
  • To get WER and PER, use apertium-eval-translator-line.
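
To see what the two stem-counting commands above actually match, here is a minimal sketch that runs them on two tiny sample files. The file names, the lexicon names and the entries are made up for illustration and are not taken from apertium-lin or apertium-eng-lin. The function names count_lexc_stems and count_bidix_entries are also hypothetical wrappers around the pipelines from the notes.

```shell
#!/usr/bin/env bash

# Count lexc stems: lines of the form "upper:lower ... ;" that contain
# no tag brackets (< or >), same pipeline as in the note above.
count_lexc_stems() {
  grep -E ":\w+.*;" "$1" | grep -v "[<>]" | wc -l | tr -d ' '
}

# Count bidix entries: lines containing "<p" (normally one <p> pair per line).
count_bidix_entries() {
  grep "<p" "$1" | wc -l | tr -d ' '
}

# Hypothetical sample files, for illustration only.
workdir=$(mktemp -d)

cat > "$workdir/sample.lexc" <<'EOF'
LEXICON Root
Nouns ;

LEXICON Nouns
moto:moto N-1-2 ; ! person
ndako:ndako N-9-10 ; ! house
%<n%>:0 # ;
EOF

cat > "$workdir/sample.dix" <<'EOF'
<dictionary>
  <section id="main" type="standard">
    <e><p><l>house<s n="n"/></l><r>ndako<s n="n"/></r></p></e>
    <e><p><l>person<s n="n"/></l><r>moto<s n="n"/></r></p></e>
    <e><p><l>water<s n="n"/></l><r>mai<s n="n"/></r></p></e>
  </section>
</dictionary>
EOF

echo "lexc stems: $(count_lexc_stems "$workdir/sample.lexc")"       # lexc stems: 2
echo "bidix entries: $(count_bidix_entries "$workdir/sample.dix")"  # bidix entries: 3
```

Both counts are approximations. The lexc pattern skips any entry line that contains a tag, which is why the tag line in the sample is not counted. The bidix pattern counts lines, not elements, and it also matches any other tag that starts with <p, so it is only accurate when each entry sits on its own line and the file has no such tags.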