Difference between revisions of "User:Eden/GSoC progress"
		
		
		
		
		
		
		
				
		
		
		
		
		
		
		
	
m (→Status table)  | 
				|||
| Line 1: | Line 1: | ||
== Community Bonding Period ==  | 
|||
* Find Swahili-Lingala resources   | 
|||
* Update Lingala lexc transducer to lexd  | 
|||
* New lexd transducer for Swahili   | 
|||
* Keep track of coverage for Lin and Swa transducers   | 
|||
* Get familiar with apertium-recursive   | 
|||
* Set up <code>swa-lin</code> pair using apertium-recursive  | 
|||
* Update GSOC progress page  | 
|||
== Status table ==  | 
  == Status table ==  | 
||
| Line 19: | Line 28: | ||
!Evaluation  | 
  !Evaluation  | 
||
!Notes  | 
  !Notes  | 
||
|-  | 
  |||
| 0  | 
  |||
| May 20 - May 26  | 
  |||
| 727  | 
  |||
| 139  | 
  |||
| 61.95%  | 
  |||
| 40.86%  | 
  |||
| 86.79%,80.87%  | 
  |||
| 75.27%,63.98%  | 
  |||
|   | 
  |||
|  | 
  |||
|-  | 
  |-  | 
||
| 1  | 
  | 1  | 
||
|   | 
  | June 1 - June 7  | 
||
| 904  | 
  |||
| 139  | 
  |||
| 62.57%  | 
  |||
| 40.86%  | 
  |||
| 86.79%,80.87%  | 
  |||
| 75.27%,63.98%  | 
  |||
|  | 
  |||
|  | 
  |||
|-  | 
  |-  | 
||
| 2  | 
  | 2  | 
||
| May   | 
  | June 8 - June 14  | 
||
| 1,154  | 
  |||
| 1,416  | 
  |||
| 63.17%  | 
  |||
| 53.03%  | 
  |||
| 87.02%,79.95%  | 
  |||
| 74.46%,60.22%  | 
  |||
|  | 
  |||
|-  | 
  |-  | 
||
| 3  | 
  | 3  | 
||
| June   | 
  | June 15 - June 21  | 
||
| 1,172  | 
  |||
| 1,501  | 
  |||
|   | 
  |||
| 61.60%  | 
  |||
| 91.57%,79.04%  | 
  |||
| 75.85%,62.90%  | 
  |||
|  | 
  |||
| WER for 'lin-eng' went up because of an incomplete rule for verbs that creates unnecessary pronouns. Main work next week will be on rules to dramatically improve WER and PER.  | 
  |||
|-  | 
  |-  | 
||
| 4  | 
  | 4  | 
||
| June   | 
  | June 22 - June 28  | 
||
| 1,200  | 
  |||
| 1,540  | 
  |||
| 69.70%  | 
  |||
| 62.70%  | 
  |||
| 79.27%,64.24%  | 
  |||
| 84.41%,72.58%  | 
  |||
|  | 
  |||
|   | 
  |||
|-  | 
  |-  | 
||
| 5  | 
  | 5  | 
||
| June   | 
  | June 29 - July 5  | 
||
| 1,200  | 
  |||
| 1,556  | 
  |||
| 70.21%  | 
  |||
| 61.90%  | 
  |||
| 77.68%,67.88%  | 
  |||
| 85.48%,73.92%  | 
  |||
|  | 
  |||
|   | 
  |||
|-  | 
  |-  | 
||
| 6  | 
  | 6  | 
||
| July   | 
  | July 6 - July 12  | 
||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|-  | 
  |-  | 
||
| 7  | 
  | 7  | 
||
| July   | 
  | July 13 - July 19  | 
||
|1,236  | 
  |||
|1,577  | 
  |||
|69.35%  | 
  |||
|60.47%  | 
  |||
|60.59%,46.47%  | 
  |||
|72.61%,58.68%  | 
  |||
|  | 
  |||
|Worked on lexical selection and rules for determiners. The current lexical selection works well on the text currently in use, which is written in a more rigid, literary Lingala. Further tests will be run on texts from the Wikipedia corpus to generalize the lexical rules.  | 
  |||
|-  | 
  |-  | 
||
| 8  | 
  | 8  | 
||
| July   | 
  | July 20 - July 26  | 
||
|1,280   | 
  |||
|1,580  | 
  |||
|72.81%  | 
  |||
|68.62%  | 
  |||
|52.62%,42.82%  | 
  |||
|59.04%,46.28%  | 
  |||
|  | 
  |||
| WER went down in both directions by approximately 2% after I added accents and the missing characters ɔ́, ɔ, ɛ́, ɛ. The next focus will be on negation and on finding a bigger corpus (>1000 words).  | 
  |||
|-  | 
  |-  | 
||
| 9  | 
  | 9  | 
||
| July   | 
  | July 27 - Aug 2  | 
||
| 1,320  | 
  |||
| 1,600  | 
  |||
| 73.24%  | 
  |||
| 68.92%  | 
  |||
| 50.02%,41.55%  | 
  |||
| 52.81%,40.09%  | 
  |||
|  | 
  |||
|-  | 
  |-  | 
||
| 10  | 
  | 10  | 
||
| July   | 
  | Aug 3 - Aug 9  | 
||
|   | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
|  | 
  |||
| Work was mainly on lexical selection rules. The first half of the Bible translation (~1,100 words) is understandable.  | 
  |||
|-  | 
  |-  | 
||
| 11  | 
  | 11  | 
||
| Aug   | 
  | Aug 10 - Aug 16  | 
||
| 1,341  | 
  |||
| 1,661  | 
  |||
| 75.35%  | 
  |||
| 69.33%  | 
  |||
| 48.97%,39.18%  | 
  |||
| 53.99%,41.49%  | 
  |||
|  | 
  |||
| Lexical selection rules for 'na' and 'ya'. WER in eng-lin went up because I commented out some words in the bidix.   | 
  |||
|-  | 
  |-  | 
||
| 12  | 
  | 12  | 
||
| Aug   | 
  | Aug 17 - Aug 23  | 
||
| 1,444  | 
  |||
| 1,700  | 
  |||
| 76.5%  | 
  |||
| 71.10%  | 
  |||
| 48.52%,37.81%  | 
  |||
| 50.13%,38.13%  | 
  |||
|  | 
  |||
| Added missing morphology for determiners and adjectives.   | 
  |||
|-  | 
  |-  | 
||
|}  | 
  |}  | 
||
Revision as of 04:45, 19 May 2020
Community Bonding Period
- Find Swahili-Lingala resources
- Update Lingala lexc transducer to lexd
- New lexd transducer for Swahili
- Keep track of coverage for Lin and Swa transducers
- Get familiar with apertium-recursive
- Set up swa-lin pair using apertium-recursive
- Update GSOC progress page
 
Status table
| Week № | Dates | Stems (lin) | Stems (lin-eng) | Naïve coverage (lin) | Naïve coverage (lin-eng) | WER, PER (lin→eng) | WER, PER (eng→lin) | Evaluation | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | June 1 - June 7 | | | | | | | | |
| 2 | June 8 - June 14 | | | | | | | | |
| 3 | June 15 - June 21 | | | | | | | | |
| 4 | June 22 - June 28 | | | | | | | | |
| 5 | June 29 - July 5 | | | | | | | | |
| 6 | July 6 - July 12 | | | | | | | | |
| 7 | July 13 - July 19 | | | | | | | | |
| 8 | July 20 - July 26 | | | | | | | | |
| 9 | July 27 - Aug 2 | | | | | | | | |
| 10 | Aug 3 - Aug 9 | | | | | | | | |
| 11 | Aug 10 - Aug 16 | | | | | | | | |
| 12 | Aug 17 - Aug 23 | | | | | | | | |
Notes
- To count stems in lexc, try:
  grep -E ":\w+.*;" apertium-lin.lin.lexc | grep -v "[<>]" | wc -l
- To count stems in the bidix, try:
  grep "<p" apertium-eng-lin.eng-lin.dix | wc -l
- To get WER and PER, use apertium-eval-translator-line.
- Coverage above is on the 2019-05-20 Wikipedia dump.
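
The naïve coverage column can be approximated with a small shell helper in the same spirit as the grep commands above. This is only a sketch, not necessarily how the figures on this page were produced. It assumes the analyser output is in Apertium stream format, where each token is wrapped as ^surface/analysis$ and unknown tokens are marked with "/*". The function name naive_coverage is made up for illustration.

```shell
# Naive coverage = share of tokens that received at least one analysis.
# Reads Apertium stream format on stdin, e.g.  ^na/na<pr>$ ^xyz/*xyz$
# Assumption: unknown tokens carry the "/*" marker inside the ^...$ unit.
naive_coverage() {
    # Put every ^...$ lexical unit on its own line, then count in awk.
    grep -o '\^[^$]*\$' | awk '
        { total++ }                 # every lexical unit is one token
        /\/\*/ { unknown++ }        # units with "/*" have no analysis
        END {
            if (total == 0) { print "0.00%"; exit }
            printf "%.2f%%\n", 100 * (total - unknown) / total
        }'
}
```

A possible invocation, assuming a mode called lin-morph exists in the language directory and corpus.txt is a plain-text extract of the Wikipedia dump (both names are assumptions):

  apertium -d . lin-morph < corpus.txt | naive_coverage
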