User:Aidana/Proposal/Working plan
Corpora
Downloads
- Bitextor: https://www.dropbox.com/s/ajy53y55toh0n40/corpus%20Lab%20IIS%20%285925%29.kz?dl=0
- Akorda corpus: https://www.dropbox.com/s/o2d2fdtktdlsvt7/akorda-kaz-8651.txt?dl=0
- GCI kaz corpus: https://www.dropbox.com/s/dqz01r0wzw2u2op/1452262000_kaz.darkgaia.txt?dl=0
- 12500 words from wikipedia: https://www.dropbox.com/s/lrj7i639d3g7dhn/12500words.txt?dl=0
Expanding vocabulary
Coverage targets
| Date | Target: 5925 corpus | Target: GCI corpus | Target: Wiki 12500 words | Target: Akorda | Achieved: 5925 corpus | Achieved: GCI corpus | Achieved: Wiki 12500 words | Achieved: Akorda | Target achieved | Stems | Notes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23-04-2016 | 85.70% | 93.32% | 85.74% | 83.21% | 86.85% | 93.85% | 85.74% | 83.21% | yes | 21613 | Initial value | |
| 30-04-2016 | 86.00% | 93.80% | 86.80% | 83.50% | 88.22% | 94.56% | 88.14% | 86.48% | yes | 21923 | ||
| 07-05-2016 | 86.50% | 94.30% | 87.50% | 84.00% | 89.36% | 94.86% | 89.30% | 88.12% | yes | 22242 | ||
| 14-05-2016 | 87.00% | 95.00% | 87.70% | 84.50% | 89.84% | 95.08% | 90.71% | 90.54% | yes | 22708 | ||
| 21-05-2016 | 87.50% | 96.00% | 88.00% | 85.00% | 89.87% | 96.65% | 90.79% | 90.82% | yes | 23232 | Official GSoC start date | |
| 23-05-2016 | 87.70% | 96.50% | 88.00% | 85.00% | 89.872% | 96.65% | 90.794% | 90.836% | yes | 23238 | ||
| 1-06-2016 | 88.00% | 97.00% | 88.30% | 85.50% | 90.11% | 97.24% | 90.83% | 90.85% | yes | 23308 | ||
| 10-06-2016 | 88.30% | 97.50% | 88.50% | 85.70% | 90.26% | 98.95% | 91.00% | 92.77% | yes | 23497 | ||
| 16-06-2016 | 88.50% | 98.00% | 88.70% | 86.00% | 90.37% | 98.95% | 91.12% | 92.94% | yes | 23593 | ||
| 27-06-2016 | 89.00% | 98.50% | 89.00% | 86.50% | | | | | | | Midterm evaluation | |
| 02-07-2016 | 89.30% | 99.00% | 89.30% | 86.80% | 90.46% | 99.03% | 91.12% | 92.98% | yes | 23633 | ||
| 09-07-2016 | 89.70% | 99.40% | 89.70% | 87.00% | 90.48% | 99.40% | 91.33% | 92.99% | yes | 24139 | ||
| 16-07-2016 | 90.00% | 99.40% | 90.00% | 87.30% | 90.74% | 99.62% | 91.55% | 93.20% | yes | 24376 | ||
| 23-07-2016 | 90.50% | 99.50% | 90.30% | 87.70% | ||||||||
| 30-07-2016 | 90.70% | 99.60% | 90.70% | 88.00% | 90.90% | 99.62% | 92.40% | 93.55% | yes | 26766 | ||
| 06-08-2016 | 91.00% | 99.70% | 91.00% | 88.50% | 91.15% | 99.70% | 93.13% | 93.85% | yes | 28257 | ||
| 13-08-2016 | 91.50% | 99.80% | 91.50% | 89.00% | ||||||||
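
The percentages in the table read as naive coverage, that is, the share of corpus tokens for which the analyser returns at least one analysis. The page does not say how they were computed, so the following is only a minimal sketch. It assumes the analysed corpus is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token is given an analysis that starts with `*`. The function name `naive_coverage` and the sample string (including its tags) are illustrative and not taken from the project, and escaped characters inside a token are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that received a known analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # The analyser emits unknown words as ^word/*word$.
        if not analyses.startswith("*"):
            known += 1
    if total == 0:
        raise ValueError("no lexical units found in the input")
    return 100.0 * known / total


if __name__ == "__main__":
    # Hypothetical analyser output: three known tokens and one unknown token.
    sample = "^бұл/бұл<prn><dem><nom>$ ^сөз/сөз<n><nom>$ ^xyz/*xyz$ ^./.<sent>$"
    print(f"{naive_coverage(sample):.2f}%")
```

Comparing the returned value with the weekly target for the same corpus would give the "Target achieved" column.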
Midterm evaluation
WER% Before:
Statistics about input files
- Number of words in reference: 800
- Number of words in test: 572
- Number of unknown words (marked with a star) in test: 46
- Percentage of unknown words: 8.04 %
Results when removing unknown-word marks (stars)
- Edit distance: 751
- Word error rate (WER): 93.88 %
- Number of position-independent correct words: 110
- Position-independent word error rate (PER): 86.25 %
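
The reported percentages can be checked from the raw counts. The sketch below assumes that WER is the word-level edit distance divided by the number of reference words, and that PER is the share of reference words not matched position-independently. Both assumptions agree with the figures above, but the evaluation tool's exact definitions are not stated on this page. The helper names `word_edit_distance` and `percentage` are illustrative.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance counted in words rather than characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # drop a reference word
                current[j - 1] + 1,      # add a hypothesis word
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Counts copied from the midterm evaluation above.
    reference_words = 800
    test_words = 572
    unknown_words = 46
    edit_distance = 751
    pos_independent_correct = 110

    print(f"Unknown words: {percentage(unknown_words, test_words):.2f} %")
    print(f"WER: {percentage(edit_distance, reference_words):.2f} %")
    per = percentage(reference_words - pos_independent_correct, reference_words)
    print(f"PER: {per:.2f} %")
```

`word_edit_distance` is included only to show where an edit-distance count like 751 would come from. It takes two tokenised sentences or documents as lists of words.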
Report of GSoC project
List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html
- The bilingual dictionary was expanded from 21613 to 28757 stems.
- Transfer rules:
  - Chunker rules: 176
  - Interchunk rules: 55
- The coverage table for the four corpora appears above: http://wiki.apertium.org/w/index.php?title=User:Aidana/Proposal/Working_plan#Expanding_vocabulary
- Testvoc:
| POS | Total | Clean | With @ | With # | Clean % | 
|---|---|---|---|---|---|
| v | 49395 | 49395 | 0 | 0 | 100 | 
| cop | 48464 | 48464 | 0 | 0 | 100 | 
| adj | 20197 | 20197 | 0 | 0 | 100 | 
| n | 12512 | 12512 | 0 | 0 | 100 | 
| prn | 10873 | 10873 | 0 | 0 | 100 | 
| det | 2248 | 2248 | 0 | 0 | 100 | 
| cnjcoo | 1389 | 1389 | 0 | 0 | 100 | 
| vaux | 808 | 808 | 0 | 0 | 100 | 
| post | 464 | 464 | 0 | 0 | 100 | 
| np | 155 | 155 | 0 | 0 | 100 | 
| adv | 45 | 45 | 0 | 0 | 100 | 
| num | 38 | 38 | 0 | 0 | 100 | 
| guio | 1 | 1 | 0 | 0 | 100 | 
| cm | 1 | 1 | 0 | 0 | 100 | 
| ij | 0 | 0 | 0 | 0 | 100 | 
| cnjsub | 0 | 0 | 0 | 0 | 100 | 
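
Each row of the testvoc table can be derived from the translated forms of one part-of-speech category. As a rough sketch, assume one output line per form, with `@` marking a lemma missing from the bilingual dictionary and `#` marking a form the generator could not produce. The `summarise` helper and the sample lines are hypothetical, and the actual testvoc script may count differently. An empty category is reported as 100 % clean, matching the `ij` and `cnjsub` rows.

```python
def summarise(lines: list[str]) -> dict[str, float]:
    """Tally one testvoc category: total, clean, with @, with #, clean %."""
    total = len(lines)
    with_at = sum(1 for line in lines if "@" in line)
    with_hash = sum(1 for line in lines if "#" in line)
    # A line is clean only if it carries neither error mark.
    clean = sum(1 for line in lines if "@" not in line and "#" not in line)
    clean_pct = 100.0 if total == 0 else 100.0 * clean / total
    return {
        "total": total,
        "clean": clean,
        "with_at": with_at,
        "with_hash": with_hash,
        "clean_pct": clean_pct,
    }


if __name__ == "__main__":
    # Hypothetical output lines for one category.
    sample = ["house", "@үй", "#house", "houses"]
    print(summarise(sample))
```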

