User:Aidana/Proposal/Working plan
Latest revision as of 18:28, 16 August 2016
Corpora
Downloads
- Bitextor: https://www.dropbox.com/s/ajy53y55toh0n40/corpus%20Lab%20IIS%20%285925%29.kz?dl=0
- Akorda corpus: https://www.dropbox.com/s/o2d2fdtktdlsvt7/akorda-kaz-8651.txt?dl=0
- GCI kaz corpus: https://www.dropbox.com/s/dqz01r0wzw2u2op/1452262000_kaz.darkgaia.txt?dl=0
- 12500 words from Wikipedia: https://www.dropbox.com/s/lrj7i639d3g7dhn/12500words.txt?dl=0
Expanding vocabulary
Coverage targets
Date | Target: 5925 corpus | Target: GCI corpus | Target: Wiki 12500 words | Target: Akorda | Achieved: 5925 corpus | Achieved: GCI corpus | Achieved: Wiki 12500 words | Achieved: Akorda | Target achieved | Stems | Notes
---|---|---|---|---|---|---|---|---|---|---|---
23-04-2016 | 85.70% | 93.32% | 85.74% | 83.21% | 86.85% | 93.85% | 85.74% | 83.21% | yes | 21613 | Initial value
30-04-2016 | 86.00% | 93.80% | 86.80% | 83.50% | 88.22% | 94.56% | 88.14% | 86.48% | yes | 21923 |
07-05-2016 | 86.50% | 94.30% | 87.50% | 84.00% | 89.36% | 94.86% | 89.30% | 88.12% | yes | 22242 |
14-05-2016 | 87.00% | 95.00% | 87.70% | 84.50% | 89.84% | 95.08% | 90.71% | 90.54% | yes | 22708 |
21-05-2016 | 87.50% | 96.00% | 88.00% | 85.00% | 89.87% | 96.65% | 90.79% | 90.82% | yes | 23232 | Official GSoC start date
23-05-2016 | 87.70% | 96.50% | 88.00% | 85.00% | 89.872% | 96.65% | 90.794% | 90.836% | yes | 23238 |
01-06-2016 | 88.00% | 97.00% | 88.30% | 85.50% | 90.11% | 97.24% | 90.83% | 90.85% | yes | 23308 |
10-06-2016 | 88.30% | 97.50% | 88.50% | 85.70% | 90.26% | 98.95% | 91.00% | 92.77% | yes | 23497 |
16-06-2016 | 88.50% | 98.00% | 88.70% | 86.00% | 90.37% | 98.95% | 91.12% | 92.94% | yes | 23593 |
27-06-2016 | 89.00% | 98.50% | 89.00% | 86.50% | | | | | | | Midterm evaluation
02-07-2016 | 89.30% | 99.00% | 89.30% | 86.80% | 90.46% | 99.03% | 91.12% | 92.98% | yes | 23633 |
09-07-2016 | 89.70% | 99.40% | 89.70% | 87.00% | 90.48% | 99.40% | 91.33% | 92.99% | yes | 24139 |
16-07-2016 | 90.00% | 99.40% | 90.00% | 87.30% | 90.74% | 99.62% | 91.55% | 93.20% | yes | 24376 |
23-07-2016 | 90.50% | 99.50% | 90.30% | 87.70% | | | | | | | |
30-07-2016 | 90.70% | 99.60% | 90.70% | 88.00% | 90.90% | 99.62% | 92.40% | 93.55% | yes | 26766 |
06-08-2016 | 91.00% | 99.70% | 91.00% | 88.50% | 91.15% | 99.70% | 93.13% | 93.85% | yes | 28257 |
13-08-2016 | 91.50% | 99.80% | 91.50% | 89.00% | 91.55% | 99.80% | 93.21% | 93.87% | yes | 29086 |
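The page does not say how the coverage percentages were measured. A common approach in Apertium work is to run the corpus through the morphological analyser and count the share of tokens that receive at least one analysis. The sketch below assumes analyser output in the Apertium stream format, where an unknown word appears as `^word/*word$`. The function name `coverage`, the sample string, and the tags in it are illustrative assumptions, not taken from the project. Escaped characters in the stream are not handled, and punctuation is counted as a token here, which may differ from the script actually used.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis[/analysis...]$
# Simplification: escaped characters (e.g. \/ or \$) are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units with at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # The analyser marks an unknown word by prefixing it with '*'.
        if not match.group(2).startswith("*"):
            known += 1
    if total == 0:
        return 0.0
    return 100.0 * known / total


# Hypothetical analyser output: three known units and one unknown unit.
sample = (
    "^үй/үй<n><nom>$ ^blah/*blah$ "
    "^бар/бар<adj>/бар<v><iv><imp><p2><sg>$ ^./.<sent>$"
)
print(f"{coverage(sample):.2f}%")
```

Running the same function over each of the four corpora after every dictionary update would give one row of the "Achieved" columns, under the assumptions stated above.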
Midterm evaluation
WER% Before:

Statistics about input files
-------------------------------------------------------
Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %
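The percentages in this report follow from the raw counts. The sketch below recomputes them using the usual definitions: WER is the word-level edit distance divided by the reference length, and PER subtracts the position-independent correct words (less any surplus test words) from the reference length. These formulas are an assumption about how the evaluation tool works; they do reproduce the figures reported here. The function names are illustrative.

```python
def word_edit_distance(reference, test):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, test_word in enumerate(test, start=1):
            substitution = previous[j - 1] + (ref_word != test_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(edit_distance, reference_words):
    """Word error rate as a percentage of the reference length."""
    return 100.0 * edit_distance / reference_words


def per(correct, reference_words, test_words):
    """Position-independent error rate as a percentage (assumed definition)."""
    surplus = max(0, test_words - reference_words)
    return 100.0 * (1 - (correct - surplus) / reference_words)


# Counts copied from the midterm report.
print(f"Unknown words: {100.0 * 46 / 572:.2f} %")
print(f"WER: {wer(751, 800):.3f} %")
print(f"PER: {per(110, 800, 572):.2f} %")
```

Note that the test translation has 572 words against 800 in the reference, so at least 228 of the 751 edits are insertions needed just to match the reference length.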
Report of GSoC project
List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html
- The bilingual dictionary was expanded from 21613 to 29086 stems.
- Transfer rules:
  - Chunker rules: 176
  - Interchunk rules: 55
- The coverage table for the four corpora appears above: http://wiki.apertium.org/w/index.php?title=User:Aidana/Proposal/Working_plan#Expanding_vocabulary
- Testvoc:
POS | Total | Clean | With @ | With # | Clean % |
---|---|---|---|---|---|
v | 49395 | 49395 | 0 | 0 | 100 |
cop | 48464 | 48464 | 0 | 0 | 100 |
adj | 20197 | 20197 | 0 | 0 | 100 |
n | 12512 | 12512 | 0 | 0 | 100 |
prn | 10873 | 10873 | 0 | 0 | 100 |
det | 2248 | 2248 | 0 | 0 | 100 |
cnjcoo | 1389 | 1389 | 0 | 0 | 100 |
vaux | 808 | 808 | 0 | 0 | 100 |
post | 464 | 464 | 0 | 0 | 100 |
np | 155 | 155 | 0 | 0 | 100 |
adv | 45 | 45 | 0 | 0 | 100 |
num | 38 | 38 | 0 | 0 | 100 |
guio | 1 | 1 | 0 | 0 | 100 |
cm | 1 | 1 | 0 | 0 | 100 |
ij | 0 | 0 | 0 | 0 | 100 |
cnjsub | 0 | 0 | 0 | 0 | 100 |