Difference between revisions of "User:Aidana/Proposal/Working plan"

From Apertium
 
(36 intermediate revisions by 3 users not shown)
Line 4: Line 4:


* Bitextor: https://www.dropbox.com/s/ajy53y55toh0n40/corpus%20Lab%20IIS%20%285925%29.kz?dl=0
* Akorda corpus: https://www.dropbox.com/s/o2d2fdtktdlsvt7/akorda-kaz-8651.txt?dl=0
* GCI kaz corpus: https://www.dropbox.com/s/dqz01r0wzw2u2op/1452262000_kaz.darkgaia.txt?dl=0
* 12500 words from Wikipedia: https://www.dropbox.com/s/lrj7i639d3g7dhn/12500words.txt?dl=0


===Expanding vocabulary===




==Coverage targets==
Line 15: Line 17:
!rowspan=2| Date ||colspan=4| Target || ||colspan=4| Achieved ||rowspan=2| Target<br/>achieved ||rowspan=2| Stems ||rowspan=2| Notes
|-
! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda !! !! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda
|-
| 23-04-2016 || 85.70% || 93.32% || 85.74% || 83.21% || || 86.85% || 93.85% || 85.74% || 83.21% ||align=center| yes || 21613 || Initial value
|-
| 30-04-2016 || 86.00% || 93.80% || 86.80% || 83.50% || || 88.22% || 94.56% || 88.14% || 86.48% ||align=center| yes || 21923 ||
|-
| 07-05-2016 || 86.50% || 94.30% || 87.50% || 84.00% || || 89.36% || 94.86% || 89.30% || 88.12% ||align=center| yes || 22242 ||
|-
| 14-05-2016 || 87.00% || 95.00% || 87.70% || 84.50% || || 89.84% || 95.08% || 90.71% || 90.54% ||align=center| yes || 22708 ||
|-
| 21-05-2016 || 87.50% || 96.00% || 88.00% || 85.00% || || 89.87% || 96.65% || 90.79% || 90.82% ||align=center| yes || 23232 || Official GSoC start date
|-
| 23-05-2016 || 87.70% || 96.50% || 88.00% || 85.00% || || 89.872% || 96.65% || 90.794% || 90.836% ||align=center| yes || 23238 ||
|-
| 01-06-2016 || 88.00% || 97.00% || 88.30% || 85.50% || || 90.11% || 97.24% || 90.83% || 90.85% ||align=center| yes || 23308 ||
|-
| 10-06-2016 || 88.30% || 97.50% || 88.50% || 85.70% || || 90.26% || 98.95% || 91.00% || 92.77% ||align=center| yes || 23497 ||
|-
| 16-06-2016 || 88.50% || 98.00% || 88.70% || 86.00% || || 90.37% || 98.95% || 91.12% || 92.94% ||align=center| yes || 23593 ||
|-
| 27-06-2016 || 89.00% || 98.50% || 89.00% || 86.50% || || || || || || || || Midterm evaluation
|-
| 02-07-2016 || 89.30% || 99.00% || 89.30% || 86.80% || || 90.46% || 99.03% || 91.12% || 92.98% ||align=center| yes || 23633 ||
|-
| 09-07-2016 || 89.70% || 99.40% || 89.70% || 87.00% || || 90.48% || 99.40% || 91.33% || 92.99% ||align=center| yes || 24139 ||
|-
| 16-07-2016 || 90.00% || 99.40% || 90.00% || 87.30% || || 90.74% || 99.62% || 91.55% || 93.20% ||align=center| yes || 24376 ||
|-
| 23-07-2016 || 90.50% || 99.50% || 90.30% || 87.70% || || || || || || || ||
|-
| 30-07-2016 || 90.70% || 99.60% || 90.70% || 88.00% || || 90.90% || 99.62% || 92.40% || 93.55% ||align=center| yes || 26766 ||
|-
| 06-08-2016 || 91.00% || 99.70% || 91.00% || 88.50% || || 91.15% || 99.70% || 93.13% || 93.85% ||align=center| yes || 28257 ||
|-
| 13-08-2016 || 91.50% || 99.80% || 91.50% || 89.00% || || 91.55% || 99.80% || 93.21% || 93.87% ||align=center| yes || 29086 ||
|}


===Midterm evaluation===

<pre>
WER% Before:

Statistics about input files
-------------------------------------------------------
Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %
</pre>
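
The percentages in this report can be reproduced from the raw counts. The following is a minimal sketch, assuming that WER is the edit distance divided by the number of reference words, and that PER is the share of reference words without a position-independent match (this simplification is only meant for the case shown here, where the test is shorter than the reference). The helper name <code>percentage</code> and the constant names are hypothetical and are not taken from the evaluation tool.

```python
def percentage(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be a positive count")
    return 100.0 * part / whole


# Counts copied from the report above.
REFERENCE_WORDS = 800
TEST_WORDS = 572
UNKNOWN_WORDS = 46
EDIT_DISTANCE = 751
POSITION_INDEPENDENT_CORRECT = 110

# Assumed definitions (see the lead-in): both rates are normalised
# by the length of the reference translation.
wer = percentage(EDIT_DISTANCE, REFERENCE_WORDS)
per = percentage(REFERENCE_WORDS - POSITION_INDEPENDENT_CORRECT, REFERENCE_WORDS)

# The unknown-word rate is relative to the test (machine-translated) text.
unknown_rate = percentage(UNKNOWN_WORDS, TEST_WORDS)

print(f"WER: {wer:.2f} %")
print(f"PER: {per:.2f} %")
print(f"Unknown words: {unknown_rate:.2f} %")
```

Under these assumptions the three values agree with the reported 93.88 %, 86.25 % and 8.04 % after rounding to two decimals.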

===Report of GSoC project===

List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html

* The bilingual dictionary was expanded from 21613 to 29086 stems
* Transfer rules:
** Chunker rules: 176
** Interchunk rules: 55

* The coverage table for the 4 corpora can be seen above: http://wiki.apertium.org/w/index.php?title=User:Aidana/Proposal/Working_plan#Expanding_vocabulary
* Testvoc:

{|class=wikitable
! POS !! Total !! Clean !! With<br/>@ !! With<br/># !! Clean<br/>%
|-
|v||49395||49395||0||0||100
|-
|cop||48464||48464||0||0||100
|-
|adj||20197||20197||0||0||100
|-
|n||12512||12512||0||0||100
|-
|prn||10873||10873||0||0||100
|-
|det||2248||2248||0||0||100
|-
|cnjcoo||1389||1389||0||0||100
|-
|vaux||808||808||0||0||100
|-
|post||464||464||0||0||100
|-
|np||155||155||0||0||100
|-
|adv||45||45||0||0||100
|-
|num||38||38||0||0||100
|-
|guio||1||1||0||0||100
|-
|cm||1||1||0||0||100
|-
|ij||0||0||0||0||100
|-
|cnjsub||0||0||0||0||100
|}
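
The "Clean %" column of the testvoc table follows from the "Total" and "Clean" columns. A minimal sketch of that arithmetic is given below, assuming that a category with no forms at all (such as ij and cnjsub) is reported as 100 % clean, which is what the table shows. The function name <code>clean_percentage</code> is hypothetical and does not come from the testvoc scripts.

```python
def clean_percentage(total: int, clean: int) -> float:
    """Share of generated forms that came out without @ or # marks."""
    if total < 0 or clean < 0 or clean > total:
        raise ValueError("expected 0 <= clean <= total")
    if total == 0:
        # Assumption: an empty category counts as fully clean,
        # matching the rows whose Total is 0 in the table.
        return 100.0
    return 100.0 * clean / total


# A few (POS, Total, Clean) rows copied from the testvoc table.
rows = [("v", 49395, 49395), ("cop", 48464, 48464), ("ij", 0, 0)]

for pos, total, clean in rows:
    print(f"{pos}: {clean_percentage(total, clean):.0f}")
```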

Latest revision as of 18:28, 16 August 2016

Corpora

Downloads

Expanding vocabulary

Coverage targets

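{|class=wikitable
!rowspan=2| Date ||colspan=4| Target || ||colspan=4| Achieved ||rowspan=2| Target<br/>achieved ||rowspan=2| Stems ||rowspan=2| Notes
|-
! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda !! !! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda
|-
| 23-04-2016 || 85.70% || 93.32% || 85.74% || 83.21% || || 86.85% || 93.85% || 85.74% || 83.21% ||align=center| yes || 21613 || Initial value
|-
| 30-04-2016 || 86.00% || 93.80% || 86.80% || 83.50% || || 88.22% || 94.56% || 88.14% || 86.48% ||align=center| yes || 21923 ||
|-
| 07-05-2016 || 86.50% || 94.30% || 87.50% || 84.00% || || 89.36% || 94.86% || 89.30% || 88.12% ||align=center| yes || 22242 ||
|-
| 14-05-2016 || 87.00% || 95.00% || 87.70% || 84.50% || || 89.84% || 95.08% || 90.71% || 90.54% ||align=center| yes || 22708 ||
|-
| 21-05-2016 || 87.50% || 96.00% || 88.00% || 85.00% || || 89.87% || 96.65% || 90.79% || 90.82% ||align=center| yes || 23232 || Official GSoC start date
|-
| 23-05-2016 || 87.70% || 96.50% || 88.00% || 85.00% || || 89.872% || 96.65% || 90.794% || 90.836% ||align=center| yes || 23238 ||
|-
| 01-06-2016 || 88.00% || 97.00% || 88.30% || 85.50% || || 90.11% || 97.24% || 90.83% || 90.85% ||align=center| yes || 23308 ||
|-
| 10-06-2016 || 88.30% || 97.50% || 88.50% || 85.70% || || 90.26% || 98.95% || 91.00% || 92.77% ||align=center| yes || 23497 ||
|-
| 16-06-2016 || 88.50% || 98.00% || 88.70% || 86.00% || || 90.37% || 98.95% || 91.12% || 92.94% ||align=center| yes || 23593 ||
|-
| 27-06-2016 || 89.00% || 98.50% || 89.00% || 86.50% || || || || || || || || Midterm evaluation
|-
| 02-07-2016 || 89.30% || 99.00% || 89.30% || 86.80% || || 90.46% || 99.03% || 91.12% || 92.98% ||align=center| yes || 23633 ||
|-
| 09-07-2016 || 89.70% || 99.40% || 89.70% || 87.00% || || 90.48% || 99.40% || 91.33% || 92.99% ||align=center| yes || 24139 ||
|-
| 16-07-2016 || 90.00% || 99.40% || 90.00% || 87.30% || || 90.74% || 99.62% || 91.55% || 93.20% ||align=center| yes || 24376 ||
|-
| 23-07-2016 || 90.50% || 99.50% || 90.30% || 87.70% || || || || || || || ||
|-
| 30-07-2016 || 90.70% || 99.60% || 90.70% || 88.00% || || 90.90% || 99.62% || 92.40% || 93.55% ||align=center| yes || 26766 ||
|-
| 06-08-2016 || 91.00% || 99.70% || 91.00% || 88.50% || || 91.15% || 99.70% || 93.13% || 93.85% ||align=center| yes || 28257 ||
|-
| 13-08-2016 || 91.50% || 99.80% || 91.50% || 89.00% || || 91.55% || 99.80% || 93.21% || 93.87% ||align=center| yes || 29086 ||
|}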


Midterm evaluation

WER% Before:

Statistics about input files
-------------------------------------------------------
Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %

Report of GSoC project

List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html

  • The bilingual dictionary was expanded from 21613 to 29086 stems
  • Transfer rules:
    • Chunker rules: 176
    • Interchunk rules: 55
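{|class=wikitable
! POS !! Total !! Clean !! With<br/>@ !! With<br/># !! Clean<br/>%
|-
|v||49395||49395||0||0||100
|-
|cop||48464||48464||0||0||100
|-
|adj||20197||20197||0||0||100
|-
|n||12512||12512||0||0||100
|-
|prn||10873||10873||0||0||100
|-
|det||2248||2248||0||0||100
|-
|cnjcoo||1389||1389||0||0||100
|-
|vaux||808||808||0||0||100
|-
|post||464||464||0||0||100
|-
|np||155||155||0||0||100
|-
|adv||45||45||0||0||100
|-
|num||38||38||0||0||100
|-
|guio||1||1||0||0||100
|-
|cm||1||1||0||0||100
|-
|ij||0||0||0||0||100
|-
|cnjsub||0||0||0||0||100
|}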