User:Aidana/Proposal/Working plan


Corpora

Downloads

Expanding vocabulary

Coverage targets

Date | Target: 5925 corpus | Target: GCI corpus | Target: Wiki 12500 words | Target: Akorda | Achieved: 5925 corpus | Achieved: GCI corpus | Achieved: Wiki 12500 words | Achieved: Akorda | Target achieved | Stems | Notes
23-04-2016 | 85.70% | 93.32% | 85.74% | 83.21% | 86.85% | 93.85% | 85.74% | 83.21% | yes | 21613 | Initial value
30-04-2016 | 86.00% | 93.80% | 86.80% | 83.50% | 88.22% | 94.56% | 88.14% | 86.48% | yes | 21923 |
07-05-2016 | 86.50% | 94.30% | 87.50% | 84.00% | 89.36% | 94.86% | 89.30% | 88.12% | yes | 22242 |
14-05-2016 | 87.00% | 95.00% | 87.70% | 84.50% | 89.84% | 95.08% | 90.71% | 90.54% | yes | 22708 |
21-05-2016 | 87.50% | 96.00% | 88.00% | 85.00% | 89.87% | 96.65% | 90.79% | 90.82% | yes | 23232 | Official GSoC start date
23-05-2016 | 87.70% | 96.50% | 88.00% | 85.00% | 89.872% | 96.65% | 90.794% | 90.836% | yes | 23238 |
01-06-2016 | 88.00% | 97.00% | 88.30% | 85.50% | 90.11% | 97.24% | 90.83% | 90.85% | yes | 23308 |
10-06-2016 | 88.30% | 97.50% | 88.50% | 85.70% | 90.26% | 98.95% | 91.00% | 92.77% | yes | 23497 |
16-06-2016 | 88.50% | 98.00% | 88.70% | 86.00% | 90.37% | 98.95% | 91.12% | 92.94% | yes | 23593 |
27-06-2016 | 89.00% | 98.50% | 89.00% | 86.50% | | | | | | | Midterm evaluation
02-07-2016 | 89.30% | 99.00% | 89.30% | 86.80% | 90.46% | 99.03% | 91.12% | 92.98% | yes | 23633 |
09-07-2016 | 89.70% | 99.40% | 89.70% | 87.00% | 90.48% | 99.40% | 91.33% | 92.99% | yes | 24139 |
16-07-2016 | 90.00% | 99.40% | 90.00% | 87.30% | 90.74% | 99.62% | 91.55% | 93.20% | yes | 24376 |
23-07-2016 | 90.50% | 99.50% | 90.30% | 87.70% | | | | | | |
30-07-2016 | 90.70% | 99.60% | 90.70% | 88.00% | 90.90% | 99.62% | 92.40% | 93.55% | yes | 26766 |
06-08-2016 | 91.00% | 99.70% | 91.00% | 88.50% | 91.15% | 99.70% | 93.13% | 93.85% | yes | 28257 |
13-08-2016 | 91.50% | 99.80% | 91.50% | 89.00% | 91.55% | 99.80% | 93.21% | 93.87% | yes | 29086 |
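
The coverage figures above are, in the usual Apertium sense, the share of corpus tokens that receive at least one analysis. The following is a minimal sketch of that calculation, not the script actually used for this plan. It assumes the analyser output is in the Apertium stream format, where each token is written as `^surface/analysis$` and an unknown token's analysis begins with `*`. The function name `naive_coverage` and the sample string are illustrative, and escaped characters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown token is emitted as ^word/*word$, so its analysis starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)

# Illustrative input: two analysed tokens and two unknown ones.
sample = "^бір/бір<num>/бір<det><ind>$ ^foo/*foo$ ^үй/үй<n><nom>$ ^bar/*bar$"
print(f"{naive_coverage(sample):.2f}%")
```

Running the analyser over each corpus and passing its output to a function like this would give one percentage per corpus column in the table.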


Midterm evaluation

WER before:

Statistics about input files
-------------------------------------------------------
Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %
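
The percentages in this report can be rechecked from the counts it prints. The sketch below assumes that WER is the word-level edit distance divided by the number of reference words, and that PER is the number of reference words not matched position-independently divided by the number of reference words. Both assumptions reproduce the figures above but are not taken from the evaluation tool's source. The function names are illustrative, and `word_edit_distance` shows how such a distance could be computed for a pair of tokenised sentences.

```python
def word_edit_distance(reference: list[str], test: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, test_word in enumerate(test, start=1):
            cost = 0 if ref_word == test_word else 1
            current.append(min(previous[j] + 1,          # reference word missing from test
                               current[j - 1] + 1,       # extra word in test
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def percent(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole

# Counts copied from the report above.
reference_words = 800
test_words = 572
unknown_words = 46
edit_distance = 751
position_independent_correct = 110

unknown_rate = percent(unknown_words, test_words)
wer = percent(edit_distance, reference_words)
per = percent(reference_words - position_independent_correct, reference_words)

print(f"unknown: {unknown_rate:.2f} %  WER: {wer:.2f} %  PER: {per:.2f} %")
```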

Report of GSoC project

List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html

  • The bilingual dictionary was expanded from 21613 to 29086 stems
  • Transfer rules:
    • Chunker rules: 176
    • Interchunk rules: 55
POS | Total | Clean | With @ | With # | Clean %
v | 49395 | 49395 | 0 | 0 | 100
cop | 48464 | 48464 | 0 | 0 | 100
adj | 20197 | 20197 | 0 | 0 | 100
n | 12512 | 12512 | 0 | 0 | 100
prn | 10873 | 10873 | 0 | 0 | 100
det | 2248 | 2248 | 0 | 0 | 100
cnjcoo | 1389 | 1389 | 0 | 0 | 100
vaux | 808 | 808 | 0 | 0 | 100
post | 464 | 464 | 0 | 0 | 100
np | 155 | 155 | 0 | 0 | 100
adv | 45 | 45 | 0 | 0 | 100
num | 38 | 38 | 0 | 0 | 100
guio | 1 | 1 | 0 | 0 | 100
cm | 1 | 1 | 0 | 0 | 100
ij | 0 | 0 | 0 | 0 | 100
cnjsub | 0 | 0 | 0 | 0 | 100
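
The "Clean %" column can be recomputed from the "Total" and "Clean" counts. The sketch below assumes that "Clean" counts the forms whose output carries neither an `@` nor a `#` marker, and that a category with no forms is reported as 100, as the `ij` and `cnjsub` rows suggest. The function name `clean_percent` is illustrative.

```python
def clean_percent(total: int, clean: int) -> float:
    """Share of clean forms; an empty category counts as 100, as in the table."""
    if total == 0:
        return 100.0
    return 100.0 * clean / total

# A few rows copied from the table above: POS -> (total, clean).
rows = {"v": (49395, 49395), "np": (155, 155), "ij": (0, 0)}
for pos, (total, clean) in rows.items():
    print(pos, clean_percent(total, clean))
```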