User:Aidana/Proposal/Working plan


Corpora

Downloads

Expanding vocabulary

Coverage targets

            Target                              Achieved
Date        5925     GCI      Wiki     Akorda   5925     GCI      Wiki     Akorda   Target    Stems  Notes
            corpus   corpus   12500             corpus   corpus   12500             achieved
                              words                               words
23-04-2016  85.70%   93.32%   85.74%   83.21%   86.85%   93.85%   85.74%   83.21%   yes       21613  Initial value
30-04-2016  86.00%   93.80%   86.80%   83.50%   88.22%   94.56%   88.14%   86.48%   yes       21923
07-05-2016  86.50%   94.30%   87.50%   84.00%   89.36%   94.86%   89.30%   88.12%   yes       22242
14-05-2016  87.00%   95.00%   87.70%   84.50%   89.84%   95.08%   90.71%   90.54%   yes       22708
21-05-2016  87.50%   96.00%   88.00%   85.00%   89.87%   96.65%   90.79%   90.82%   yes       23232  Official GSoC start date
23-05-2016  87.70%   96.50%   88.00%   85.00%   89.872%  96.65%   90.794%  90.836%  yes       23238
01-06-2016  88.00%   97.00%   88.30%   85.50%   90.11%   97.24%   90.83%   90.85%   yes       23308
10-06-2016  88.30%   97.50%   88.50%   85.70%   90.26%   98.95%   91.00%   92.77%   yes       23497
16-06-2016  88.50%   98.00%   88.70%   86.00%   90.37%   98.95%   91.12%   92.94%   yes       23593
27-06-2016  89.00%   98.50%   89.00%   86.50%                                                        Midterm evaluation
02-07-2016  89.30%   99.00%   89.30%   86.80%   90.46%   99.03%   91.12%   92.98%   yes       23633
09-07-2016  89.70%   99.40%   89.70%   87.00%   90.48%   99.40%   91.33%   92.99%   yes       24139
16-07-2016  90.00%   99.40%   90.00%   87.30%   90.74%   99.62%   91.55%   93.20%   yes       24376
23-07-2016  90.50%   99.50%   90.30%   87.70%
30-07-2016  90.70%   99.60%   90.70%   88.00%   90.90%   99.62%   92.40%   93.55%   yes       26766
06-08-2016  91.00%   99.70%   91.00%   88.50%   91.15%   99.70%   93.13%   93.85%   yes       28257
13-08-2016  91.50%   99.80%   91.50%   89.00%
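
The coverage figures in this table can be read as naive coverage: the share of corpus tokens that receive at least one analysis. The sketch below shows one way such a figure might be computed, assuming the analyser output is in the Apertium stream format, where each lexical unit looks like ^surface/analysis$ and an unknown word is marked with a leading asterisk. The function name and the sample analyses are hypothetical, escaped characters are ignored for brevity, and this is not necessarily the script used to produce the numbers above.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \^ or \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the percentage of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words appear as ^word/*word$, so their analysis starts with "*".
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Illustrative input only: three analysed units and one unknown word.
sample = "^бұл/бұл<prn><dem><nom>$ ^үй/үй<n><nom>$ ^ххх/*ххх$^./.<sent>$"
print(f"{naive_coverage(sample):.2f}%")
```

Whether punctuation counts as a token changes the result slightly, so any comparison against the targets should use the same tokenisation throughout.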


Midterm evaluation

WER% Before:

Statistics about input files


Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)


Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %
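
The percentages reported above follow from the raw counts. The sketch below assumes the usual definitions: WER is the word-level edit distance divided by the number of reference words, and PER is the number of reference words not matched position-independently, divided by the number of reference words. These definitions reproduce the reported figures, but the function names are hypothetical and this is not the evaluation tool's own code.

```python
def word_edit_distance(reference, test):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, test_word in enumerate(test, start=1):
            cost = 0 if ref_word == test_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(edit_distance, reference_len):
    """Word error rate in percent."""
    return 100.0 * edit_distance / reference_len


def per(correct_words, reference_len):
    """Position-independent error rate in percent (assumed definition)."""
    return 100.0 * (reference_len - correct_words) / reference_len


def unknown_rate(unknown_words, test_len):
    """Percentage of unknown words in the test output."""
    return 100.0 * unknown_words / test_len


# Counts taken from the statistics reported above.
print(f"Unknown words: {unknown_rate(46, 572):.2f} %")
print(f"WER: {wer(751, 800):.2f} %")
print(f"PER: {per(110, 800):.2f} %")
```

An edit distance of 751 against 800 reference words means almost every reference word needed an edit, which is partly explained by the test output being much shorter (572 words) than the reference.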


Report of GSoC project

List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html

  • The bilingual dictionary was expanded from 21613 to 28757 stems
  • Transfer rules:
    • Chunker rules: 176
    • Interchunk rules: 55

POS      Total   Clean   With @  With #  Clean %
v        49395   49395   0       0       100
cop      48464   48464   0       0       100
adj      20197   20197   0       0       100
n        12512   12512   0       0       100
prn      10873   10873   0       0       100
det      2248    2248    0       0       100
cnjcoo   1389    1389    0       0       100
vaux     808     808     0       0       100
post     464     464     0       0       100
np       155     155     0       0       100
adv      45      45      0       0       100
num      38      38      0       0       100
guio     1       1       0       0       100
cm       1       1       0       0       100
ij       0       0       0       0       100
cnjsub   0       0       0       0       100