Difference between revisions of "User:Aidana/Proposal/Working plan"

From Apertium
(11 intermediate revisions by 3 users not shown)
Line 49: Line 49:
  | 30-07-2016 || 90.70% || 99.60% || 90.70% || 88.00% || || 90.90% || 99.62% || 92.40% || 93.55% || align=center|yes || 26766 ||
  |-
− | 06-08-2016 || 91.00% || 99.70% || 91.00% || 88.50% || || || || || || || ||
+ | 06-08-2016 || 91.00% || 99.70% || 91.00% || 88.50% || || 91.15% || 99.70% || 93.13% ||93.85% || align=center|yes || 28257 ||
  |-
− | 13-08-2016 || 91.50% || 99.80% || 91.50% || 89.00% || || || || || || || ||
+ | 13-08-2016 || 91.50% || 99.80% || 91.50% || 89.00% || || 91.55% || 99.80% || 93.21% || 93.87% || align=center|yes ||29086 ||
− |-
− | 23-08-2016 || 92.00% ||99.90% || 92.00% ||90.00% || || || || || || || || Final target
  |-
  |}
Line 60: Line 58:
  ===Midterm evaluation===
+ <pre>
  WER% Before:
Line 75: Line 74:
  Number of position-independent correct words: 110
  Position-independent word error rate (PER): 86.25 %
+ </pre>
+
+ ===Report of GSoC project===
+
+ List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html
+
+ * Bilingual dictionary was expanded from 21613 to 29086
+ * Transfer rules:
+ ** Chunker rules: 176
+ ** Interchunk rules: 55
+
+ *Table of coverage of 4 corpora can be seen above: http://wiki.apertium.org/w/index.php?title=User:Aidana/Proposal/Working_plan#Expanding_vocabulary
+ *Testvoc:
+
+ {|class=wikitable
+ !rowspan=1| POS || Total || Clean ||rowspan=1| With<br/>@||rowspan=1|With<br/>#||rowspan=1| Clean<br/>%
+ |-
+ |v||49395||49395||0||0||100
+ |-
+ |cop||48464||48464||0||0||100
+ |-
+ |adj||20197||20197||0||0||100
+ |-
+ |n||12512||12512||0||0||100
+ |-
+ |prn||10873||10873||0||0||100
+ |-
+ |det||2248||2248||0||0||100
+ |-
+ |cnjcoo||1389||1389||0||0||100
+ |-
+ |vaux||808||808||0||0||100
+ |-
+ |post||464||464||0||0||100
+ |-
+ |np||155||155||0||0||100
+ |-
+ |adv||45||45||0||0||100
+ |-
+ |num||38||38||0||0||100
+ |-
+ |guio||1||1||0||0||100
+ |-
+ |cm||1||1||0||0||100
+ |-
+ |ij||0||0||0||0||100
+ |-
+ |cnjsub||0||0||0||0||100
+ |}

Latest revision as of 18:28, 16 August 2016

Corpora

Downloads

Expanding vocabulary

Coverage targets

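{|class=wikitable
! rowspan=2| Date !! colspan=4| Target !! colspan=4| Achieved !! rowspan=2| Target<br/>achieved !! rowspan=2| Stems !! rowspan=2| Notes
|-
! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda !! 5925 corpus !! GCI corpus !! Wiki 12500 words !! Akorda
|-
| 23-04-2016 || 85.70% || 93.32% || 85.74% || 83.21% || 86.85% || 93.85% || 85.74% || 83.21% || yes || 21613 || Initial value
|-
| 30-04-2016 || 86.00% || 93.80% || 86.80% || 83.50% || 88.22% || 94.56% || 88.14% || 86.48% || yes || 21923 ||
|-
| 07-05-2016 || 86.50% || 94.30% || 87.50% || 84.00% || 89.36% || 94.86% || 89.30% || 88.12% || yes || 22242 ||
|-
| 14-05-2016 || 87.00% || 95.00% || 87.70% || 84.50% || 89.84% || 95.08% || 90.71% || 90.54% || yes || 22708 ||
|-
| 21-05-2016 || 87.50% || 96.00% || 88.00% || 85.00% || 89.87% || 96.65% || 90.79% || 90.82% || yes || 23232 || Official GSOC start date
|-
| 23-05-2016 || 87.70% || 96.50% || 88.00% || 85.00% || 89.872% || 96.65% || 90.794% || 90.836% || yes || 23238 ||
|-
| 01-06-2016 || 88.00% || 97.00% || 88.30% || 85.50% || 90.11% || 97.24% || 90.83% || 90.85% || yes || 23308 ||
|-
| 10-06-2016 || 88.30% || 97.50% || 88.50% || 85.70% || 90.26% || 98.95% || 91.00% || 92.77% || yes || 23497 ||
|-
| 16-06-2016 || 88.50% || 98.00% || 88.70% || 86.00% || 90.37% || 98.95% || 91.12% || 92.94% || yes || 23593 ||
|-
| 27-06-2016 || 89.00% || 98.50% || 89.00% || 86.50% || || || || || || || Midterm evaluation
|-
| 02-07-2016 || 89.30% || 99.00% || 89.30% || 86.80% || 90.46% || 99.03% || 91.12% || 92.98% || yes || 23633 ||
|-
| 09-07-2016 || 89.70% || 99.40% || 89.70% || 87.00% || 90.48% || 99.40% || 91.33% || 92.99% || yes || 24139 ||
|-
| 16-07-2016 || 90.00% || 99.40% || 90.00% || 87.30% || 90.74% || 99.62% || 91.55% || 93.20% || yes || 24376 ||
|-
| 23-07-2016 || 90.50% || 99.50% || 90.30% || 87.70% || || || || || || ||
|-
| 30-07-2016 || 90.70% || 99.60% || 90.70% || 88.00% || 90.90% || 99.62% || 92.40% || 93.55% || yes || 26766 ||
|-
| 06-08-2016 || 91.00% || 99.70% || 91.00% || 88.50% || 91.15% || 99.70% || 93.13% || 93.85% || yes || 28257 ||
|-
| 13-08-2016 || 91.50% || 99.80% || 91.50% || 89.00% || 91.55% || 99.80% || 93.21% || 93.87% || yes || 29086 ||
|}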
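
The "Target achieved" column can be re-derived from the figures in each row. The sketch below assumes that "yes" means every one of the four corpora reached its target coverage; the page does not state this rule, and the function name `targets_met` is only illustrative. The figures are taken from the 13-08-2016 row.

```python
def targets_met(targets, achieved):
    """Return True if every achieved coverage is at least its target.

    Assumption: "Target achieved = yes" requires all four corpora
    (5925 corpus, GCI corpus, Wiki 12500 words, Akorda) to meet the target.
    """
    return all(a >= t for t, a in zip(targets, achieved))


# Row 13-08-2016, values in percent.
row_targets = [91.50, 99.80, 91.50, 89.00]
row_achieved = [91.55, 99.80, 93.21, 93.87]

print(targets_met(row_targets, row_achieved))

# Stems added between the initial value and the last row of the table.
print(29086 - 21613)
```

Under this reading, a row counts as achieved when a corpus merely equals its target, as the GCI corpus does on 13-08-2016 (99.80% against 99.80%).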


Midterm evaluation

WER% Before:

Statistics about input files
-------------------------------------------------------
Number of words in reference: 800
Number of words in test: 572
Number of unknown words (marked with a star) in test: 46
Percentage of unknown words: 8.04 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 751
Word error rate (WER): 93.88 %
Number of position-independent correct words: 110
Position-independent word error rate (PER): 86.25 %
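
The percentages in this report follow from the raw counts. The sketch below is a minimal check, assuming the usual definitions: WER is the edit distance divided by the number of reference words, and PER is one minus the position-independent correct words (less any surplus of test words over reference words) divided by the number of reference words. The function name `error_rates` and the exact PER formula are assumptions, not taken from the evaluation tool itself.

```python
def error_rates(ref_words, test_words, unknown, edit_distance, pi_correct):
    """Recompute the report's percentages from its raw counts.

    Assumed definitions (not confirmed by the source):
      unknown % = unknown / test_words
      WER       = edit_distance / ref_words
      PER       = (ref_words - (pi_correct - surplus)) / ref_words,
                  where surplus = max(0, test_words - ref_words)
    """
    unknown_pct = 100.0 * unknown / test_words
    wer = 100.0 * edit_distance / ref_words
    surplus = max(0, test_words - ref_words)
    per = 100.0 * (ref_words - (pi_correct - surplus)) / ref_words
    return unknown_pct, wer, per


# Counts from the "WER% Before" report.
unknown_pct, wer, per = error_rates(
    ref_words=800, test_words=572, unknown=46, edit_distance=751, pi_correct=110
)
print(f"unknown: {unknown_pct:.2f} %  WER: {wer:.2f} %  PER: {per:.2f} %")
```

With these counts the three values round to 8.04 %, 93.88 % and 86.25 %, matching the figures reported above.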

Report of GSoC project

List of all commits: https://apertium.projectjj.com/gsoc2016/aidana1.html

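  • The bilingual dictionary was expanded from 21613 to 29086 stems
  • Transfer rules:
    • Chunker rules: 176
    • Interchunk rules: 55
  • Coverage of the four corpora is shown in the table above: http://wiki.apertium.org/w/index.php?title=User:Aidana/Proposal/Working_plan#Expanding_vocabulary
  • Testvoc: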
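{|class=wikitable
! POS !! Total !! Clean !! With<br/>@ !! With<br/># !! Clean<br/>%
|-
| v || 49395 || 49395 || 0 || 0 || 100
|-
| cop || 48464 || 48464 || 0 || 0 || 100
|-
| adj || 20197 || 20197 || 0 || 0 || 100
|-
| n || 12512 || 12512 || 0 || 0 || 100
|-
| prn || 10873 || 10873 || 0 || 0 || 100
|-
| det || 2248 || 2248 || 0 || 0 || 100
|-
| cnjcoo || 1389 || 1389 || 0 || 0 || 100
|-
| vaux || 808 || 808 || 0 || 0 || 100
|-
| post || 464 || 464 || 0 || 0 || 100
|-
| np || 155 || 155 || 0 || 0 || 100
|-
| adv || 45 || 45 || 0 || 0 || 100
|-
| num || 38 || 38 || 0 || 0 || 100
|-
| guio || 1 || 1 || 0 || 0 || 100
|-
| cm || 1 || 1 || 0 || 0 || 100
|-
| ij || 0 || 0 || 0 || 0 || 100
|-
| cnjsub || 0 || 0 || 0 || 0 || 100
|}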