Hectoralos/GSOC 2020 work plan control
Week | Dates | Goal: bidix (excl. proper names) | Goal: coverage | Goal: WER | Goal: testvoc / manual disamb. of frp texts | Fulfilled: frp monodix (excl. proper names) | Fulfilled: bidix (excl. proper names) | Fulfilled: non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: testvoc (clean %) / manual disamb. (words) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419; fra > frp 14,639; frp > fra 16,743 | fra > frp 92.1; frp > fra 95.3 | fra > frp 89.2; frp > fra 78.0 | | |
7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401; fra > frp 15,552; frp > fra 17,498 | fra > frp 93.2; frp > fra 95.4 | fra > frp 90.2; frp > fra 78.5 | | |
8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844; fra > frp 16,915; frp > fra 18,744 | fra > frp 94.5; frp > fra 95.6 | fra > frp 91.6; frp > fra 80.1 | | 0 |
9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | fra > frp 6.6[4]; frp > fra | |
10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
See also
Work plan in the original proposal
Notes
- ↑ The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words: 80% were written by Dominique Stich, and the rest came from several sociopolitical texts.
- ↑ The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2,375 articles, 347,101 words). My impression is that more than 75% of the content was generated by bots, so it is very repetitive and not very representative.
- ↑ Plus a massive inclusion of np.ant and np.cog entries (hence the jump in coverage).
- ↑
Test file: '200727_fra-frp.ini.txt'
Reference file '200727_fra-frp.fin.txt'
Statistics about input files
Number of words in reference: 1180
Number of words in test: 1180
Number of unknown words (marked with a star) in test: 78
Percentage of unknown words: 6.61 %
Results when removing unknown-word marks (stars)
Edit distance: 66
Word error rate (WER): 5.59 %
Number of position-independent correct words: 1121
Position-independent word error rate (PER): 5.00 %
Results when unknown-word marks (stars) are not removed
Edit distance: 125
Word Error Rate (WER): 10.59 %
Number of position-independent correct words: 1062
Position-independent word error rate (PER): 10.00 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 59
Percentage of unknown words that were free rides: 75.64 %
- ↑ I have not (yet) found a way to run a truly random test on texts. I do not know of any online magazine in Arpitan, and the Arpitan Wikipedia is written following several different norms and, moreover, tends to contain articles that are mere templates. I took three articles (France, Rimbaud and Toquio, ~1000 words) that I had not worked on before, and the results I obtained are too good. It is unimaginable that this would happen with real texts, especially when the author applies the norms very loosely in order to adapt them to the local speech (what is called the "narrow spelling", grafia estreta).
Test file: '200727_frp-fra.ini.txt'
Reference file '200727_frp-fra.fin.txt'
Statistics about input files
Number of words in reference: 932
Number of words in test: 939
Number of unknown words (marked with a star) in test: 57
Percentage of unknown words: 6.07 %
Results when removing unknown-word marks (stars)
Edit distance: 100
Word error rate (WER): 10.73 %
Number of position-independent correct words: 848
Position-independent word error rate (PER): 9.76 %
Results when unknown-word marks (stars) are not removed
Edit distance: 115
Word Error Rate (WER): 12.34 %
Number of position-independent correct words: 833
Position-independent word error rate (PER): 11.37 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 15
Percentage of unknown words that were free rides: 26.32 %
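
The percentages in the two evaluation reports above can be re-derived from the raw counts they print. The following is a minimal Python sketch of that arithmetic, not the evaluation tool's own code. All function names are hypothetical. The WER formula (edit distance divided by the number of reference words) is the standard definition. The PER formula used here is an assumption inferred from the figures: it reproduces both reports, but it has not been checked against the tool's source. `word_edit_distance` is a plain word-level Levenshtein distance, included only to show what "Edit distance" counts.

```python
def word_edit_distance(reference, test):
    """Word-level Levenshtein distance between two lists of tokens."""
    previous = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, test_word in enumerate(test, start=1):
            substitution = previous[j - 1] + (ref_word != test_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the reports."""
    return round(100.0 * numerator / denominator, 2)


def wer(edit_distance, ref_words):
    """Word error rate: edit distance over the reference length."""
    return pct(edit_distance, ref_words)


def per(correct, ref_words, test_words):
    """Position-independent error rate.

    Assumption: inferred from the reported numbers, not from the tool.
    """
    return pct(max(ref_words, test_words) - correct, ref_words)


# Counts copied from the fra > frp report (stars removed).
print(wer(66, 1180))         # WER
print(per(1121, 1180, 1180)) # PER
print(pct(59, 78))           # free rides among unknown words

# Counts copied from the frp > fra report (stars removed).
print(wer(100, 932))
print(per(848, 932, 939))
print(pct(15, 57))
```

Note that the unknown-word percentage appears to be computed over the number of words in the test file (57 of 939 for frp > fra), while WER and PER are divided by the number of words in the reference.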