Hectoralos/GSOC 2020 work plan control
| Week | Dates | Goal: bidix (excl. proper names) | Goal: coverage | Goal: WER | Goal: testvoc / manual disamb. of frp texts | Fulfilled: frp monodix (excl. proper names) | Fulfilled: bidix (excl. proper names) | Fulfilled: non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: testvoc (clean %) / manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
| 6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419; fra > frp 14,639; frp > fra 16,743 | fra > frp 92.1; frp > fra 95.3 | fra > frp 89.2; frp > fra 78.0 | | |
| 7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401; fra > frp 15,552; frp > fra 17,498 | fra > frp 93.2; frp > fra 95.4 | fra > frp 90.2; frp > fra 78.5 | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844; fra > frp 16,915; frp > fra 18,744 | fra > frp 94.5; frp > fra 95.6 | fra > frp 91.6; frp > fra 80.1 | | 0 |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | fra > frp -; frp > fra 10.7[4] | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
See also
Work plan in the original proposal
Notes
1. The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. About 80% of it was written by Dominique Stich; the rest consists of several sociopolitical texts.
2. The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for the days of the year (2,375 articles, 347,101 words). My impression is that over 75% of the content is bot-generated, so it is very repetitive and not very representative.
3. Plus a massive inclusion of np.ant and np.cog entries (hence the jump in coverage).
4. I have not (yet) found a way to run a truly random test on texts. I do not know of any online magazine in Arpitan, and the Arpitan Wikipedia is written in several different orthographic norms and, moreover, often has articles that are mere templates. I took three articles (France, Rimbaud and Toquio, ~1,000 words) that I had not worked on before, and the results I obtained are too good. It is unimaginable that this would happen with real texts, especially when the author applies the norms very loosely in order to adapt them to the local speech (what is called "grafia estreta", i.e. narrow spelling).
 Test file: '200727_frp-fra.ini.txt'
 Reference file '200727_frp-fra.fin.txt'
 
 Statistics about input files
 
 Number of words in reference: 932 
 Number of words in test: 939
 Number of unknown words (marked with a star) in test: 57
 Percentage of unknown words: 6.07 %
 
 Results when removing unknown-word marks (stars)
 
 Edit distance: 100 
 Word error rate (WER): 10.73 %
 Number of position-independent correct words: 848
 Position-independent word error rate (PER): 9.76 %
 
 Results when unknown-word marks (stars) are not removed
 
 Edit distance: 115 
 Word Error Rate (WER): 12.34 %
 Number of position-independent correct words: 833
 Position-independent word error rate (PER): 11.37 %
 
 Statistics about the translation of unknown words
 
 Number of unknown words which were free rides: 15 
 Percentage of unknown words that were free rides: 26.32 %
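
The percentages in the report above can be re-derived from its raw counts. The following is a minimal Python sketch, not the evaluation tool's own code: `edit_distance`, `wer` and `pct` are hypothetical helper names, WER is taken as the word-level edit distance divided by the number of reference words, and the PER formula (words in test minus position-independent correct words, over words in reference) is an assumption inferred from the figures, since it reproduces them.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def pct(part, whole):
    """Percentage rounded to two decimals, as in the report."""
    return round(100.0 * part / whole, 2)


# Counts copied from the report above.
REF_WORDS, TEST_WORDS, UNKNOWN, FREE_RIDES = 932, 939, 57, 15

report = {
    "unknown": pct(UNKNOWN, TEST_WORDS),               # 6.07
    "wer_no_stars": pct(100, REF_WORDS),               # 10.73
    "wer_stars": pct(115, REF_WORDS),                  # 12.34
    # Assumed PER formula, inferred from the figures:
    "per_no_stars": pct(TEST_WORDS - 848, REF_WORDS),  # 9.76
    "per_stars": pct(TEST_WORDS - 833, REF_WORDS),     # 11.37
    "free_rides": pct(FREE_RIDES, UNKNOWN),            # 26.32
}

if __name__ == "__main__":
    # Toy tokens, not Arpitan text: one substitution plus one missing word.
    print(wer("a b c d".split(), "a x c".split()))  # 50.0
    print(report)
```

The 10.7 entered for frp > fra in week 9 of the table is the 10.73 % WER obtained with the unknown-word stars removed.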

