Difference between revisions of "Hectoralos/GSOC 2020 work plan control"
Revision as of 14:15, 28 July 2020
| Week | Dates | Goal: Bidix (excl. proper names) | Goal: Coverage | Goal: WER | Goal: Testvoc / Manual disamb. of frp texts | Fulfilled: Frp monodix (excl. proper names) | Fulfilled: Bidix (excl. proper names) | Fulfilled: Non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: Testvoc (clean %) / Manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
| 6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419 (fra > frp 14,639; frp > fra 16,743) | fra > frp 92.1; frp > fra 95.3 | fra > frp 89.2; frp > fra 78.0 | | |
| 7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401 (fra > frp 15,552; frp > fra 17,498) | fra > frp 93.2; frp > fra 95.4 | fra > frp 90.2; frp > fra 78.5 | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844 (fra > frp 16,915; frp > fra 18,744) | fra > frp 94.5; frp > fra 95.6 | fra > frp 91.6; frp > fra 80.1 | | 0 |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | fra > frp -; frp > fra 10.7[4] | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
See also
Work plan in the original proposal
Notes
1. The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% of it was written by Dominique Stich, and the rest consists of several sociopolitical texts.
2. The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2,375 articles, 347,101 words). My impression is that over 75% of the content was generated by bots, so it is very repetitive and not very representative.
3. Plus a massive inclusion of np.ant and np.cog entries (hence the jump in coverage).
4. WER evaluation output (frp > fra):
 Test file: '200727_frp-fra.ini.txt'
 Reference file '200727_frp-fra.fin.txt'
 
 Statistics about input files
 
 Number of words in reference: 932 
 Number of words in test: 939
 Number of unknown words (marked with a star) in test: 57
 Percentage of unknown words: 6.07 %
 
 Results when removing unknown-word marks (stars)
 
 Edit distance: 100 
 Word error rate (WER): 10.73 %
 Number of position-independent correct words: 848
 Position-independent word error rate (PER): 9.76 %
 
 Results when unknown-word marks (stars) are not removed
 
 Edit distance: 115 
 Word Error Rate (WER): 12.34 %
 Number of position-independent correct words: 833
 Position-independent word error rate (PER): 11.37 %
 
 Statistics about the translation of unknown words
 
 Number of unknown words which were free rides: 15 
 Percentage of unknown words that were free rides: 26.32 %
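
The percentages in the report above can be rechecked from its raw counts. The following is a minimal sketch, not the evaluation tool's own code: the helper names `pct` and `word_edit_distance` are hypothetical, and the PER formula used here, (max(reference words, test words) - correct words) / reference words, is an assumption inferred from the figures themselves.

```python
def pct(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


def word_edit_distance(reference, test):
    """Word-level Levenshtein distance between two whitespace-tokenised strings."""
    ref, hyp = reference.split(), test.split()
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the test
                            curr[j - 1] + 1,      # extra word in the test
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Counts copied from the report above.
REF_WORDS, TEST_WORDS, UNKNOWN, FREE_RIDES = 932, 939, 57, 15

# WER = edit distance / number of reference words.
wer_without_stars = pct(100, REF_WORDS)
wer_with_stars = pct(115, REF_WORDS)

# PER (assumed formula, see the lead-in): position-independent errors
# divided by the number of reference words.
longest = max(REF_WORDS, TEST_WORDS)
per_without_stars = pct(longest - 848, REF_WORDS)
per_with_stars = pct(longest - 833, REF_WORDS)

unknown_rate = pct(UNKNOWN, TEST_WORDS)
free_ride_rate = pct(FREE_RIDES, UNKNOWN)

print(wer_without_stars, per_without_stars)
print(wer_with_stars, per_with_stars)
print(unknown_rate, free_ride_rate)
```

Under these assumptions the computed values match the report (10.73 / 9.76, 12.34 / 11.37, 6.07 / 26.32). `word_edit_distance` only illustrates how an edit distance such as 100 or 115 would be obtained from a reference and a test translation; it is not applied to the actual test files here.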

