Difference between revisions of "Hectoralos/GSOC 2020 work plan control"
Revision as of 14:48, 28 July 2020
| Week | Dates | Goal: Bidix (excl. proper names) | Goal: Coverage | Goal: WER | Goal: Testvoc / Manual disamb. of frp texts | Fulfilled: Frp monodix (excl. proper names) | Fulfilled: Bidix (excl. proper names) | Fulfilled: Non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: Testvoc (clean %) / Manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
| 6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419 (fra > frp 14,639; frp > fra 16,743) | fra > frp 92.1; frp > fra 95.3 | fra > frp 89.2; frp > fra 78.0 | | |
| 7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401 (fra > frp 15,552; frp > fra 17,498) | fra > frp 93.2; frp > fra 95.4 | fra > frp 90.2; frp > fra 78.5 | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844 (fra > frp 16,915; frp > fra 18,744) | fra > frp 94.5; frp > fra 95.6 | fra > frp 91.6; frp > fra 80.1 | | 0 |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | fra > frp (blank); frp > fra 10.7[4] | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
See also
Work plan in the original proposal
Notes
- ↑ The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% were written by Dominique Stich, and the rest were several sociopolitical texts.
- ↑ The Arpitan Wikipedia corpus contains all the articles written in ORB except the 366 day-of-the-year articles (2,375 articles, 347,101 words). My impression is that over 75% of the content was generated by bots, so it is very repetitive and not very representative.
- ↑ Includes a massive addition of np.ant and np.cog entries (hence the jump in coverage).
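- ↑ I have not (yet) found a way to run a truly random test on texts. I do not know of any online magazine in Arpitan, and the Arpitan Wikipedia is written in several different orthographic standards and, moreover, tends to contain articles that are mere templates. I took three articles (France: https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_France.frp.txt, Rimbaud: https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_Rimbaud.frp.txt and Toquio: https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_Toquio.frp.txt, ~1000 words) that I had not worked on before, and the results were too good. It is unimaginable that this would happen with real texts, especially when they are written with a very loose application of the norms in order to adapt them to local speech (the so-called "narrow spelling", "grafia estreta").
Test file: '200727_frp-fra.ini.txt' (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.ini.txt)
Reference file: '200727_frp-fra.fin.txt' (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.fin.txt)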
Statistics about input files
Number of words in reference: 932
Number of words in test: 939
Number of unknown words (marked with a star) in test: 57
Percentage of unknown words: 6.07 %
Results when removing unknown-word marks (stars)
Edit distance: 100
Word error rate (WER): 10.73 %
Number of position-independent correct words: 848
Position-independent word error rate (PER): 9.76 %
Results when unknown-word marks (stars) are not removed
Edit distance: 115
Word Error Rate (WER): 12.34 %
Number of position-independent correct words: 833
Position-independent word error rate (PER): 11.37 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 15
Percentage of unknown words that were free rides: 26.32 %
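
The percentages in the evaluation output above can be cross-checked from the raw counts. The sketch below is only an illustration, not the evaluation script itself: `word_edit_distance` and `percent` are hypothetical helper names, the tokenisation is a plain whitespace split, and the PER formula (test words minus position-independent correct words, divided by reference words) is inferred from the reported figures rather than taken from the tool's documentation.

```python
def word_edit_distance(reference: str, test: str) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    ref, hyp = reference.split(), test.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word missing from the test text
                               current[j - 1] + 1,       # extra word in the test text
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def percent(numerator: int, denominator: int) -> float:
    """Percentage rounded to two decimals, as in the quoted output."""
    return round(100.0 * numerator / denominator, 2)


# Counts copied from the note above.
REF_WORDS, TEST_WORDS = 932, 939
UNKNOWN_WORDS, FREE_RIDES = 57, 15

wer_without_stars = percent(100, REF_WORDS)             # edit distance 100
wer_with_stars = percent(115, REF_WORDS)                # edit distance 115
per_without_stars = percent(TEST_WORDS - 848, REF_WORDS)  # assumed PER formula
per_with_stars = percent(TEST_WORDS - 833, REF_WORDS)     # assumed PER formula
unknown_rate = percent(UNKNOWN_WORDS, TEST_WORDS)
free_ride_rate = percent(FREE_RIDES, UNKNOWN_WORDS)

# Toy usage of the distance function on made-up sentences:
# one substitution ("sat" -> "sit") plus one missing word ("the").
toy_distance = word_edit_distance("the cat sat on the mat", "the cat sit on mat")
```

Under these assumptions the recomputed values agree with the reported 10.73 %, 12.34 %, 9.76 %, 11.37 %, 6.07 % and 26.32 %. WER is normalised by the reference length (932 words), while the unknown-word rate is normalised by the test length (939 words).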