Difference between revisions of "Hectoralos/GSOC 2020 work plan control"
Revision as of 08:36, 30 July 2020
| Week | Dates | Goal: bidix (excl. proper names) | Goal: coverage | Goal: WER | Goal: testvoc / manual disamb. of frp texts | Fulfilled: frp monodix (excl. proper names) | Fulfilled: bidix (excl. proper names) | Fulfilled: non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: testvoc (clean %) / manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra 74.7 | fra > frp 61.3; frp > fra 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
| 6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419 (fra > frp 14,639; frp > fra 16,743) | fra > frp 92.1; frp > fra 95.3 | fra > frp 89.2; frp > fra 78.0 | | |
| 7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401 (fra > frp 15,552; frp > fra 17,498) | fra > frp 93.2; frp > fra 95.4 | fra > frp 90.2; frp > fra 78.5 | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844 (fra > frp 16,915; frp > fra 18,744) | fra > frp 94.5; frp > fra 95.6 | fra > frp 91.6; frp > fra 80.1 | | 0 |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | fra > frp 5.5[4]; frp > fra 10.7[5] | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
See also
Work plan in the original proposal
Notes
- ↑ The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% were written by Dominique Stich, and the rest consisted of several sociopolitical texts.
- ↑ The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for the days of the year (2375 articles, 347,101 words). My impression is that more than 75% of the content was generated by bots, so it is very repetitive and not very representative.
- ↑ + massive inclusion of np.ant and np.cog (thus the jump in the coverage)
- ↑ The test used three randomly selected texts (~1000 words in total): two "good" Wikipedia articles (In My Life and Niger (cheval)) and the most prominent article on lemonde.fr at the time of the test (Barbara Pompili : « C’est le bon moment pour gagner la bataille écologique »). The results are surprisingly good, even though the translator still needs work. Several factors may explain this: the articles may have been easy; the revision (by a native speaker) may have been too lenient; the closeness of the two languages and the great internal variety of Arpitan may make it harder to judge the translator's choice wrong than it would be for more distant languages with well-established standards; and, of course, it was a huge advantage to have an extensive electronic dictionary and the help of two excellent language experts who are very committed to the project.
Test file: '200727_fra-frp.ini.txt'
Reference file '200727_fra-frp.fin.txt'
Statistics about input files
Number of words in reference: 1180
Number of words in test: 1180
Number of unknown words (marked with a star) in test: 78
Percentage of unknown words: 6.61 %
Results when removing unknown-word marks (stars)
Edit distance: 65
Word error rate (WER): 5.51 %
Number of position-independent correct words: 1122
Position-independent word error rate (PER): 4.92 %
Results when unknown-word marks (stars) are not removed
Edit distance: 125
Word Error Rate (WER): 10.59 %
Number of position-independent correct words: 1062
Position-independent word error rate (PER): 10.00 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 60
Percentage of unknown words that were free rides: 76.92 %
- ↑ I have not (yet) found a way to run a truly random test of texts. I do not know of any online magazine in Arpitan, and the Arpitan Wikipedia is written under several different norms and, moreover, tends to contain articles that are mere templates. I took three articles (France, Rimbaud and Toquio, ~1000 words) that I had not worked on before, and the results I obtained are too good. It is unimaginable that this would happen with real texts, especially when they are written with a very loose interpretation of the norms in order to adapt them to local speech (what is called "grafia estreta", narrow spelling).
Test file: '200727_frp-fra.ini.txt'
Reference file '200727_frp-fra.fin.txt'
Statistics about input files
Number of words in reference: 932
Number of words in test: 939
Number of unknown words (marked with a star) in test: 57
Percentage of unknown words: 6.07 %
Results when removing unknown-word marks (stars)
Edit distance: 100
Word error rate (WER): 10.73 %
Number of position-independent correct words: 848
Position-independent word error rate (PER): 9.76 %
Results when unknown-word marks (stars) are not removed
Edit distance: 115
Word Error Rate (WER): 12.34 %
Number of position-independent correct words: 833
Position-independent word error rate (PER): 11.37 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 15
Percentage of unknown words that were free rides: 26.32 %
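
The percentages in the two evaluation reports above can be checked from the raw counts. The sketch below is only a sanity check under stated assumptions, not the evaluation tool's own code: the formulas (WER as edit distance over reference words, PER as the larger of the two word counts minus the position-independent correct words, over reference words, and the unknown-word percentage over test words) are inferred from the reported numbers, and the names `rate` and `summarize` are hypothetical.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a percentage, rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


def summarize(ref_words: int, test_words: int, unknown: int,
              edit_distance: int, pi_correct: int, free_rides: int) -> dict:
    """Recompute the reported rates from raw counts.

    Assumed formulas (inferred from the figures in the notes):
      unknown % = unknown / test_words
      WER       = edit_distance / ref_words
      PER       = (max(ref_words, test_words) - pi_correct) / ref_words
      free rides % = free_rides / unknown
    """
    return {
        "unknown_pct": rate(unknown, test_words),
        "wer": rate(edit_distance, ref_words),
        "per": rate(max(ref_words, test_words) - pi_correct, ref_words),
        "free_ride_pct": rate(free_rides, unknown),
    }


# Counts copied from the reports (results with unknown-word stars removed).
fra_frp = summarize(ref_words=1180, test_words=1180, unknown=78,
                    edit_distance=65, pi_correct=1122, free_rides=60)
frp_fra = summarize(ref_words=932, test_words=939, unknown=57,
                    edit_distance=100, pi_correct=848, free_rides=15)

print("fra > frp:", fra_frp)
print("frp > fra:", frp_fra)
```

With these assumptions the recomputed values agree with the reported ones for both directions. The same formulas also match the results where the stars are not removed (for example, edit distance 125 over 1180 reference words gives 10.59 %).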