Hectoralos/GSOC 2020 work plan control
Latest revision as of 06:14, 27 August 2020
| Week | Dates | Goal: Bidix (excl. proper names) | Goal: Coverage | Goal: WER | Goal: Testvoc / Manual disamb. of frp texts | Fulfilled: Frp monodix (excl. proper names) | Fulfilled: Bidix (excl. proper names) | Fulfilled: Non-WP coverage (%)[1] | Fulfilled: WP coverage (%)[2] | Fulfilled: WER (%) | Fulfilled: Testvoc (clean %) / Manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5 / frp > fra 74.7 | fra > frp 61.3 / frp > fra 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6 / frp > fra 82.5 | fra > frp 74.1 / frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp - / frp > fra 86.3 | fra > frp 77.7 / frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2 / frp > fra 90.7 | fra > frp 81.8 / frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80% / frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6 / frp > fra 92.6 | fra > frp 84.3 / frp > fra 75.3 | | |
| 6 | 6 July - 12 July | ~7,500 | | | | 17,392[3] | 18,419 (fra > frp 14,639 / frp > fra 16,743) | fra > frp 92.1 / frp > fra 95.3 | fra > frp 89.2 / frp > fra 78.0 | | |
| 7 | 13 July - 19 July | ~8,500 | | | | 18,160 | 19,401 (fra > frp 15,552 / frp > fra 17,498) | fra > frp 93.2 / frp > fra 95.4 | fra > frp 90.2 / frp > fra 78.5 | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | 19,411 | 20,844 (fra > frp 16,915 / frp > fra 18,744) | fra > frp 94.5 / frp > fra 95.6 | fra > frp 91.6 / frp > fra 80.1 | | 0 |
| 9 | 27 July - 2 August | ~10,500 | fra > frp ~89% / frp > fra ~92% | fra > frp <25% | Disamb. of frp texts | 19,813 | 21,465 (fra > frp 17,425 / frp > fra 19,216) | fra > frp 94.8 / frp > fra 95.8 | fra > frp 91.9 / frp > fra 80.8 | fra > frp 5.5[4] / frp > fra 10.7[5] | 0[6] |
| 10 | 3 August - 9 August | ~11,500 | | | | 21,193 | 23,086 (fra > frp 18,902 / frp > fra 20,502) | fra > frp 95.3 / frp > fra 95.9 | fra > frp 92.6 / frp > fra 81.2 | | |
| 11 | 10 August - 16 August | ~12,500 | | | Testvoc: closed categories, vblex | 21,775 | 23,718 (fra > frp 19,472 / frp > fra 21,061) | fra > frp 95.6 / frp > fra 95.9 | fra > frp 92.6 / frp > fra 81.3 | | non-verbs: fra > frp 100% / frp > fra 100% |
| 12 | 17 August - 23 August | ~12,750 | | | Testvoc: adj, adv | 22,407 | 24,459 (fra > frp 20,142 / frp > fra 21,694) | fra > frp 95.7 / frp > fra 96.0 | fra > frp 92.7 / frp > fra 81.4 | | verbs: fra > frp 99.99% / frp > fra 100% |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0% / frp > fra ~93.0% | fra > frp <20% / frp > fra <25% | Testvoc: n | 22,667 | 24,775 (fra > frp 20,423 / frp > fra 21,964) | fra > frp 95.8 / frp > fra 96.0 | fra > frp 92.8 / frp > fra 81.5 | fra > frp 5.7[7] / frp > fra 15.5[8] | fra > frp 100% / frp > fra 100% |
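
The two coverage columns above presumably report naive corpus coverage, i.e. the share of tokens for which the morphological analyser returns at least one analysis; in the Apertium stream format, unknown words carry a star. As a rough illustration only, and not the script actually used to produce the figures in the table, a minimal Python sketch of that computation could look like this (the analyser file name in the comment is an assumption):

```python
import re
import sys

# Minimal sketch, not the project's actual evaluation script: naive coverage
# over the output of the Apertium morphological analyser, produced e.g. with
#   cat corpus.txt | apertium-destxt | lt-proc frp-fra.automorf.bin > analysed.txt
# (the .bin file name is an assumption). In the stream format each lexical unit
# looks like ^surface/analysis1/analysis2$; an unknown word has a single
# pseudo-analysis that starts with '*'.
LU = re.compile(r'\^([^/^$]*)/([^$]*)\$')

def coverage(analysed_text: str) -> float:
    total = known = 0
    for match in LU.finditer(analysed_text):
        total += 1
        if not match.group(2).startswith('*'):  # '*' marks an unknown word
            known += 1
    return 100.0 * known / total if total else 0.0

if __name__ == '__main__':
    print(f'coverage: {coverage(sys.stdin.read()):.1f} %')
```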
See also
Work plan in the original proposal (Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan)
Notes
- ↑ The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words; 80% of them were written by Dominique Stich, and the rest consisted of various sociopolitical texts.
- ↑ The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for the individual days of the year (2,375 articles, 347,101 words). My impression is that over 75% of the content is generated by bots, so it is very repetitive and not very representative.
- ↑ Plus a massive inclusion of np.ant and np.cog entries (hence the jump in coverage).
- ↑
The test was conducted on three randomly selected texts (total: ~1,000 words): two "good" Wikipedia articles (In My Life and Niger (cheval)) and the most prominent article on lemonde.fr at the time of the test (Barbara Pompili : « C’est le bon moment pour gagner la bataille écologique »). The results are surprisingly good, even though the translator still needs work. There may be several reasons for this: the articles may have been easy; the revision (by a native speaker) may have been too lenient; the close proximity of the two languages and the great internal variety of Arpitan may make it harder to consider a translator's choice wrong than for more distant languages with well-established standards; and, of course, it has been a huge advantage to be able to use an extensive electronic dictionary and to have the help of two great language experts who are very committed to the project.
Test file: 200727_fra-frp.ini.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_fra-frp.ini.txt)
Reference file: 200727_fra-frp.fin.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_fra-frp.fin.txt)
Statistics about input files
Number of words in reference: 1180
Number of words in test: 1180
Number of unknown words (marked with a star) in test: 78
Percentage of unknown words: 6.61 %
Results when removing unknown-word marks (stars)
Edit distance: 65
Word error rate (WER): 5.51 %
Number of position-independent correct words: 1122
Position-independent word error rate (PER): 4.92 %
Results when unknown-word marks (stars) are not removed
Edit distance: 125
Word Error Rate (WER): 10.59 %
Number of position-independent correct words: 1062
Position-independent word error rate (PER): 10.00 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 60
Percentage of unknown words that were free rides: 76.92 %
- ↑ I have not (yet) found a way to run a truly random test on texts. I do not know of any electronic magazine in Arpitan on the web, and the Arpitan Wikipedia is written following different norms and, moreover, tends to contain articles that are mere templates. I took three articles (France, Rimbaud and Toquio, ~1,000 words) that I had not worked on before, and the results were too good. It is unimaginable that this would happen with real texts, especially when people write with a very broad latitude in applying the norms in order to adapt them to the local speech (what is known as "grafia estreta", "narrow spelling").
Test file: 200727_frp-fra.ini.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.ini.txt)
Reference file: 200727_frp-fra.fin.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.fin.txt)
Statistics about input files
Number of words in reference: 932
Number of words in test: 939
Number of unknown words (marked with a star) in test: 57
Percentage of unknown words: 6.07 %
Results when removing unknown-word marks (stars)
Edit distance: 100
Word error rate (WER): 10.73 %
Number of position-independent correct words: 848
Position-independent word error rate (PER): 9.76 %
Results when unknown-word marks (stars) are not removed
Edit distance: 115
Word Error Rate (WER): 12.34 %
Number of position-independent correct words: 833
Position-independent word error rate (PER): 11.37 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 15
Percentage of unknown words that were free rides: 26.32 %
- ↑ Disambiguation is working pretty well, using the French prob-file and ad hoc CG rules, so I used the time for other things.
- ↑
The test was conducted on four randomly selected texts (total: ~1,600 words): three "good" Wikipedia articles (Cheval au Togo, Élisabeth de Bavière and Nyctalope) and the most prominent article on lemonde.fr at the time of the test (Pour Emmanuel Macron, un été de crises diplomatiques).
Test file: 200820_fra-frp.ini.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200820_fra-frp.ini.txt)
Reference file: 200820_fra-frp.fin.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200820_fra-frp.fin.txt)
Statistics about input files
Number of words in reference: 1635
Number of words in test: 1629
Number of unknown words (marked with a star) in test: 61
Percentage of unknown words: 3.74 %
Results when removing unknown-word marks (stars)
Edit distance: 93
Word error rate (WER): 5.69 %
Number of position-independent correct words: 1548
Position-independent word error rate (PER): 5.32 %
Results when unknown-word marks (stars) are not removed
Edit distance: 113
Word Error Rate (WER): 6.91 %
Number of position-independent correct words: 1528
Position-independent word error rate (PER): 6.54 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 20
Percentage of unknown words that were free rides: 32.79 %
- ↑
The test was conducted on randomly selected texts (total: ~1,700 words): 30 (!) randomly selected Wikipedia articles (200826_WP.fra.txt; there are no labelled "good articles" in the Arpitan Wikipedia, and most articles have 1-3 sentences), adding up to 1,063 words, plus the first three pages of the "manual" of the site arpitan.eu (200826_Manuel.frp.txt), 636 words.
Test file: 200826_WP_frp-fra.ini.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200826_WP_frp-fra.ini.txt)
Reference file: 200826_WP_frp-fra.fin.txt (https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200826_WP_frp-fra.fin.txt)
Statistics about input files
Number of words in reference: 1699
Number of words in test: 1703
Number of unknown words (marked with a star) in test: 269
Percentage of unknown words: 15.80 %
Results when removing unknown-word marks (stars)
Edit distance: 264
Word error rate (WER): 15.54 %
Number of position-independent correct words: 1480
Position-independent word error rate (PER): 13.13 %
Results when unknown-word marks (stars) are not removed
Edit distance: 372
Word Error Rate (WER): 21.90 %
Number of position-independent correct words: 1373
Position-independent word error rate (PER): 19.42 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 108
Percentage of unknown words that were free rides: 40.15 %
There is a noticeable difference between the results on the Wikipedia texts (WER: 13.4%) and on the manual (WER: 19.0%). It seems significant that the author of the manual is from the Aosta Valley and uses constructions that are less similar to French ones (which does not at all mean that his Arpitan is worse than any other variety; if anything, the opposite could be argued).
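
The WER and PER statistics quoted in the notes above appear to follow the output conventions of apertium-eval-translator: a word-level edit distance between the raw machine translation (the .ini.txt test file) and its post-edited version (the .fin.txt reference file), computed once with the unknown-word stars stripped and once with them kept, plus a position-independent rate that ignores word order. The following Python sketch is only an approximation under those assumptions, not the evaluation tool itself:

```python
# Sketch of the WER/PER computation reported in the notes above, assuming the
# usual word-level definitions; it is an illustration, not the tool that
# actually produced the figures.
from collections import Counter

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (wa != wb)))   # substitution / match
        prev = cur
    return prev[-1]

def wer_per(test_text, ref_text, strip_stars=True):
    test, ref = test_text.split(), ref_text.split()
    if strip_stars:
        # '*' marks unknown words in the raw translator output
        # ("results when removing unknown-word marks")
        test = [w.replace('*', '') for w in test]
    wer = 100.0 * edit_distance(test, ref) / len(ref)
    # Position-independent correct words: bag-of-words overlap, order ignored
    correct = sum((Counter(test) & Counter(ref)).values())
    per = 100.0 * (max(len(test), len(ref)) - correct) / len(ref)
    return wer, per

# Hypothetical usage with one of the test/reference pairs listed above:
# wer, per = wer_per(open('200727_fra-frp.ini.txt').read(),
#                    open('200727_fra-frp.fin.txt').read())
```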