Difference between revisions of "Hectoralos/GSOC 2020 work plan control"

From Apertium
 
(8 intermediate revisions by the same user not shown)
Line 11: Line 11:
 
! style="width: 6%" | Frp monodix<br>(excl.<br>proper names)
 
! style="width: 6%" | Frp monodix<br>(excl.<br>proper names)
 
! style="width: 6%" | Bidix<br>(excl.<br>proper names)
 
! style="width: 6%" | Bidix<br>(excl.<br>proper names)
 
! style="width: 4%" | Non-WP<br>coverage<br>(%)<ref>The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% were written by Dominique Stich, and the rest were several sociopolitical texts.</ref>
! style="width: 4%" | Non-WP<br>coverage<br>(%)
 
 
! style="width: 4%" | WP<br>coverage<br>(%)<ref>The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2375 articles, 347,101 words). My impression is that more than 75% of the content is bot-generated, so it is very repetitive and not very representative.</ref>
! style="width: 4%" | WP<br>coverage<br>(%)
 
 
! style="width: 8%" | WER<br>(%)
 
! style="width: 8%" | WER<br>(%)
 
! style="width: 4%" | Testvoc<br>(clean %)<br>---<br>Manual disamb.<br>(words)
 
! style="width: 4%" | Testvoc<br>(clean %)<br>---<br>Manual disamb.<br>(words)
Line 24: Line 24:
 
| style="text-align:center" | 7,006
 
| style="text-align:center" | 7,006
 
| style="text-align:center" | 1,213
 
| style="text-align:center" | 1,213
| style="text-align:center" | fra > frp<br>64.5<br>frp > fra*<br>74.7
+
| style="text-align:center" | fra > frp<br>64.5<br>frp > fra<br>74.7
| style="text-align:center" | fra > frp<br>61.3<br>frp > fra**<br>47.9
+
| style="text-align:center" | fra > frp<br>61.3<br>frp > fra<br>47.9
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
Line 100: Line 100:
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
| style="text-align:center" |
+
| style="text-align:center" | 18,160
| style="text-align:center" |
+
| style="text-align:center" | 19,401<br>fra > frp<br>15,552<br>frp > fra<br>17,498
| style="text-align:center" |
+
| style="text-align:center" | fra > frp<br>93.2<br>frp > fra<br>95.4
| style="text-align:center" |
+
| style="text-align:center" | fra > frp<br>90.2<br>frp > fra<br>78.5
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
Line 113: Line 113:
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" | Disamb. of frp texts
 
| style="text-align:center" | Disamb. of frp texts
 
| style="text-align:center" | 19,411
  +
| style="text-align:center" | 20,844<br>fra > frp<br>16,915<br>frp > fra<br>18,744
 
| style="text-align:center" | fra > frp<br>94.5<br>frp > fra<br>95.6
 
| style="text-align:center" | fra > frp<br>91.6<br>frp > fra<br>80.1
 
| style="text-align:center" |
 
| style="text-align:center" |
| style="text-align:center" |
+
| style="text-align:center" | 0
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
 
|-
 
|-
 
! 9
 
! 9
Line 125: Line 126:
 
| style="text-align:center" | fra-frp '''<25%'''
 
| style="text-align:center" | fra-frp '''<25%'''
 
| style="text-align:center" | Disamb. of frp texts
 
| style="text-align:center" | Disamb. of frp texts
| style="text-align:center" |
+
| style="text-align:center" | 19,813
| style="text-align:center" |
+
| style="text-align:center" | 21,465<br>fra > frp<br>17,425<br>frp > fra<br>19,216
| style="text-align:center" |
+
| style="text-align:center" | fra > frp<br>94.8<br>frp > fra<br>95.8
| style="text-align:center" |
+
| style="text-align:center" | fra > frp<br>91.9<br>frp > fra<br>80.8
| style="text-align:center" |
+
| style="text-align:center" | fra > frp<br>5.5<ref>
  +
The test was conducted using three randomly selected texts (total: ~1000 words): two "good" Wikipedia articles ([https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_MyLife.fra.txt In My Life] and [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_Niger.fra.txt Niger (cheval)]) and the most prominent article on lemonde.fr at the time of the test ([https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_LM_Pompili.fra.txt Barbara Pompili : « C’est le bon moment pour gagner la bataille écologique »]). The results are surprisingly good, even though the translator still needs work. There may be several reasons for this: the articles may have been easy; the revision (by a native speaker) may have been too lenient; the closeness of the two languages and the great internal variety of Arpitan may make it harder to judge the translator's choice as wrong than it would be for more distant languages with well-established standards; and, of course, it is a huge advantage to have been able to use an extensive electronic dictionary and to have the help of two great language experts who are very committed to the project.<br>
| style="text-align:center" |
 
  +
Test file: [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_fra-frp.ini.txt '200727_fra-frp.ini.txt']<br>
  +
Reference file [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_fra-frp.fin.txt '200727_fra-frp.fin.txt']<br>
  +
<br>
  +
Statistics about input files<br>
  +
-------------------------------------------------------
  +
Number of words in reference: 1180<br>
  +
Number of words in test: 1180<br>
  +
Number of unknown words (marked with a star) in test: 78<br>
  +
Percentage of unknown words: 6.61 %<br>
  +
<br>
  +
Results when removing unknown-word marks (stars)<br>
  +
-------------------------------------------------------
  +
Edit distance: 65<br>
  +
Word error rate (WER): 5.51 %<br>
  +
Number of position-independent correct words: 1122<br>
  +
Position-independent word error rate (PER): 4.92 %<br>
  +
<br>
  +
Results when unknown-word marks (stars) are not removed<br>
  +
-------------------------------------------------------
  +
Edit distance: 125<br>
  +
Word Error Rate (WER): 10.59 %<br>
  +
Number of position-independent correct words: 1062<br>
  +
Position-independent word error rate (PER): 10.00 %<br>
  +
<br>
  +
Statistics about the translation of unknown words<br>
  +
-------------------------------------------------------
  +
Number of unknown words which were free rides: 60<br>
  +
Percentage of unknown words that were free rides: 76.92 %
  +
</ref><br>
  +
frp > fra<br>10.7<ref>I have not (yet) found a way to run a truly random test on texts. I know of no online magazine in Arpitan, and the Arpitan Wikipedia is written following several different norms and, moreover, tends to contain articles that are mere templates. I took three articles ([https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_France.frp.txt France], [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_Rimbaud.frp.txt Rimbaud] and [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_WP_Toquio.frp.txt Toquio], ~1000 words) that I had not worked on before, and obtained results that are too good. It is unimaginable that this would happen with real texts, especially those written with a very broad interpretation of the norms in order to adapt them to local speech (what is called "grafia estreta", narrow spelling).<br>
  +
Test file: [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.ini.txt '200727_frp-fra.ini.txt']<br>
  +
Reference file [https://raw.githubusercontent.com/apertium/apertium-fra-frp/master/tests/200727_frp-fra.fin.txt '200727_frp-fra.fin.txt']<br>
  +
<br>
  +
Statistics about input files<br>
  +
-------------------------------------------------------
  +
Number of words in reference: 932<br>
  +
Number of words in test: 939<br>
  +
Number of unknown words (marked with a star) in test: 57<br>
  +
Percentage of unknown words: 6.07 %<br>
  +
<br>
  +
Results when removing unknown-word marks (stars)<br>
  +
-------------------------------------------------------
  +
Edit distance: 100<br>
  +
Word error rate (WER): 10.73 %<br>
  +
Number of position-independent correct words: 848<br>
  +
Position-independent word error rate (PER): 9.76 %<br>
  +
<br>
  +
Results when unknown-word marks (stars) are not removed<br>
  +
-------------------------------------------------------
  +
Edit distance: 115<br>
  +
Word Error Rate (WER): 12.34 %<br>
  +
Number of position-independent correct words: 833<br>
  +
Position-independent word error rate (PER): 11.37 %<br>
  +
<br>
  +
Statistics about the translation of unknown words<br>
  +
-------------------------------------------------------
  +
Number of unknown words which were free rides: 15<br>
  +
Percentage of unknown words that were free rides: 26.32 %
  +
</ref>
  +
| style="text-align:center" | 0<ref>Disambiguation is working pretty well using the French prob-file and ad hoc CG rules, so I used the time for other tasks.</ref>
 
|-
 
|-
 
! 10
 
! 10
Line 142: Line 203:
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
 
| style="text-align:center" |
| style="text-align:center" |
 
 
|-
 
|-
 
! 11
 
! 11
Line 185: Line 246:
 
|}
 
|}
   
* The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% were written by Dominique Stich, and the rest were several sociopolitical texts.
 
 
** The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2375 articles, 347,101 words). My impression is that more than 75% of the content is bot-generated, so it is very repetitive and not very representative.
 
   
 
=== See also ===
 
=== See also ===
 
[[Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan | Work plan in the original proposal]]
 
[[Hectoralos/GSOC_2020_proposal:_French-Arpitan#Workplan | Work plan in the original proposal]]
  +
  +
=== Notes ===

Latest revision as of 18:05, 2 August 2020

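{|
! rowspan="2" | Week
! rowspan="2" | Dates
! colspan="4" | Goals
! colspan="6" | Fulfilled
|-
! Bidix<br>(excluding<br>proper names)
! Coverage
! WER
! Testvoc<br>---<br>Manual disamb. of frp texts
! Frp monodix<br>(excl.<br>proper names)
! Bidix<br>(excl.<br>proper names)
! Non-WP<br>coverage<br>(%)[1]
! WP<br>coverage<br>(%)[2]
! WER<br>(%)
! Testvoc<br>(clean %)<br>---<br>Manual disamb.<br>(words)
|-
! 1
| 1 June -<br>7 June || || || || || 7,006 || 1,213 || fra > frp<br>64.5<br>frp > fra<br>74.7 || fra > frp<br>61.3<br>frp > fra<br>47.9 || ||
|-
! 2
| 8 June -<br>14 June || ~1,500 || || || || 13,699 || 6,739 || fra > frp<br>77.6<br>frp > fra<br>82.5 || fra > frp<br>74.1<br>frp > fra<br>57.9 || ||
|-
! 3
| 15 June -<br>21 June || ~3,000 || || || || 14,863 || 8,922 || fra > frp<br>-<br>frp > fra<br>86.3 || fra > frp<br>77.7<br>frp > fra<br>61.0 || ||
|-
! 4
| 22 June -<br>28 June || ~4,500 || || || || 15,805 || 10,747 || fra > frp<br>85.2<br>frp > fra<br>90.7 || fra > frp<br>81.8<br>frp > fra<br>72.0 || ||
|-
! 5
| 29 June -<br>5 July || ~6,000 || fra > frp >80%<br>frp > fra >85% || || || 16,746 || 13,637 || fra > frp<br>87.6<br>frp > fra<br>92.6 || fra > frp<br>84.3<br>frp > fra<br>75.3 || ||
|-
! 6
| 6 July -<br>12 July || ~7,500 || || || || 17,392[3] || 18,419<br>fra > frp<br>14,639<br>frp > fra<br>16,743 || fra > frp<br>92.1<br>frp > fra<br>95.3 || fra > frp<br>89.2<br>frp > fra<br>78.0 || ||
|-
! 7
| 13 July -<br>19 July || ~8,500 || || || || 18,160 || 19,401<br>fra > frp<br>15,552<br>frp > fra<br>17,498 || fra > frp<br>93.2<br>frp > fra<br>95.4 || fra > frp<br>90.2<br>frp > fra<br>78.5 || ||
|-
! 8
| 20 July -<br>26 July || ~9,500 || || || Disamb. of frp texts || 19,411 || 20,844<br>fra > frp<br>16,915<br>frp > fra<br>18,744 || fra > frp<br>94.5<br>frp > fra<br>95.6 || fra > frp<br>91.6<br>frp > fra<br>80.1 || || 0
|-
! 9
| 27 July -<br>3 August || ~10,500 || fra-frp ~89%<br>frp > fra ~92% || fra-frp '''<25%''' || Disamb. of frp texts || 19,813 || 21,465<br>fra > frp<br>17,425<br>frp > fra<br>19,216 || fra > frp<br>94.8<br>frp > fra<br>95.8 || fra > frp<br>91.9<br>frp > fra<br>80.8 || fra > frp<br>5.5[4]<br>frp > fra<br>10.7[5] || 0[6]
|-
! 10
| 4 August -<br>10 August || ~11,500 || || || || || || || || ||
|-
! 11
| 11 August -<br>17 August || ~12,500 || || || Testvoc: closed categories, vblex || || || || || ||
|-
! 12
| 18 August -<br>23 August || ~12,750 || || || Testvoc: adj, adv || || || || || ||
|-
! 13
| 24 August -<br>30 August || ~13,000 || fra > frp ~90.0%<br>frp > fra ~93.0% || fra > frp <20%<br>frp > fra <25% || Testvoc: n || || || || || ||
|}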
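
The coverage columns report the share of corpus tokens that the analyser recognises. A minimal Python sketch of that calculation follows, assuming the analysed text is in the Apertium stream format, where an unknown token appears as `^word/*word$`. The function name, the sample string and its tags are hypothetical, and this is not necessarily the script used to produce the figures in the table.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown token by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Hypothetical sample: three known tokens and one unknown token.
sample = "^lo/lo<det><def><m><sg>$ ^chat/chat<n><m><sg>$ ^drôlo/*drôlo$ ^./.<sent>$"
print(f"{naive_coverage(sample):.1f}")
```

With three of the four sample tokens known, the sketch reports a coverage of 75.0.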


See also

Work plan in the original proposal

Notes

  1. The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% were written by Dominique Stich, and the rest were several sociopolitical texts.
  2. The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2375 articles, 347,101 words). My impression is that more than 75% of the content is bot-generated, so it is very repetitive and not very representative.
  3. Includes a bulk addition of np.ant and np.cog entries (hence the jump in coverage).
  4. The test was conducted using three randomly selected texts (total: ~1000 words): two "good" Wikipedia articles (In My Life and Niger (cheval)) and the most prominent article on lemonde.fr at the time of the test (Barbara Pompili : « C’est le bon moment pour gagner la bataille écologique »). The results are surprisingly good, even though the translator still needs work. There may be several reasons for this: the articles may have been easy; the revision (by a native speaker) may have been too lenient; the closeness of the two languages and the great internal variety of Arpitan may make it harder to judge the translator's choice as wrong than it would be for more distant languages with well-established standards; and, of course, it is a huge advantage to have been able to use an extensive electronic dictionary and to have the help of two great language experts who are very committed to the project.
    Test file: '200727_fra-frp.ini.txt'
    Reference file '200727_fra-frp.fin.txt'

    Statistics about input files

    Number of words in reference: 1180
    Number of words in test: 1180
    Number of unknown words (marked with a star) in test: 78
    Percentage of unknown words: 6.61 %

    Results when removing unknown-word marks (stars)


    Edit distance: 65
    Word error rate (WER): 5.51 %
    Number of position-independent correct words: 1122
    Position-independent word error rate (PER): 4.92 %

    Results when unknown-word marks (stars) are not removed


    Edit distance: 125
    Word Error Rate (WER): 10.59 %
    Number of position-independent correct words: 1062
    Position-independent word error rate (PER): 10.00 %

    Statistics about the translation of unknown words


    Number of unknown words which were free rides: 60
    Percentage of unknown words that were free rides: 76.92 %

  5. I have not (yet) found a way to run a truly random test on texts. I know of no online magazine in Arpitan, and the Arpitan Wikipedia is written following several different norms and, moreover, tends to contain articles that are mere templates. I took three articles (France, Rimbaud and Toquio, ~1000 words) that I had not worked on before, and obtained results that are too good. It is unimaginable that this would happen with real texts, especially those written with a very broad interpretation of the norms in order to adapt them to local speech (what is called "grafia estreta", narrow spelling).
    Test file: '200727_frp-fra.ini.txt'
    Reference file '200727_frp-fra.fin.txt'

    Statistics about input files

    Number of words in reference: 932
    Number of words in test: 939
    Number of unknown words (marked with a star) in test: 57
    Percentage of unknown words: 6.07 %

    Results when removing unknown-word marks (stars)


    Edit distance: 100
    Word error rate (WER): 10.73 %
    Number of position-independent correct words: 848
    Position-independent word error rate (PER): 9.76 %

    Results when unknown-word marks (stars) are not removed


    Edit distance: 115
    Word Error Rate (WER): 12.34 %
    Number of position-independent correct words: 833
    Position-independent word error rate (PER): 11.37 %

    Statistics about the translation of unknown words


    Number of unknown words which were free rides: 15
    Percentage of unknown words that were free rides: 26.32 %

  6. Disambiguation is working pretty well using the French prob-file and ad hoc CG rules, so I used the time for other tasks.
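
The percentages in the evaluation output of notes 4 and 5 can be re-derived from the raw counts. The Python sketch below is an illustration only, not the evaluation tool itself: the helper names are hypothetical, WER is taken as the word-level edit distance divided by the number of reference words, and the PER formula (test words minus position-independent correct words, divided by reference words) is an assumption that happens to reproduce the reported figures.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def percent(numerator: int, denominator: int) -> float:
    """Percentage rounded to two decimals, as in the reported output."""
    return round(100.0 * numerator / denominator, 2)


# Counts reported in note 4 (fra > frp, unknown-word marks removed).
ref_words, test_words, edit_distance, pi_correct = 1180, 1180, 65, 1122
wer = percent(edit_distance, ref_words)            # word error rate
per = percent(test_words - pi_correct, ref_words)  # position-independent error rate
free_rides = percent(60, 78)                       # unknown words left correct as-is
print(wer, per, free_rides)
```

Run on the note 4 counts, this prints 5.51 4.92 76.92, matching the reported WER, PER and free-ride percentage. The note 5 counts (edit distance 100, 932 reference words, 939 test words, 848 correct) give 10.73 and 9.76 in the same way.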