Hectoralos/GSOC 2020 work plan control

From Apertium
Revision as of 20:14, 19 July 2020 by Hectoralos (talk | contribs)
Goals

| Week | Dates | Bidix (excl. proper names) | Coverage | WER | Testvoc / Manual disamb. of frp texts |
|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | |
| 2 | 8 June - 14 June | ~1,500 | | | |
| 3 | 15 June - 21 June | ~3,000 | | | |
| 4 | 22 June - 28 June | ~4,500 | | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | |
| 6 | 6 July - 12 July | ~7,500 | | | |
| 7 | 13 July - 19 July | ~8,500 | | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts |
| 10 | 4 August - 10 August | ~11,500 | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n |

Fulfilled

| Week | Dates | Frp monodix (excl. proper names) | Bidix (excl. proper names) | Non-WP coverage (%)[1] | WP coverage (%)[2] | WER (%) | Testvoc (clean %) / Manual disamb. (words) |
|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | 7,006 | 1,213 | fra > frp: 64.5; frp > fra*: 74.7 | fra > frp: 61.3; frp > fra**: 47.9 | | |
| 2 | 8 June - 14 June | 13,699 | 6,739 | fra > frp: 77.6; frp > fra: 82.5 | fra > frp: 74.1; frp > fra: 57.9 | | |
| 3 | 15 June - 21 June | 14,863 | 8,922 | fra > frp: -; frp > fra: 86.3 | fra > frp: 77.7; frp > fra: 61.0 | | |
| 4 | 22 June - 28 June | 15,805 | 10,747 | fra > frp: 85.2; frp > fra: 90.7 | fra > frp: 81.8; frp > fra: 72.0 | | |
| 5 | 29 June - 5 July | 16,746 | 13,637 | fra > frp: 87.6; frp > fra: 92.6 | fra > frp: 84.3; frp > fra: 75.3 | | |
| 6 | 6 July - 12 July | 17,392[3] | 18,419 (fra > frp: 14,639; frp > fra: 16,743) | fra > frp: 92.1; frp > fra: 95.3 | fra > frp: 89.2; frp > fra: 78.0 | | |
| 7 | 13 July - 19 July | 18,160 | 19,401 (fra > frp: 15,552; frp > fra: 17,498) | fra > frp: 93.2; frp > fra: 95.4 | fra > frp: 90.2; frp > fra: 78.5 | | |
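The WER targets in the plan are word error rates. As a rough illustration of what such a figure measures, here is a minimal sketch that computes a word-level edit distance between a machine translation and a reference, divided by the reference length. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example; it is not the evaluation tool used for the project.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Return the word error rate (in %) of hypothesis against reference.

    Assumption: tokens are separated by whitespace; no normalisation of
    case or punctuation is applied.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Under this definition, a goal such as "<25%" means that fewer than one word in four of the reference would need to be inserted, deleted, or substituted to correct the output.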


See also

Work plan in the original proposal

Notes

  1. The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words; 80% of it was written by Dominique Stich, and the rest consists of several sociopolitical texts.
  2. The Arpitan Wikipedia corpus contains all the articles written in ORB except the 366 day-of-the-year articles (2,375 articles, 347,101 words). My impression is that 75% or more of the content was generated by bots, so it is very repetitive and not very representative.
  3. In addition, a large number of np.ant and np.cog entries were added (hence the jump in coverage).
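The coverage figures refer to the share of corpus tokens that the analyser recognises. A minimal sketch of that calculation follows, assuming the corpus has already been run through the morphological analyser (for example with `lt-proc`) and is in the Apertium stream format, where an unknown token appears as `^word/*word$`. The function name `naive_coverage` is hypothetical, and the sketch ignores escaped `^`, `$` and `/` characters.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: no escaped ^, $ or / characters occur inside the units.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def is_unknown(unit: str) -> bool:
    """True if the first analysis of the unit starts with '*' (unknown word)."""
    parts = unit.split("/", 1)
    return len(parts) == 2 and parts[1].startswith("*")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for unit in units if not is_unknown(unit))
    return round(100.0 * known / len(units), 1)


if __name__ == "__main__":
    # Invented sample for illustration only, not taken from the corpora.
    sample = "^lo/lo<det><def><m><sg>$ ^chat/*chat$ ^grand/grand<adj><m><sg>$"
    print(naive_coverage(sample))
```

Because every token counts equally, a corpus with many repeated bot-generated sentences or many proper names can move this figure sharply, which is consistent with the remarks in notes 2 and 3.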