Hectoralos/GSOC 2020 work plan control
Revision as of 15:18, 4 July 2020 by Hectoralos
Week | Dates | Goal: Bidix (excl. proper names) | Goal: Coverage | Goal: WER | Goal: Testvoc / Manual disamb. of frp texts | Fulfilled: Frp monodix (excl. proper names) | Fulfilled: Bidix (excl. proper names) | Fulfilled: Non-WP coverage (%) | Fulfilled: WP coverage (%) | Fulfilled: WER (%) | Fulfilled: Testvoc (clean %) / Manual disamb. (words) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | 16,746 | 13,637 | fra > frp 87.6; frp > fra 92.6 | fra > frp 84.3; frp > fra 75.3 | | |
6 | 6 July - 12 July | ~7,500 | | | | | | | | | |
7 | 13 July - 19 July | ~8,500 | | | | | | | | | |
8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | | | | | | |
9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | | |
10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
- (*) The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words; 80% of them were written by Dominique Stich, and the rest come from several sociopolitical texts.
- (**) The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 articles for each day of the year (2,375 articles, 347,101 words). My impression is that 75% or more of the content was generated by bots, so it is very repetitive and not very representative.
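
As a rough illustration of how the bidix figures in the table compare with the plan, here is a minimal Python sketch. It assumes that the "~" values are the weekly bidix goals and that the second count in each fulfilled row is the bidix size; the dictionary and function names are hypothetical and are not part of any Apertium tool.

```python
# Bidix figures (entries, excluding proper names) copied from the table above.
# Assumption: "~" values are weekly goals; the other counts are the reported bidix sizes.
BIDIX_GOAL = {2: 1500, 3: 3000, 4: 4500, 5: 6000}
BIDIX_FULFILLED = {2: 6739, 3: 8922, 4: 10747, 5: 13637}


def bidix_progress(goals, fulfilled):
    """Return {week: (surplus over goal, fulfilled/goal ratio)} for weeks present in both."""
    report = {}
    for week in sorted(goals.keys() & fulfilled.keys()):
        surplus = fulfilled[week] - goals[week]
        ratio = fulfilled[week] / goals[week]
        report[week] = (surplus, round(ratio, 2))
    return report


if __name__ == "__main__":
    for week, (surplus, ratio) in bidix_progress(BIDIX_GOAL, BIDIX_FULFILLED).items():
        print(f"week {week}: {surplus:+,} entries vs goal ({ratio}x)")
```

On these figures the week 5 count (13,637) is already above the ~13,000 target set for week 13.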