Hectoralos/GSOC 2020 work plan control
Revision as of 20:47, 27 June 2020 by Hectoralos
| Week | Dates | Goal: Bidix (excl. proper names) | Goal: Coverage | Goal: WER | Goal: Testvoc / Manual disamb. of frp texts | Fulfilled: Frp monodix (excl. proper names) | Fulfilled: Bidix (excl. proper names) | Fulfilled: Non-WP coverage (%) | Fulfilled: WP coverage (%) | Fulfilled: WER (%) | Fulfilled: Testvoc (clean %) / Manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5; frp > fra* 74.7 | fra > frp 61.3; frp > fra** 47.9 | | |
| 2 | 8 June - 14 June | ~1,500 | | | | 13,699 | 6,739 | fra > frp 77.6; frp > fra 82.5 | fra > frp 74.1; frp > fra 57.9 | | |
| 3 | 15 June - 21 June | ~3,000 | | | | 14,863 | 8,922 | fra > frp -; frp > fra 86.3 | fra > frp 77.7; frp > fra 61.0 | | |
| 4 | 22 June - 28 June | ~4,500 | | | | 15,805 | 10,747 | fra > frp 85.2; frp > fra 90.7 | fra > frp 81.8; frp > fra 72.0 | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | | | | | | |
| 6 | 6 July - 12 July | ~7,500 | | | | | | | | | |
| 7 | 13 July - 19 July | ~8,500 | | | | | | | | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | | | | | | |
| 9 | 27 July - 3 August | ~10,500 | fra-frp ~89%; frp > fra ~92% | fra-frp <25% | Disamb. of frp texts | | | | | | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
- The Arpitan non-Wikipedia corpus contains a few texts in ORB. At the beginning of the project it had 81,670 words. 80% of it was written by Dominique Stich; the rest consists of several sociopolitical texts.
- The Arpitan Wikipedia corpus contains all the articles written in ORB except the 366 day-of-the-year articles (2,375 articles, 347,101 words). My impression is that 75% or more of the content was generated by bots, so it is very repetitive and not very representative.
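
This page does not say how the coverage percentages in the table were computed. As a rough illustration only, naive coverage is often taken to be the share of corpus tokens that receive at least one analysis. The sketch below assumes the corpus has already been run through the analyser and is in Apertium stream format, where an unknown word appears as `^word/*word$`. The function name `naive_coverage` and the sample tokens are hypothetical placeholders, not real frp data.

```python
import re

# Assumption: input is in Apertium stream format, where each lexical unit
# looks like ^surface/analysis1/analysis2$ and an unknown word is emitted
# as ^surface/*surface$.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # A unit counts as known when its analysis part does not start with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Placeholder tokens (not real frp words): two known, two unknown.
sample = "^alpha/alpha<n><sg>$ ^beta/*beta$ ^gamma/gamma<adj>$ ^delta/*delta$"
print(f"{naive_coverage(sample):.1f}")  # prints 50.0
```

A figure computed this way depends heavily on the corpus, which is consistent with the table reporting the Wikipedia and non-Wikipedia corpora separately.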
 

