Difference between revisions of "Hectoralos/GSOC 2020 work plan control"
		
		
		
		
		
		
Revision as of 20:12, 7 June 2020
| Week | Dates | Goal: bidix (excl. proper names) | Goal: coverage | Goal: WER | Goal: testvoc / manual disamb. of frp texts | Fulfilled: frp monodix (excl. proper names) | Fulfilled: bidix (excl. proper names) | Fulfilled: non-WP coverage | Fulfilled: WP coverage | Fulfilled: WER | Fulfilled: testvoc (clean %) / manual disamb. (words) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 June - 7 June | | | | | 7,006 | 1,213 | fra > frp 64.5%; frp > fra* 74.7% | fra > frp 61.3%; frp > fra** 47.9% | | |
| 2 | 8 June - 14 June | ~1,500 | | | | | | | | | |
| 3 | 15 June - 21 June | ~3,000 | | | | | | | | | |
| 4 | 22 June - 28 June | ~4,500 | | | | | | | | | |
| 5 | 29 June - 5 July | ~6,000 | fra > frp >80%; frp > fra >85% | | | | | | | | |
| 6 | 6 July - 12 July | ~7,500 | | | | | | | | | |
| 7 | 13 July - 19 July | ~8,500 | | | | | | | | | |
| 8 | 20 July - 26 July | ~9,500 | | | Disamb. of frp texts | | | | | | |
| 9 | 27 July - 3 August | ~10,500 | fra > frp ~89%; frp > fra ~92% | fra > frp <25% | Disamb. of frp texts | | | | | | |
| 10 | 4 August - 10 August | ~11,500 | | | | | | | | | |
| 11 | 11 August - 17 August | ~12,500 | | | Testvoc: closed categories, vblex | | | | | | |
| 12 | 18 August - 23 August | ~12,750 | | | Testvoc: adj, adv | | | | | | |
| 13 | 24 August - 30 August | ~13,000 | fra > frp ~90.0%; frp > fra ~93.0% | fra > frp <20%; frp > fra <25% | Testvoc: n | | | | | | |
- (*) The Arpitan non-Wikipedia corpus consists of a few texts in ORB. At the beginning of the project it had 81,670 words: 80% were written by Dominique Stich, and the rest came from several sociopolitical texts.
- (**) The Arpitan Wikipedia corpus contains all the articles written in ORB, except the 366 day-of-the-year articles (2,375 articles, 347,101 words). My impression is that more than 75% of the content was generated by bots, so it is very repetitive and not very representative.
 

