Difference between revisions of "Turkish and Kyrgyz/Kymorph article"
		
		
		
		
		
		
| Firespeaker (talk \| contribs) | Firespeaker (talk \| contribs) |
|---|---|
| Line 8: | Line 8: |
| * Which corpora to use? | * Which corpora to use? |
| ** Wikipedia | ** Wikipedia |
| | **# [http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/ punktgen.py] ky.crp.txt ky.pickle |
| | **# aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml |
| ** [[Turkish and Kyrgyz/Making a corpus from azattyk\|Azattyk]] | ** [[Turkish and Kyrgyz/Making a corpus from azattyk\|Azattyk]] |
| * concerns | * concerns |
Revision as of 07:36, 5 October 2011
Outline
Morphotactica
Morphophonologia
Corpora
- Which corpora to use?
  - Wikipedia
    1. punktgen.py ky.crp.txt ky.pickle
    2. aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml
  - Azattyk
- concerns
  - Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
    - Use aq-wikicrp; this way it is reproducible.

 
Numbers
| | wikipedia | azattyk |
|---|---|---|
| num words | 271005 | |
| xml file size | >3.8MB | |
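
The page does not say how the "num words" figure was obtained. As a rough illustration only, the following sketch counts word tokens and distinct word forms in a plain-text corpus. The tokenization rule (runs of Unicode letters, lowercased) and the helper names `corpus_stats` and `file_stats` are assumptions made for this example, not the method used for the table above, so the numbers it gives may differ from 271005.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters.
# Digits, underscores and punctuation act as separators.
WORD_RE = re.compile(r"[^\W\d_]+")


def corpus_stats(text):
    """Return (number of tokens, number of distinct lowercased forms)."""
    tokens = [t.lower() for t in WORD_RE.findall(text)]
    return len(tokens), len(Counter(tokens))


def file_stats(path, encoding="utf-8"):
    """Same as corpus_stats, but reads the text from a file (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return corpus_stats(f.read())


if __name__ == "__main__":
    # Tiny made-up sample; a real run would pass a plain-text corpus file.
    sample = "Бишкек шаары. Бишкек Кыргызстандын борбору."
    print(corpus_stats(sample))
```

Note that this works on plain text; an XML corpus such as kywp.xml would first need its text content extracted, which is not shown here.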