Difference between revisions of "Turkish and Kyrgyz/Kymorph article"
Firespeaker (talk | contribs)
Firespeaker (talk | contribs) (Undo revision 28838 by Firespeaker (Talk))
(11 intermediate revisions by the same user not shown)
Line 1: | Line 1:
  == Outline ==
+ == General background ==
+ * Submitting abstract to: [http://www.lrec-conf.org/lrec2012/ LREC 2012 Istanbul]
+ * Deadline: October 15, 2011
+ == Similar articles ==
+ * [http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali]
+ * [http://www.let.rug.nl/coltekin/papers/coltekin-lrec2010.pdf Çagrı Çöltekin - A Freely Available Morphological Analyzer for Turkish]
  == Morphotactica ==
+ * Irregular negatives of many verb forms
  == Morphophonologia ==
+ * /рн/ nouns
  == Corpora ==
  * Which corpora to use?
  ** Wikipedia
+ **# [http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/ punktgen.py] ky.crp.txt ky.pickle
+ **# aq-wikicrp -x -t ky.pickle [http://download.wikimedia.org/kywiki/20110923/kywiki-20110923-pages-articles.xml.bz2 kywiki-20110923-pages-articles.xml] kywp.xml
  ** [[Turkish and Kyrgyz/Making a corpus from azattyk|Azattyk]]
  * concerns
Line 19: | Line 31:
  |
  ! wikipedia
- ! azattyk
+ ! azattyk 2010
+ ! all azattyk
+ |-
+ ! num articles
+ | 1531([http://ky.wikipedia.org/wiki/Special:Statistics ?], [http://dumps.wikimedia.org/kywiki/20110923/ ?])
+ | 9803 (6627?)
+ |
  |-
  ! num words
  | 271005
+ | 3394686
  |
  |-
  ! xml file size
- |
+ | 3.8MB
+ | 49MB
  |
  |}
Latest revision as of 16:50, 13 October 2011
Outline
General background
- Submitting abstract to: LREC 2012 Istanbul
- Deadline: October 15, 2011
 
Similar articles
- Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali
- Çagrı Çöltekin - A Freely Available Morphological Analyzer for Turkish
 
Morphotactica
- Irregular negatives of many verb forms
 
Morphophonologia
- /рн/ nouns
 
Corpora
- Which corpora to use?
  - Wikipedia
    1. punktgen.py ky.crp.txt ky.pickle
    2. aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml
  - Azattyk
- concerns
  - Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
    - Use aq-wikicrp; this way it is reproducible.
 
Numbers
| | wikipedia | azattyk 2010 | all azattyk |
|---|---|---|---|
| num articles | 1531 (?, ?) | 9803 (6627?) | |
| num words | 271005 | 3394686 | |
| xml file size | 3.8MB | 49MB | |
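
To keep the figures in the table reproducible, the word counts could be recomputed with a small script. The following is only a sketch under stated assumptions: the page does not say how "num words" was counted, so counting whitespace-delimited tokens that contain at least one letter or digit is a guess, and the helper names `count_words` and `words_per_article` are hypothetical. The article count for azattyk 2010 is itself marked as uncertain (9803 or 6627?), so the derived ratio for that column is tentative.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens containing at least one letter or digit.

    Assumption: this is one plausible definition of "word"; the counts in the
    table may have been produced with a different tokenisation.
    """
    return sum(1 for token in text.split() if re.search(r"\w", token))


def words_per_article(num_words: int, num_articles: int) -> float:
    """Average number of words per article."""
    return num_words / num_articles


# Figures copied from the table: (num words, num articles).
figures = {
    "wikipedia": (271005, 1531),
    "azattyk 2010": (3394686, 9803),  # article count is uncertain in the table
}

for name, (words, articles) in figures.items():
    print(f"{name}: {words_per_article(words, articles):.1f} words per article")
```

Running `count_words` over the text extracted from kywp.xml and from the Azattyk corpus would give numbers directly comparable to the "num words" row, provided the same definition of a word is used for both corpora.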