Difference between revisions of "Turkish and Kyrgyz/Kymorph article"

From Apertium
(Undo revision 28838 by Firespeaker (Talk))
 
(13 intermediate revisions by 2 users not shown)
Line 1:
 
== Outline ==
  +
  +
== General background ==
  +
* Submitting abstract to: [http://www.lrec-conf.org/lrec2012/ LREC 2012 Istanbul]
  +
* Deadline: October 15, 2011
  +
  +
== Similar articles ==
  +
* [http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali]
  +
* [http://www.let.rug.nl/coltekin/papers/coltekin-lrec2010.pdf Çağrı Çöltekin - A Freely Available Morphological Analyzer for Turkish]
   
 
== Morphotactica ==
  +
* Irregular negatives of many verb forms
   
 
== Morphophonologia ==
  +
* /рн/ nouns
   
 
== Corpora ==
 
* Which corpora to use?
 
** Wikipedia
  +
**# [http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/ punktgen.py] ky.crp.txt ky.pickle
  +
**# aq-wikicrp -x -t ky.pickle [http://download.wikimedia.org/kywiki/20110923/kywiki-20110923-pages-articles.xml.bz2 kywiki-20110923-pages-articles.xml] kywp.xml
 
** [[Turkish and Kyrgyz/Making a corpus from azattyk|Azattyk]]
 
* concerns
 
** Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
  +
*** Use aq-wikicrp; this way the process is reproducible.
   
 
== Numbers ==
  +
{|class="wikitable"
  +
|+ size of corpora
  +
|-
  +
|
  +
! wikipedia
  +
! azattyk 2010
  +
! all azattyk
  +
|-
  +
! num articles
  +
| 1531 ([http://ky.wikipedia.org/wiki/Special:Statistics ?], [http://dumps.wikimedia.org/kywiki/20110923/ ?])
  +
| 9803 (6627?)
  +
|
  +
|-
  +
! num words
  +
| 271005
  +
| 3394686
  +
|
  +
|-
  +
! xml file size
  +
| 3.8MB
  +
| 49MB
  +
|
  +
|}

Latest revision as of 16:50, 13 October 2011

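Outline

General background

  • Submitting abstract to: LREC 2012 Istanbul (http://www.lrec-conf.org/lrec2012/)
  • Deadline: October 15, 2011

Similar articles

  • Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali (http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf)
  • Çağrı Çöltekin - A Freely Available Morphological Analyzer for Turkish (http://www.let.rug.nl/coltekin/papers/coltekin-lrec2010.pdf)

Morphotactica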

  • Irregular negatives of many verb forms

Morphophonologia

  • /рн/ nouns

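Corpora

  • Which corpora to use?
    • Wikipedia
      1. punktgen.py ky.crp.txt ky.pickle
      2. aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml
    • Azattyk
  • concerns
    • Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
      • Use aq-wikicrp; this way the process is reproducible.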
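
To make the reproducibility point concrete, the two Wikipedia commands from this outline (punktgen.py, then aq-wikicrp) could be wrapped in a small script. This is only a sketch: it assumes both tools are executable and on the PATH, and it copies the arguments exactly as the outline lists them without interpreting them. The function names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(dump_xml, raw_text="ky.crp.txt", model="ky.pickle",
                   out_xml="kywp.xml"):
    """Return the two command lines from the outline as argument lists.

    The default file names are the ones used in the outline; what each
    argument means to the tools is not documented here and is not assumed.
    """
    return [
        # Step 1: punktgen.py invocation, as listed in the outline.
        ["punktgen.py", raw_text, model],
        # Step 2: aq-wikicrp invocation, as listed in the outline.
        ["aq-wikicrp", "-x", "-t", model, dump_xml, out_xml],
    ]


def run_pipeline(dump_xml, **kwargs):
    """Run both steps in order; check=True stops at the first failure."""
    for cmd in build_commands(dump_xml, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("kywiki-20110923-pages-articles.xml")` would then run the same steps each time, provided the dump has already been downloaded and decompressed.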

Numbers

{|class="wikitable"
|+ size of corpora
|-
|
! wikipedia
! azattyk 2010
! all azattyk
|-
! num articles
| 1531 (?, ?)
| 9803 (6627?)
|
|-
! num words
| 271005
| 3394686
|
|-
! xml file size
| 3.8MB
| 49MB
|
|}
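
The outline does not say how these figures were obtained. As one possible way to recompute them, the sketch below counts articles, words and file size for a corpus XML file. It rests on several assumptions: the element name `article` is hypothetical (the real schema of the corpus files is not given here), a word is taken to be any whitespace-delimited token, and 1 MB is taken as 1,000,000 bytes.

```python
import os
import xml.etree.ElementTree as ET


def corpus_stats(xml_path, article_tag="article"):
    """Return (num_articles, num_words, size_mb) for a corpus XML file.

    article_tag is a hypothetical element name; adjust it to the real schema.
    """
    num_articles = 0
    num_words = 0
    # iterparse streams the file instead of building the whole tree first.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == article_tag:
            num_articles += 1
            # Collect all text inside the article, including nested elements.
            text = "".join(elem.itertext())
            # Assumption: a word is a maximal run of non-whitespace characters.
            num_words += len(text.split())
            elem.clear()  # release the processed article's contents
    # Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
    size_mb = os.path.getsize(xml_path) / 1_000_000
    return num_articles, num_words, size_mb
```

A different tokenisation (for example, splitting off punctuation) would give different word counts, so the method used should be stated alongside any figures reported.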