Difference between revisions of "Turkish and Kyrgyz/Kymorph article"

From Apertium
Line 19:
  ** Wikipedia
  **# [http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/ punktgen.py] ky.crp.txt ky.pickle
- **# aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml
+ **# aq-wikicrp -x -t ky.pickle [http://download.wikimedia.org/kywiki/20110923/kywiki-20110923-pages-articles.xml.bz2 kywiki-20110923-pages-articles.xml] kywp.xml
  ** [[Turkish and Kyrgyz/Making a corpus from azattyk|Azattyk]]
  * concerns

Revision as of 18:21, 6 October 2011

Outline

General background

Similar articles

Morphotactica

  • Irregular negatives of many verb forms

Morphophonologia

  • /рн/ nouns

Corpora

  • Which corpora to use?
  • concerns
    • Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
      • Use aq-wikicrp so that the cleaning step is reproducible.

Numbers

size of corpora

              | wikipedia | azattyk
num words     | 271005    |
xml file size | >3.8MB    |
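
The figures above could be recomputed with a short script. The following is only a sketch: the page does not say how the word count was obtained, so the tokenization rule (runs of Unicode word characters) and the function name `corpus_stats` are assumptions, and the actual layout of kywp.xml is not known here, so the sketch simply counts words in all text nodes of an XML string.

```python
import re
import xml.etree.ElementTree as ET


def corpus_stats(xml_text: str) -> dict:
    """Return a word count and UTF-8 size for an XML corpus held in a string.

    Hypothetical helper: the counting rule is an assumption, not
    necessarily the method used for the numbers on this page.
    """
    root = ET.fromstring(xml_text)
    # Concatenate every text node, ignoring the markup itself.
    text = " ".join(root.itertext())
    # Assumed definition of a word: a maximal run of word characters
    # (\w is Unicode-aware for str patterns, so Cyrillic letters match).
    words = re.findall(r"\w+", text)
    return {
        "num_words": len(words),
        "xml_bytes": len(xml_text.encode("utf-8")),
    }


# Tiny made-up sample, not taken from any real corpus.
sample = "<corpus><s>Бул биринчи сүйлөм.</s><s>Бул экинчи.</s></corpus>"
stats = corpus_stats(sample)
print(stats)
```

For a real corpus file the text would be read from disk first. Punctuation-only tokens are excluded by this rule, which may or may not match how the figure of 271005 was counted.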