Turkish and Kyrgyz/Kymorph article

From Apertium

Outline

Morphotactica

Morphophonologia

Corpora

  • Which corpora to use?
    • Wikipedia
      1. punktgen.py ky.crp.txt ky.pickle
      2. aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml
    • Azattyk
  • Concerns
    • Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
      • Use aq-wikicrp; this way it is reproducible.

Numbers

Size of corpora

                 Wikipedia   Azattyk
num. words       271005
XML file size    >3.8 MB
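
A minimal sketch of how figures like the ones above could be recomputed, assuming the corpus is available as a plain-text UTF-8 file and that a "word" is any run of Unicode word characters. The function name corpus_stats and the tokenization rule are illustrative assumptions, not the method that produced the numbers above.

```python
import os
import re

# Assumption: a "word" is a maximal run of Unicode word characters.
# In Python 3, \w matches Cyrillic letters such as those used in Kyrgyz.
WORD_RE = re.compile(r"\w+")


def corpus_stats(path, encoding="utf-8"):
    """Return (word_count, size_in_bytes) for a plain-text corpus file.

    The file is read line by line, so large corpora do not need to
    fit in memory.
    """
    words = 0
    with open(path, encoding=encoding) as corpus:
        for line in corpus:
            words += len(WORD_RE.findall(line))
    return words, os.path.getsize(path)
```

Calling corpus_stats("ky.crp.txt") would return the word count and the file size in bytes. A different tokenization rule (for example, splitting on whitespace only) will give a different count, so the rule should be recorded next to the numbers.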