Difference between revisions of "Turkish and Kyrgyz/Kymorph article"
Firespeaker (minor edit, section: Corpora)
Revision as of 18:21, 6 October 2011
Outline
General background
- Submitting abstract to: LREC 2012 Istanbul
- Deadline: October 15, 2011
Similar articles
- Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali
- Çağrı Çöltekin - A Freely Available Morphological Analyzer for Turkish
Morphotactica
- Irregular negatives of many verb forms
Morphophonologia
- /рн/ nouns
Corpora
- Which corpora to use?
- Wikipedia
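- punktgen.py ky.crp.txt ky.pickle (script: http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/)
- aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml (dump: http://download.wikimedia.org/kywiki/20110923/kywiki-20110923-pages-articles.xml.bz2)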
- Azattyk
- concerns
- Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
- Use aq-wikicrp; this way the process is reproducible.
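To make the "automated cleaning" option concrete, here is a minimal sketch of a regex-based markup stripper. It is not how aq-wikicrp works (its internals are not described here); the function name `strip_wiki_markup` and the set of rules are assumptions chosen for illustration, and nested templates or tables are not handled.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Hypothetical helper: remove common MediaWiki markup from one string."""
    # Drop self-closing references first, then paired <ref>...</ref> blocks.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop any remaining HTML-like tags, keeping their inner text.
    text = re.sub(r"<[^>]+>", "", text)
    # Drop non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] becomes label; [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # [http://example.org label] becomes label.
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)
    # Remove bold and italic quote markers.
    text = re.sub(r"'{2,}", "", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()
```

Because the rules are fixed and ordered, rerunning the function on the same dump gives the same output, which is the property the reproducibility concern asks for.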
Numbers
| | wikipedia | azattyk |
|---|---|---|
| num words | 271005 | |
| xml file size | >3.8MB | |
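The table does not say how "num words" was counted. As a minimal sketch, assuming a word is any whitespace-separated token in a UTF-8 text file, figures of this kind could be gathered as follows; the function name `corpus_stats` and the returned keys are hypothetical.

```python
import os


def corpus_stats(path: str) -> dict:
    """Count whitespace-separated tokens and report the file size in bytes."""
    num_words = 0
    # Read line by line so large corpora do not need to fit in memory.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            num_words += len(line.split())
    return {"num_words": num_words, "size_bytes": os.path.getsize(path)}
```

Running the same function over both the Wikipedia and the Azattyk corpus would keep the two columns comparable, since both would use the same tokenization assumption.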