Difference between revisions of "Turkish and Kyrgyz/Kymorph article"
Latest revision as of 16:50, 13 October 2011
Outline
General background
- Submitting abstract to: LREC 2012 Istanbul (http://www.lrec-conf.org/lrec2012/)
- Deadline: October 15, 2011
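Similar articles
- Abu Zaher Md. Faridee & Francis M. Tyers - Development of a morphological analyser for Bengali (http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf)
- Çağrı Çöltekin - A Freely Available Morphological Analyzer for Turkish (http://www.let.rug.nl/coltekin/papers/coltekin-lrec2010.pdf)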
Morphotactica
- Irregular negatives of many verb forms
Morphophonologia
- /рн/ nouns
Corpora
- Which corpora to use?
  - Wikipedia
    1. punktgen.py ky.crp.txt ky.pickle (script: http://paste.pocoo.org/show/2KXepqcTTiWDWLOFPRlR/)
    2. aq-wikicrp -x -t ky.pickle kywiki-20110923-pages-articles.xml kywp.xml (dump: http://download.wikimedia.org/kywiki/20110923/kywiki-20110923-pages-articles.xml.bz2)
  - Azattyk (see Turkish and Kyrgyz/Making a corpus from azattyk)
- Concerns
  - Wikipedia is messy; should we have an automated cleaning process or get stats as-is?
    - Use aq-wikicrp so that the cleaning is reproducible.
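The "clean first or count as-is" question can be made concrete with a small sketch. This is not aq-wikicrp and does not reproduce its rules; the regular expressions and the function names `clean_wikitext` and `count_words` are assumptions made for illustration. It only shows how much raw markup can inflate a word count compared with cleaned text.

```python
import re

# Hypothetical cleaning rules for illustration only; aq-wikicrp's actual rules may differ.
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                       # innermost {{template}}
_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")      # [[target|label]] -> label
_EXT_LINK = re.compile(r"\[https?://\S+(?:\s+([^\]]*))?\]")     # [url label] -> label
_TAG = re.compile(r"<[^>]+>")                                   # <ref>, <br/>, ...
_WORD = re.compile(r"\w+")                                      # Unicode-aware in Python 3


def clean_wikitext(text):
    """Strip a few common wiki markup constructs from a string."""
    previous = None
    while previous != text:          # repeat so nested templates are peeled off
        previous = text
        text = _TEMPLATE.sub("", text)
    text = _LINK.sub(r"\1", text)
    text = _EXT_LINK.sub(lambda m: m.group(1) or "", text)
    text = _TAG.sub("", text)
    return text.replace("'''", "").replace("''", "")


def count_words(text):
    """Count runs of word characters (Cyrillic included)."""
    return len(_WORD.findall(text))


sample = "'''Бишкек''' [[Кыргызстан|Кыргызстандын]] борбору.{{stub}}"
print(count_words(sample), count_words(clean_wikitext(sample)))
```

On this one-line sample the raw count includes the link target and the template name, so any statistics taken "as-is" would need to state that they include markup tokens.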
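Numbers
Size of corpora

| | wikipedia | azattyk 2010 | all azattyk |
|---|---|---|---|
| num articles | 1531 (?, ?) | 9803 (6627?) | |
| num words | 271005 | 3394686 | |
| xml file size | 3.8 MB | 49 MB | |

The two question marks after 1531 link to http://ky.wikipedia.org/wiki/Special:Statistics and http://dumps.wikimedia.org/kywiki/20110923/ respectively.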
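For comparing the two corpora, a few derived figures may be more useful than the raw counts. The sketch below uses only the numbers from the table; the dictionary layout and the helper name `words_per_article` are illustrative assumptions. Note that the azattyk 2010 article count is marked as uncertain in the table (9803 vs. 6627), so the per-article figure for that corpus depends on which count turns out to be right.

```python
# Figures copied from the table above; the "all azattyk" column is still empty.
corpora = {
    "wikipedia": {"articles": 1531, "words": 271005},
    "azattyk 2010": {"articles": 9803, "words": 3394686},
}


def words_per_article(corpus):
    """Average number of words per article for one corpus entry."""
    return corpus["words"] / corpus["articles"]


for name, corpus in corpora.items():
    print(f"{name}: {words_per_article(corpus):.1f} words per article")

# How many times larger the azattyk 2010 corpus is, measured in words.
size_ratio = corpora["azattyk 2010"]["words"] / corpora["wikipedia"]["words"]
print(f"azattyk 2010 / wikipedia: {size_ratio:.1f}x")
```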