RFERL corpora
Latest revision as of 18:41, 5 December 2013
Radio Free Europe / Radio Liberty (RFE/RL) runs news services in a number of Central Asian languages. The content is essentially free for public use with attribution. (A link to the usage terms still needs to be added here.)
We worked out how a corpus can be built from their website (see the page "Turkish and Kyrgyz/Making a corpus from azattyk"), and have instructions for writing a scraper in the framework we developed for it (see "writing a scraper"). Currently we have corpora for Kazakh and Kyrgyz only, covering just two years of articles (2009 and 2010).
Kyrgyz
- Site: azattyk.org
- Coverage with: kymorph
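The page does not say how the coverage figures were measured. A minimal sketch of one common convention ("naive coverage", the share of tokens that receive at least one analysis) is given here, assuming the analyser output is in Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token as `^surface/*surface$`. The function name `naive_coverage`, the sample words, and the tags in the sample are illustrative assumptions, not actual kymorph output.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: escaped characters inside units are not handled in this sketch)
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that have at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown token is emitted as ^word/*word$
        if not analyses.startswith("*"):
            known += 1
    if total == 0:
        return 0.0
    return 100.0 * known / total


# Hypothetical analyser output: three analysed tokens and one unknown token.
sample = (
    "^бир/бир<num>$ ^китеп/китеп<n><nom>$ "
    "^фывап/*фывап$ ^бар/бар<v><iv><imp>/бар<adj>$"
)
print(f"{naive_coverage(sample):.1f}%")  # 75.0%
```

A token with several analyses still counts once, so this measures only whether the analyser recognises a form, not whether any of the analyses is correct.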
2009
- Number of stems: 4.1M
- Coverage: ~87.4%
2010
- Number of stems: 3.4M
- Coverage: ~88%
Kazakh
- Site: azattyq.org
2009
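- Number of stems: value missing (the page refers to RFERL corpus/kk/2009/stems, which did not resolve to a number)
- Coverage: value missing (the page refers to Kazmorph/coverage/rferl2009, which did not resolve to a percentage)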
2010
- Number of stems: 3.2M
- Coverage: ~85.4%