RFERL corpora

Radio Free Europe / Radio Liberty runs news services in a number of Central Asian languages. The information is essentially free for public use with attribution.

(TODO: add a link to RFE/RL's usage terms.)

A corpus can be built by scraping their website, and we have instructions for writing a scraper in the framework we developed for that purpose. We currently have corpora for Kazakh and Kyrgyz, covering only a couple of years' worth of articles (2009 and 2010).

Kyrgyz

2009

  • Number of stems: 4.1M
  • Coverage: ~87.4%

2010

  • Number of stems: 3.4M
  • Coverage: ~88%

Kazakh

2009

2010

  • Number of stems: 3.2M
  • Coverage: ~85.4%
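
Note on coverage: this page does not say how the coverage figures were computed. As a rough illustration only, the sketch below computes a naive coverage figure from morphological analyser output in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token has a single analysis beginning with an asterisk. The function name naive_coverage and the sample string are hypothetical, and the sketch ignores escaped characters and multiword details.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): no escaped '$' or '/' inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # parts[0] is the surface form; an unknown word looks like ^word/*word$
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


if __name__ == "__main__":
    # Made-up sample: two tokens, one analysed and one unknown.
    sample = "^a/a<n>$ ^b/*b$"
    print(naive_coverage(sample))
```

In practice the input would be the analyser's output for the whole corpus rather than a short string; the figures listed above should not be assumed to come from this exact procedure.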