WikiExtractor
WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.
You can find the script here:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
You can find Wikipedia dumps here:
https://dumps.wikimedia.org/backup-index.html
Navigate to the page of the language you are interested in; there will be a link called <language code>wiki. You want the file that ends in -pages-articles.xml.bz2, e.g. euwiki-20160701-pages-articles.xml.bz2.
You'll need to download the file; you can use wget, curl, or a similar tool:
$ wget <url>
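For example, for the Basque dump mentioned above, the full URL typically follows the pattern <language code>wiki/<date>/<filename>; the exact date directory depends on which dump you picked, so check the download page for the current one. A sketch of what the command might look like:

$ wget https://dumps.wikimedia.org/euwiki/20160701/euwiki-20160701-pages-articles.xml.bz2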
Then you run the script like this:
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2
It will print a lot of output (the article titles) and write a file called wiki.txt. This is your corpus.
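To get a rough idea of how big the resulting corpus is, you can inspect it with standard Unix tools (this is just a sanity check, not part of WikiExtractor itself):

$ wc -lw wiki.txt
$ head wiki.txt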