Revision as of 14:37, 9 July 2016
WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.
You can find the script here: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/...
You can find a Wikipedia dump here: https://dumps.wikimedia.org/backup-index.html
Navigate to the page for the language you are interested in; there will be a link called, for example, euwiki.
You want the file whose name ends in: pages-articles.xml.bz2
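As a rough illustration of how the dump file name is put together, the sketch below builds it from a language code and a dump date, following the pattern of the euwiki file used on this page. The helper names `dump_filename` and `dump_url` are hypothetical, and the download URL layout is an assumption about the dump server, not something WikiExtractor provides.

```python
def dump_filename(lang, date):
    """Return the pages-articles dump name for a language code and a YYYYMMDD date."""
    return f"{lang}wiki-{date}-pages-articles.xml.bz2"


def dump_url(lang, date, base="https://dumps.wikimedia.org"):
    """Return a download URL for the dump.

    Assumed layout (not confirmed by this page): <base>/<lang>wiki/<date>/<file name>
    """
    return f"{base}/{lang}wiki/{date}/{dump_filename(lang, date)}"


# Basque Wikipedia, dump of 1 July 2016
print(dump_filename("eu", "20160701"))  # prints euwiki-20160701-pages-articles.xml.bz2
```

Check the resulting name against the file list on the dump page before downloading, since not every date has a complete dump.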
Then run it like this:
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2
It will print a lot of output (the article titles) and write a file called
wiki.txt. This is your corpus.
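Once wiki.txt exists, a quick sanity check of the corpus can be useful. The following is a minimal sketch, assuming that wiki.txt is plain UTF-8 text and that splitting on whitespace is a good enough tokenisation for a first look; the function name `corpus_stats` is hypothetical and not part of WikiExtractor.

```python
from collections import Counter


def corpus_stats(path="wiki.txt", top_n=10):
    """Count non-empty lines, whitespace-separated tokens and distinct tokens."""
    lines = 0
    counts = Counter()
    # Assumption: the corpus is UTF-8 encoded plain text.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            words = line.split()
            if not words:
                continue  # skip blank lines
            lines += 1
            counts.update(words)
    return {
        "lines": lines,
        "tokens": sum(counts.values()),
        "types": len(counts),
        "top": counts.most_common(top_n),
    }
```

Calling `corpus_stats("wiki.txt")` after a run returns a small dictionary; a very low line or token count probably means the wrong dump file was used or the extraction stopped early.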