WikiExtractor
Revision as of 11:41, 18 August 2016 by Francis Tyers (talk | contribs)
WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.
You can find the script here:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
You can find Wikipedia dumps here:
https://dumps.wikimedia.org/backup-index.html
Navigate to the page for the language you are interested in; there will be a link called <language code>wiki.
You want the file that ends in -pages-articles.xml.bz2, e.g. euwiki-20160701-pages-articles.xml.bz2
You'll need to download the file; you can use wget, curl, or a similar tool:
$ wget <url>
Then run the script like this:
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2
It will print a lot of output (the article titles) and write a file called wiki.txt. This is your corpus.
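A quick sanity check on the result is to count the lines and tokens in the extracted corpus. A minimal sketch (the filename wiki.txt comes from the step above; the whitespace tokenisation here is just an illustration, not what Apertium tools use):

```python
# Rough sanity check on the extracted corpus: count lines and
# whitespace-separated tokens in wiki.txt.
def corpus_stats(path):
    lines = 0
    tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            lines += 1
            tokens += len(line.split())
    return lines, tokens

if __name__ == "__main__":
    n_lines, n_tokens = corpus_stats("wiki.txt")
    print("lines:", n_lines, "tokens:", n_tokens)
```

If the counts look implausibly small, the dump may have been truncated during download, so re-download it and run the extraction again.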