Wikipedia Extractor

From Apertium
Revision as of 21:37, 4 June 2020 by Firespeaker (talk | contribs)

This tool extracts the main text from Wikipedia XML dump files (ideally the "pages-articles" file), producing a plain-text corpus that is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
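To give a rough idea of what this kind of extraction involves, here is a minimal sketch that streams page titles and raw text out of a MediaWiki XML dump using only the Python standard library. It is not the extractor's actual code: the function name iter_page_text and the sample data are made up for illustration, and the sketch does not strip wiki markup the way the real tool does.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_text(stream):
    """Yield (title, raw wikitext) for each <page> in a MediaWiki XML dump.

    Hypothetical helper for illustration; wiki markup is left untouched.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump elements carry an XML namespace; keep only the local tag name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the page's children to limit memory use


# Tiny made-up sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Some '''wiki''' text.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_page_text(io.BytesIO(SAMPLE)))
```

Parsing incrementally rather than loading the whole file matters here because real dumps are far too large to hold in memory at once.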

It has been modified by a number of people, including BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at

This version is much simpler than the old one: it automatically strips all formatting and writes only the text to a single output file. To use it, run the following command in your terminal, where dump.xml.bz2 is the Wikipedia dump:

$ python3 --infn dump.xml.bz2

(Note: If you are on a Mac, make sure that -- is really two hyphens and not an em-dash like this: —).

This runs through all of the articles, collects their text and writes it to wiki.txt. Compressed input (bzip2 and gzip) is also supported, so you can pass dump.xml.bz2 or dump.xml.gz instead of dump.xml. To compress the output file as well (with bzip2), add --compress to the command.
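If you want to read or write such files from your own Python scripts, one common approach is to pick the opener from the file extension. The sketch below assumes that convention; open_text is a hypothetical helper, and this page does not say how the extractor itself detects compression.

```python
import bz2
import gzip
import os
import tempfile


def open_text(path, mode="rt"):
    """Open a plain, gzip or bzip2 file as UTF-8 text, chosen by extension.

    Hypothetical helper; the extension check is an assumption of this sketch.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


# Round-trip one line through each supported format in a temporary directory.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("wiki.txt", "wiki.txt.gz", "wiki.txt.bz2"):
        path = os.path.join(tmp, name)
        with open_text(path, "wt") as out:
            out.write("Hello corpus\n")
        with open_text(path) as src:
            results[name] = src.read()
```

The same helper can then be used to read a compressed wiki.txt line by line without decompressing it to disk first.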

You can also run python3 --help to get more details.

See also