This tool extracts the main text from Wikipedia and produces a plain-text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.


License: GPLv3.


1. Get the script

2. Download the Wikipedia dump file here

Taking Chinese as an example, download the file zhwiki-20130625-pages-articles.xml.bz2 from this page. Alternatively, we can download the latest version from this page.

3. Use the script

mkdir output

bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor -o output

Optionally, we can use

"-c" to compress the output files and save disk space, and

"-b" to set the number of bytes per output file.

For more information, run "./WikiExtractor --help".

OK, let's have a cup of tea and come back an hour later. The output should be in output/AA/wikiXX, where the wikiXX files contain the extracted text.
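
To check that the extraction produced something sensible, the output tree can be listed with a few lines of Python. This is only a sketch, written for Python 3: the function name list_extracted_files is hypothetical (it is not part of WikiExtractor), and the output/AA/wikiXX layout is taken from the description above.

```python
import os


def list_extracted_files(root):
    """Return a sorted list of (path, size_in_bytes) for every file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found.append((path, os.path.getsize(path)))
    return sorted(found)
```

For example, list_extracted_files("output") should return entries such as output/AA/wikiXX together with their sizes. An empty list suggests that the extraction step did not write anything.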

4. Clean up "<>" tags

We are only one step away from the final text corpus, because the wikiXX files still contain links. Let's use the following tiny script, which reads text from standard input and writes the cleaned text to standard output, to filter out the "<>" tags.

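#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read text on stdin, remove "<...>" tags, and write the result to stdout.
# Uses only syntax that is valid in both Python 2 and Python 3.
import re
import sys

# Non-greedy match of everything between "<" and the next ">" on one line.
regex = re.compile(r"<.*?>")

for line in sys.stdin:
    # Each line keeps its own newline, so no character is cut off when the
    # last line of the input has no trailing newline.
    sys.stdout.write(regex.sub("", line))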
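
The same filter can also be applied to every wikiXX file in one pass, rather than piping the files through the script one at a time. The following is a minimal sketch under a few assumptions: it targets Python 3, the extracted files are uncompressed UTF-8 text (that is, "-c" was not used), and strip_tags and build_corpus are hypothetical helper names that are not part of WikiExtractor.

```python
import io
import os
import re

# Same pattern as in the script above: everything between "<" and the next ">".
TAG = re.compile(r"<.*?>")


def strip_tags(line):
    """Return line with every "<...>" tag removed."""
    return TAG.sub("", line)


def build_corpus(root, corpus_path):
    """Write all files under root, with tags removed, into corpus_path.

    corpus_path should lie outside root so the corpus is not read back in.
    Returns the number of lines written.
    """
    written = 0
    with io.open(corpus_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                with io.open(path, encoding="utf-8") as src:
                    for line in src:
                        out.write(strip_tags(line))
                        written += 1
    return written
```

A call such as build_corpus("output", "corpus.txt") would then produce a single text file. As with the script above, a line that consisted only of a tag becomes an empty line in the result.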

5. Done :)