Wikipedia Extractor
Goal
This tool extracts the main text from Wikipedia pages, producing a plain-text corpus that is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
Tool
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
License: GPLv3.
Applicable
Yes: Wikipedia, Wikivoyage, Wikibooks
No : Wiktionary
(Thanks to the feedback by Per Tunedal!)
Usage
1. Get the script
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
2. Download the Wikipedia dump file
http://dumps.wikimedia.org/backup-index.html
Take Chinese as an example: download the file zhwiki-20130625-pages-articles.xml.bz2 from this page, http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, we can download the latest version from this page: http://dumps.wikimedia.org/zhwiki/latest/
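The dump-file names follow a fixed pattern, so the URL for a given language and dump date can be assembled programmatically. Here is a small sketch; dump_url is a hypothetical helper name, and it assumes only the naming pattern shown above:

```python
def dump_url(lang, date):
    # Build the URL of a pages-articles dump, e.g. for lang="zh",
    # date="20130625":
    # http://dumps.wikimedia.org/zhwiki/20130625/zhwiki-20130625-pages-articles.xml.bz2
    base = "http://dumps.wikimedia.org/%swiki/%s" % (lang, date)
    return "%s/%swiki-%s-pages-articles.xml.bz2" % (base, lang, date)
```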
3. Use the script
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output
cat output/*/* > zhwiki.text
Optionally, we can use
"-c" to compress the output files and save disk space, and
"-b" to set the size of each output file in bytes.
For more information, type "./WikiExtractor.py --help".
OK, let's have a cup of tea and come back an hour later. The output appears under output/AA/ as files named wikiXX, which contain the extracted text.
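On systems without a Unix shell, the concatenation step (cat output/*/* > zhwiki.text) can be replicated in a few lines of Python; concat_extracted is a hypothetical helper written here for illustration, not part of WikiExtractor:

```python
import glob
import os

def concat_extracted(out_dir, dest_path):
    # Concatenate every extracted file (output/AA/wiki00, ...) into
    # one corpus file, in sorted order for reproducibility.
    with open(dest_path, "w") as dest:
        for path in sorted(glob.glob(os.path.join(out_dir, "*", "*"))):
            with open(path) as src:
                dest.write(src.read())
```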
4. Clean up "<>" tags
We are only one step away from the final text corpus, because there are still links in wikiXX files. Let's use the following tiny script to filter out "<>" tags and special "__XXX__" marks.
#! /usr/bin/python
# -*- coding:utf-8 -*-
import sys
import re

re1 = re.compile(ur"<.*?>")      # ref tags
re2 = re.compile(ur"__[A-Z]+__") # special marks, e.g. __TOC__ __NOTOC__

line = sys.stdin.readline()
while line != "":
    line = re1.sub("", re2.sub("", line))
    print line,
    line = sys.stdin.readline()
Save the above in a file filter.py, and:
python filter.py < zhwiki.text > zhwiki.filter.text
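To see what the two regexes remove, here is a small illustration on a made-up input line (written in Python 3 syntax, so without the ur"..." prefixes of the Python 2 script above):

```python
import re

re1 = re.compile(r"<.*?>")       # "<>" tags (non-greedy, one tag at a time)
re2 = re.compile(r"__[A-Z]+__")  # special marks, e.g. __TOC__

sample = '__TOC__ A <a href="x">link</a> in text.'
cleaned = re1.sub("", re2.sub("", sample))
# cleaned is now ' A link in text.'
```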