Wikipedia Extractor
Goal
This tool extracts the main text from Wikipedia XML dump files, producing a text corpus that is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
Tool
There are three versions of Wikipedia Extractor:
The original, which works with Python 2.x and outputs multiple files: http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py (license: GPL v3)
Guampa's version, which works with Python 3.x and outputs multiple files: https://github.com/hltdi/guampa/blob/master/wikipedia-import/WikiExtractor.py (license: GPL v3)
The new version, a modification of Guampa's, which works with Python 3.x and outputs a single file: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
Applicable
Works well: Wikipedia, Wikivoyage, Wikibooks
Problematic: Wiktionary (not all articles are fully included, and explanations of foreign words are mistakenly included.)
(Thanks to Per Tunedal for the feedback!)
New Version
The new version was written by BenStobaugh during GCI-2013 and can be downloaded from SVN at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
This version is much simpler than the old one: it automatically removes all formatting and writes only the text to a single file. To use it, run the following command in your terminal, where dump.xml.bz2 is the Wikipedia dump:
$ python3 WikiExtractor.py --infn dump.xml.bz2
This runs through all of the articles, extracts their text, and writes it to wiki.txt. This version also supports compressed input (BZip2 and Gzip), so you can pass dump.xml.bz2 or dump.xml.gz instead of dump.xml.
You can also compress the output file (BZip2) by adding --compress to the command.
Run python3 WikiExtractor.py --help for more details.
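If you want to inspect a dump yourself before running the extractor, the same "plain, BZip2, or Gzip" handling can be approximated with the standard library. The following is only a minimal sketch, not the code WikiExtractor.py itself uses; the function name open_dump and the choice to detect the format from the file extension are assumptions made for illustration.

```python
import bz2
import gzip


def open_dump(path):
    """Open a dump as a UTF-8 text stream, whatever its compression.

    Hypothetical helper: the format is guessed from the file extension,
    which is an assumption and not necessarily what WikiExtractor.py does.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

For example, "with open_dump('dump.xml.bz2') as dump:" followed by a loop over dump reads the file line by line without decompressing it to disk first.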
Old Version
1. Get the script
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
2. Download the Wikipedia dump file
http://dumps.wikimedia.org/backup-index.html
Taking Chinese as an example, download the file zhwiki-20130625-pages-articles.xml.bz2 from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, download the latest version from http://dumps.wikimedia.org/zhwiki/latest/
3. Use the script
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor -o output
cat output/*/* > zhwiki.text
Optionally, we can use
"-c" to compress the output files and save disk space, and
"-b" to set the number of bytes per output file.
For more information, type "./WikiExtractor --help".
Ok, let's have a cup of tea and come back an hour later. The output should be output/AA/wikiXX, where wikiXX are the extracted texts.
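On systems without cat, the merging step can be done in Python instead. This is a rough sketch that assumes the output/AA/wikiXX layout described above and uncompressed output files (that is, the extractor was run without "-c"); the function name merge_extracted is made up for this example.

```python
import glob
import os
import shutil


def merge_extracted(output_dir, merged_path):
    """Concatenate every file matching output_dir/*/* into merged_path.

    Files are appended in sorted path order, byte for byte, so no
    encoding is assumed. merged_path should live outside output_dir.
    Returns the number of files merged.
    """
    paths = sorted(glob.glob(os.path.join(output_dir, "*", "*")))
    with open(merged_path, "wb") as merged:
        for path in paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(paths)
```

Calling merge_extracted("output", "zhwiki.text") should then produce the same kind of file as the cat command in step 3.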
4. Clean up "<>" tags
We are only one step away from the final text corpus, because there are still links in wikiXX files. Let's use the following tiny script to filter out "<>" tags and special "__XXX__" marks.
#! /usr/bin/python
# -*- coding:utf-8 -*-
import sys
import re

re1 = re.compile(ur"<.*?>")  # ref tags
re2 = re.compile(ur"__[A-Z]+__")  # special marks e.g. __TOC__ __NOTOC__

line = sys.stdin.readline()
while line != "":
    line = re1.sub("", re2.sub("", line))
    print line,
    line = sys.stdin.readline()
Save the above lines in a file named filter.py, then run:
python filter.py < zhwiki.text > zhwiki.filter.text
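The filter script above is written for Python 2 (it uses the print statement and ur"" string prefixes). If only Python 3 is available, the same two substitutions could be written roughly as follows. This is a sketch rather than part of the original tool; the names strip_markup and filter_stream are chosen here for illustration.

```python
import re

TAG = re.compile(r"<.*?>")        # any "<...>" tag, e.g. <doc ...>, </doc>, <a href=...>
MARK = re.compile(r"__[A-Z]+__")  # special marks, e.g. __TOC__ and __NOTOC__


def strip_markup(line):
    """Remove special marks first, then "<>" tags, as the Python 2 script does."""
    return TAG.sub("", MARK.sub("", line))


def filter_stream(src, dst):
    """Copy src to dst line by line, stripping markup from each line."""
    for line in src:
        dst.write(strip_markup(line))
```

Adding "import sys" and a final call filter_stream(sys.stdin, sys.stdout) turns this into a replacement for filter.py that is run the same way, with python3 in place of python. Note that Python 3 decodes standard input using the locale's encoding, so a UTF-8 locale is assumed for Wikipedia text.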