Wikipedia Extractor

Goal

This tool extracts the main text from a Wikipedia dump, producing a plain-text corpus that is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.

Tool

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

License: GPLv3.

Usage

1. Get the script

  http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

2. Download the Wikipedia dump file

  http://dumps.wikimedia.org/backup-index.html

Taking Chinese as an example, download the file zhwiki-20130625-pages-articles.xml.bz2 from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, we can download the latest version from http://dumps.wikimedia.org/zhwiki/latest/.
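
If we would rather script the download than fetch it through a browser, here is a minimal sketch using only the Python standard library (the script name fetch_dump.py is made up, and the URL is just the 20130625 Chinese example above; substitute the dump you actually want):

#! /usr/bin/python
# fetch_dump.py: download a Wikipedia dump file (illustrative sketch)
try:
    from urllib.request import urlretrieve   # Python 3
except ImportError:
    from urllib import urlretrieve            # Python 2
url = "http://dumps.wikimedia.org/zhwiki/20130625/zhwiki-20130625-pages-articles.xml.bz2"
urlretrieve(url, "zhwiki-20130625-pages-articles.xml.bz2")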

3. Use the script

mkdir output

bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output

cat output/*/* > zhwiki.text

Optionally, we can use

"-c" to compress the output files and save disk space, and

"-b" to set the maximum size of each output file (in bytes).

For more information, run "./WikiExtractor.py --help".

OK, let's have a cup of tea and come back an hour later. The output will be in output/AA/wikiXX, where the wikiXX files contain the extracted text.
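
Note that with "-c" the output files are bzip2-compressed, so the plain "cat" above would not yield readable text. Here is a minimal Python sketch that concatenates both plain and compressed output into one corpus file (the script name concat_output.py is made up, and it assumes compressed files carry a ".bz2" extension):

#! /usr/bin/python
# concat_output.py: equivalent of "cat output/*/* > zhwiki.text",
# but also handles the .bz2 files produced when WikiExtractor is run with -c
import bz2
import glob
import io

with io.open("zhwiki.text", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("output/*/*")):
        if path.endswith(".bz2"):
            text = bz2.BZ2File(path).read().decode("utf-8")
        else:
            text = io.open(path, encoding="utf-8").read()
        out.write(text)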

4. Clean up "<>" tags and "__XXX__" marks

We are only one step away from the final text corpus: the wikiXX files still contain "<>" tags and special "__XXX__" marks (e.g. __NOTOC__). Let's use the following tiny script to filter them out.


#! /usr/bin/python
# -*- coding: utf-8 -*-
# filter.py: strip leftover "<>" tags and "__XXX__" marks from WikiExtractor output
import sys
import re

re1 = re.compile(r"<.*?>")       # leftover tags, e.g. <doc ...>, </doc>, <ref>
re2 = re.compile(r"__[A-Z]+__")  # behaviour switches, e.g. __TOC__, __NOTOC__

for line in sys.stdin:
    sys.stdout.write(re1.sub("", re2.sub("", line)))

Save the above script as filter.py, then run:

python filter.py < zhwiki.text > zhwiki.filter.text
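
To see what the filter does, here is a quick interactive check of the two regular expressions on a made-up input line:

>>> import re
>>> re1 = re.compile(r"<.*?>")
>>> re2 = re.compile(r"__[A-Z]+__")
>>> re1.sub("", re2.sub("", '__NOTOC__Some text with a <ref>footnote</ref> tag.'))
'Some text with a footnote tag.'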

5. Done :)