Wikipedia Extractor

A single script to extract Wikipedia dumps to text format (able to take compressed input and write compressed output), based on the documentation below, was put together by BenStobaugh during GCI 2013. It is available in SVN at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

Goal

This tool extracts the main text from Wikipedia, producing a plain-text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
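
As a rough illustration of what such a corpus feeds into, here is a minimal sketch that counts word bigrams with Python. It assumes the final filtered corpus file zhwiki.filter.text produced at the end of this page; for Chinese you would normally run a word segmenter first, so the whitespace split below is only a placeholder.

import collections

counts = collections.Counter()
with open("zhwiki.filter.text", encoding="utf-8") as corpus:
    for line in corpus:
        tokens = line.split()  # naive whitespace tokenisation; Chinese needs a segmenter
        counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs (bigrams)

# Print the ten most frequent bigrams with their counts.
for bigram, freq in counts.most_common(10):
    print(freq, " ".join(bigram))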

Tool

http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

License: GPL v3.

Applicability

Works well: Wikipedia, Wikivoyage, Wikibooks

Problematic: Wiktionary (not all articles are extracted in full, and explanations of foreign words are included by mistake)

(Thanks to Per Tunedal for the feedback!)

Usage

1. Get the script

  http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py

2. Download the Wikipedia dump file

  http://dumps.wikimedia.org/backup-index.html

Taking Chinese as an example, download the file zhwiki-20130625-pages-articles.xml.bz2 from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, we can download the latest version from http://dumps.wikimedia.org/zhwiki/latest/.
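
If you prefer to fetch the dump from a script instead of a browser, a sketch like the following works. The URL and file name are just the example above and need adjusting for other languages or dump dates.

import urllib.request

url = ("http://dumps.wikimedia.org/zhwiki/20130625/"
       "zhwiki-20130625-pages-articles.xml.bz2")
# Saves the compressed dump in the current directory; this is a large file.
urllib.request.urlretrieve(url, "zhwiki-20130625-pages-articles.xml.bz2")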

3. Use the script

mkdir output

bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor.py -o output

cat output/*/* > zhwiki.text

Optionally, we can use:

"-c" to compress the output files and save disk space, and

"-b" to set the maximum size (in bytes) of each output file.

For more information, run "./WikiExtractor.py --help".

OK, let's have a cup of tea and come back an hour later. The output will be under output/AA/wikiXX, where the wikiXX files contain the extracted text.
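
The cat command above is all that is needed for plain output. If "-c" was used, the parts under output/ are bzip2-compressed instead, and a small sketch like the following can decompress them while concatenating (file names and directory layout as described above):

import bz2
import glob
import os
import shutil

with open("zhwiki.text", "wb") as out:
    for path in sorted(glob.glob(os.path.join("output", "*", "*"))):
        # Parts written with "-c" end in .bz2 and need decompressing first.
        opener = bz2.open if path.endswith(".bz2") else open
        with opener(path, "rb") as part:
            shutil.copyfileobj(part, out)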

4. Clean up "<>" tags

We are only one step away from the final text corpus, because there are still links and other "<>" tags in the wikiXX files. Let's use the following tiny script to filter out these tags and the special "__XXX__" marks.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Strip leftover "<...>" tags and "__XXX__" marks from the extracted text.
import re
import sys

re1 = re.compile(r"<.*?>")       # tags, e.g. <a href="...">...</a>, <ref>
re2 = re.compile(r"__[A-Z]+__")  # special marks, e.g. __TOC__, __NOTOC__

for line in sys.stdin:
    sys.stdout.write(re1.sub("", re2.sub("", line)))

Save the above lines in a file called filter.py, and run:

python3 filter.py < zhwiki.text > zhwiki.filter.text

5. Done :)