Difference between revisions of "Wikipedia Extractor"

From Apertium
(15 intermediate revisions by 6 users not shown)
{{TOCD}}
 
A single script that extracts Wikipedia dumps to plain text (it can read compressed input and write compressed output) was put together by BenStobaugh during GCI 2013, based on the documentation below. It is available in SVN at [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py].
 
 
== Goal ==
   
This tool extracts main text from Wikipedia, producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
+
This tool extracts the main text from XML [[Wikipedia dump]] files (at http://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
   
 
It was written by BenStobaugh during GCI-2013 and can be downloaded from SVN at [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py]
== Tool ==
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
   
 
This version is much simpler than the old one: it strips all formatting automatically and writes the text to a single file. To use it, run the following command in your terminal, where dump.xml.bz2 is the Wikipedia dump:
License GPL-V3.
 
   
 
$ python3 WikiExtractor.py --infn dump.xml.bz2
== Applicable ==
 
Works well with: '''Wikipedia''', '''Wikivoyage''', '''Wikibooks'''
 
   
  +
(Note: If you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: &mdash;).
Has problems with: '''Wiktionary''' (some articles are not fully included, and explanations of foreign words are wrongly included).
 
 
(Thanks to the feedback by Per Tunedal!)
 
 
== New Version ==
 
The newer version was written by BenStobaugh during GCI-2013 and can be downloaded from SVN at [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py]
 
 
 
This version is much simpler than the old one: it strips all formatting automatically and writes the text to a single file. To use it, run the following command in your terminal, where dump.xml is the Wikipedia dump:
 
 
$ python3 WikiExtractor.py --infn dump.xml
 
   
 
This runs through all of the articles, extracts their text, and writes it to wiki.txt. This version also supports compressed input (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress the output file (BZip2) by adding <code>--compress</code> to the command.
   
You can also run <code>python3 WikiExtractor.py --help</code> to get more details.
== Old Version ==
 
 
=== 1. Get the script ===
 
 
http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py
 
 
=== 2. Download the Wikipedia dump file ===
 
 
http://dumps.wikimedia.org/backup-index.html
 
 
Taking Chinese as an example, download the file '''zhwiki-20130625-pages-articles.xml.bz2''' from http://dumps.wikimedia.org/zhwiki/20130625/. Alternatively, download the latest version from http://dumps.wikimedia.org/zhwiki/latest/
 
 
=== 3. Use the script ===
 
<pre>
 
mkdir output
bzcat zhwiki-20130625-pages-articles.xml.bz2 | ./WikiExtractor -o output
cat output/*/* > zhwiki.text
 
 
</pre>
 
 
Optionally, we can use
* "-c" to compress the output and save disk space, and
* "-b" to set the number of bytes per output file.

For more information, run "./WikiExtractor --help".
 
 
Ok, let's have a cup of tea and come back an hour later. The output should be '''output/AA/wikiXX''', where wikiXX are the extracted texts.
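
For environments without a shell, the concatenation step (<code>cat output/*/* > zhwiki.text</code>) can be approximated in Python. This is only a sketch: the function name <code>concatenate</code> is made up for illustration, and it assumes the output/AA/wikiXX layout described above and output written without "-c" (compressed parts would need decompressing first).

```python
import glob
import os
import shutil


def concatenate(output_dir, target_path):
    """Append every regular file under output_dir/*/ to target_path.

    Files are taken in sorted path order, which roughly matches how the
    shell expands output/*/*. Returns the number of files copied.
    """
    pattern = os.path.join(output_dir, "*", "*")
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.isfile(p)]
    with open(target_path, "wb") as target:
        for path in paths:
            # Copy bytes unchanged so the text encoding is never touched.
            with open(path, "rb") as part:
                shutil.copyfileobj(part, target)
    return len(paths)
```

Calling <code>concatenate("output", "zhwiki.text")</code> would then play the role of the <code>cat</code> line.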
 
 
=== 4. Clean up "<>" tags ===
 
 
We are only one step away from the final text corpus, because there are still links in wikiXX files. Let's use the following tiny script to filter out "<>" tags and special "__XXX__" marks.
 
 
<pre>
 
 
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Strip "<...>" tags and "__XXX__" marks from text read on stdin.
import re
import sys

re1 = re.compile(r"<.*?>")       # ref tags and other <...> markup
re2 = re.compile(r"__[A-Z]+__")  # special marks, e.g. __TOC__ __NOTOC__

for line in sys.stdin:
    # Each line keeps its own newline, so write it back unchanged otherwise.
    sys.stdout.write(re1.sub("", re2.sub("", line)))
 
</pre>
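
To see what the two patterns in the script remove, here is a small self-contained check. The sample line and the helper name <code>clean_line</code> are invented for illustration and are not part of the original script.

```python
import re

TAG = re.compile(r"<.*?>")        # non-greedy: matches one tag at a time
MARK = re.compile(r"__[A-Z]+__")  # marks such as __TOC__ or __NOTOC__


def clean_line(line):
    """Apply both substitutions in the same order as the filter script."""
    return TAG.sub("", MARK.sub("", line))


sample = '__NOTOC__See <a href="Tea">tea</a> for details.'
print(clean_line(sample))  # prints: See tea for details.
```

Note that the tag pattern removes only the tags themselves; the text between an opening and a closing tag is kept.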
 
 
Save the stdin-filtering script above in a file '''filter.py''', and run:

<pre>
python filter.py < zhwiki.text > zhwiki.filter.text
</pre>

=== 5. done :) ===

==See also==
* [[Wikipedia dumps]]

[[Category:Tools]]
[[Category:Resources]]
[[Category:Development]]
[[Category:Corpora]]
[[Category:Documentation in English]]

Revision as of 08:06, 13 September 2017


Goal

This tool extracts the main text from XML Wikipedia dump files (at http://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles" file), producing a text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.

It was written by BenStobaugh during GCI-2013 and can be downloaded from SVN at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

This version is much simpler than the old one: it strips all formatting automatically and writes the text to a single file. To use it, run the following command in your terminal, where dump.xml.bz2 is the Wikipedia dump:

$ python3 WikiExtractor.py --infn dump.xml.bz2

(Note: If you are on a Mac, make sure that -- is really two hyphens and not an em-dash like this: —).

This runs through all of the articles, extracts their text, and writes it to wiki.txt. The input may be an uncompressed dump.xml or a compressed one (BZip2 or Gzip, i.e. dump.xml.bz2 or dump.xml.gz). You can also compress the output file (BZip2) by adding --compress to the command.

You can also run python3 WikiExtractor.py --help to get more details.
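
For readers curious how compressed dumps can be read page by page, the following is a minimal sketch using only the Python standard library. It is not the code of WikiExtractor.py: the function names (open_dump, local_name, iter_pages) are invented here, the decompressor is chosen purely by file extension, and the text it yields is raw wiki markup, so the formatting removal that the tool performs is not shown.

```python
import bz2
import gzip
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a dump for binary reading, picking a decompressor by extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, raw wikitext) for each <page> element in the stream."""
    # iterparse reads incrementally, so the whole dump is never held at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to limit memory use
```

A caller would open the file with open_dump("dump.xml.bz2") in a with-statement and loop over iter_pages(stream), writing each text to the output file.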

See also