== Goal ==

This tool extracts the main text from XML [[Wikipedia dump]] files (available at http://dumps.wikimedia.org/backup-index.html), producing a plain-text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.
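
As a rough sketch of the kind of downstream use this enables, the following Python snippet counts word bigrams over the extracted corpus (<code>wiki.txt</code> is the output file produced by the new version described below; the whitespace tokenization is a deliberately naive placeholder, not something the tool provides):

 from collections import Counter
 # Count word bigrams in the extracted corpus, e.g. as raw material for
 # an n-gram language model. Whitespace tokenization is a deliberately
 # naive placeholder; real preprocessing would use a proper tokenizer.
 bigram_counts = Counter()
 with open("wiki.txt", encoding="utf-8") as corpus:
     for line in corpus:
         tokens = line.split()
         bigram_counts.update(zip(tokens, tokens[1:]))
 # Show the ten most frequent bigrams.
 for (first, second), count in bigram_counts.most_common(10):
     print(first, second, count)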

== Tool ==

There are three versions of Wikipedia Extractor:

* The original, which works with Python 2.x and outputs multiple files: http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py (license: GPL v3).
* Guampa's version, which works with Python 3.x and outputs multiple files: https://github.com/hltdi/guampa/blob/master/wikipedia-import/WikiExtractor.py (license: GPL v3).
* The new version, a modified version of Guampa's, which works with Python 3.x and outputs a single file: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

== Applicability ==

* Works well: Wikipedia, Wikivoyage, Wikibooks
* Problematic: Wiktionary (not all articles are fully included, and explanations of foreign words are mistakenly included)

(Thanks to Per Tunedal for the feedback!)

== New Version ==

The new version was written by BenStobaugh during Google Code-in 2013 (GCI-2013) and can be downloaded from SVN at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
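
For instance, one way to fetch it from the command line (assuming <code>wget</code> is available):

   $ wget https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py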


This version is much simpler than the old one: it automatically strips all wiki formatting and writes only the article text to a single file. To use it, run the following command in your terminal, where <code>dump.xml.bz2</code> is the Wikipedia dump:

   $ python3 WikiExtractor.py --infn dump.xml.bz2

This will iterate through all of the articles, extract their text, and write it to <code>wiki.txt</code>. This version also supports compressed input (BZip2 and Gzip), so you can pass <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress the output file (BZip2) by adding <code>--compress</code> to the command.
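
For example, combining the two options described above (this exact combination is an illustration; both flags are documented individually above), reading a Gzip-compressed dump and writing BZip2-compressed output:

   $ python3 WikiExtractor.py --infn dump.xml.gz --compress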

You can also run <code>python3 WikiExtractor.py --help</code> to get more details.

== See also ==

* [[Wikipedia dumps]]

[[Category:Resources]]
[[Category:Development]]
[[Category:Corpora]]