{{TOCD}}

== Goal ==

This tool extracts the main text from XML [[Wikipedia dump]] files (available from https://dumps.wikimedia.org/backup-index.html, ideally the "'''pages-articles.xml.bz2'''" file), producing a plain-text corpus, which is useful for training unsupervised part-of-speech taggers, n-gram language models, etc.

It was modified by a number of people, including BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at [https://github.com/apertium/WikiExtractor https://github.com/apertium/WikiExtractor]. The original version (at http://code.google.com/p/natural-language-qa/source/browse/MakeCorpus/WikiExtractor.py) was released under the GPL-V3.

This version is much simpler than the old one: it automatically removes any formatting and outputs the text to a single file. To use it, run the following command in your terminal, where <code>dump.xml.bz2</code> is the Wikipedia dump you downloaded:
<pre>
$ python3 WikiExtractor.py --infn dump.xml.bz2
</pre>
(Note: if you are on a Mac, make sure that <tt>--</tt> is really two hyphens and not an em-dash like this: —.)

This will run through all of the articles, extract all of the text and put it in <code>wiki.txt</code>. This version also accepts compressed input (BZip2 and Gzip), so you can use <code>dump.xml.bz2</code> or <code>dump.xml.gz</code> instead of <code>dump.xml</code>. You can also compress (BZip2) the output file by adding <code>--compress</code> to the command.

You can also run <code>python3 WikiExtractor.py --help</code> for more details.
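If you plan to process the corpus from Python, the following minimal sketch (not part of WikiExtractor; the file name is only an example) reads the output whether or not you used <code>--compress</code>, using only the standard library:

<pre>
#!/usr/bin/env python3
# Minimal sketch: open the extracted corpus whether it is plain text,
# BZip2-compressed or Gzip-compressed, and show the first few lines.
# "wiki.txt.bz2" is just an example of what --compress produces.

import bz2
import gzip
import sys


def open_corpus(path):
    """Choose a reader based on the file extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "wiki.txt.bz2"
    with open_corpus(path) as corpus:
        for i, line in enumerate(corpus):
            print(line.rstrip("\n"))
            if i >= 9:  # only show the first ten lines
                break
</pre>

Reading line by line like this keeps memory use low even for a large Wikipedia.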
== Steps ==
Here's a simple step-by-step guide to the above.

# Get the <code>WikiExtractor.py</code> script from https://github.com/apertium/WikiExtractor:
#: <code>$ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py</code>
# Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The below uses a hypothetical file name for a language with the <tt>xyz</tt> code:
#: <code>$ wget https://dumps.wikimedia.org/xyzwiki/20230120/xyzwiki-20230120-pages-articles.xml.bz2</code>
# Run the script on the Wikipedia dump file:
#: <code>$ python3 WikiExtractor.py --infn xyzwiki-20230120-pages-articles.xml.bz2 --compress</code>
This will output a file called <code>wiki.txt.bz2</code>. You will probably want to rename it to something like <code>xyz.wikipedia.20230120.txt.bz2</code>. |
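As a small illustration of the corpus uses mentioned in the Goal section, here is a sketch (not part of WikiExtractor; it assumes the renamed file above and uses deliberately naive whitespace tokenisation) that builds a word-frequency list, a typical first step towards n-gram language modelling:

<pre>
#!/usr/bin/env python3
# Minimal sketch: count word frequencies in the extracted corpus.
# The file name is the hypothetical one suggested above; splitting on
# whitespace is a deliberately naive tokenisation.

import bz2
from collections import Counter

CORPUS = "xyz.wikipedia.20230120.txt.bz2"

counts = Counter()
with bz2.open(CORPUS, "rt", encoding="utf-8") as corpus:
    for line in corpus:
        counts.update(line.lower().split())

# Print the 20 most frequent tokens with their counts.
for token, freq in counts.most_common(20):
    print(freq, token, sep="\t")
</pre>

For real language-model training you would of course want proper tokenisation and a dedicated toolkit, but this shows the extracted text is immediately usable.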
== See also ==
* [[Wikipedia dumps]]
[[Category:Resources]] |
[[Category:Development]] |
[[Category:Corpora]] |
[[Category:Documentation in English]] |