Wikipedia Extractor

From Apertium
Latest revision as of 18:55, 30 January 2023

Goal

This tool extracts the main text from XML Wikipedia dump files (available at https://dumps.wikimedia.org/backup-index.html, ideally the "pages-articles.xml.bz2" file), producing a text corpus that is useful for training unsupervised part-of-speech taggers, n-gram language models, and similar tools.

It was modified by a number of people, including BenStobaugh during Google Code-In 2013, and can be cloned from GitHub at https://github.com/apertium/WikiExtractor.

This version is much simpler than the old one: it automatically removes all formatting and writes only the text to a single file. To use it, run the following command in your terminal, where dump.xml.bz2 is the Wikipedia dump:

$ python3 WikiExtractor.py --infn dump.xml.bz2

(Note: If you are on a Mac, make sure that -- is really two hyphens and not an em-dash like this: —).

This will run through all of the articles, extract their text and write it to wiki.txt. This version also supports compressed input (bzip2 and gzip), so you can use dump.xml.bz2 or dump.xml.gz instead of dump.xml. You can also compress the output file with bzip2 by adding --compress to the command.
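If you want to read the dump or the extracted corpus from your own Python scripts, the same three formats can be handled uniformly. The following is a minimal sketch, not the code that WikiExtractor.py itself uses: the helper name open_dump is hypothetical, the format is guessed from the file extension only, and UTF-8 encoding is assumed.

```python
import bz2
import gzip


def open_dump(path, mode="rt"):
    """Open a file as text, choosing the opener from the file extension.

    Hypothetical helper for illustration; not part of WikiExtractor.py.
    Assumes UTF-8 text and relies on the extension, not on the file contents.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

For example, `with open_dump("wiki.txt.bz2") as f:` lets you iterate over the lines of a compressed corpus without decompressing it to disk first.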

You can also run python3 WikiExtractor.py --help to get more details.

Steps

Here's a simple step-by-step guide to the above.

  1. Get the WikiExtractor.py script from https://github.com/apertium/WikiExtractor:
    $ wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py
  2. Download the Wikipedia dump for the language in question from http://dumps.wikimedia.org/backup-index.html. The example below uses a hypothetical file name for a language with the code xyz:
    $ wget https://dumps.wikimedia.org/xyzwiki/20230120/xyzwiki-20230120-pages-articles.xml.bz2
  3. Run the script on the Wikipedia dump file:
    $ python3 WikiExtractor.py --infn xyzwiki-20230120-pages-articles.xml.bz2 --compress

This will output a file called wiki.txt.bz2. You will probably want to rename it to something like xyz.wikipedia.20230120.txt.bz2.
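If you process dumps for several languages, the renaming can be scripted. The sketch below derives the suggested corpus name from the dump file name. The function corpus_name is a hypothetical helper, and it assumes the dump follows the pattern shown above (language code, "wiki", an eight-digit date, then "pages-articles.xml.bz2"); other file names are rejected.

```python
import re

# Assumed dump name pattern: <lang>wiki-<YYYYMMDD>-pages-articles.xml.bz2
DUMP_NAME = re.compile(
    r"(?P<lang>[a-z_]+?)wiki-(?P<date>\d{8})-pages-articles\.xml\.bz2"
)


def corpus_name(dump_filename):
    """Return a name such as xyz.wikipedia.20230120.txt.bz2 for a dump file.

    Hypothetical helper for illustration; not part of WikiExtractor.py.
    """
    match = DUMP_NAME.fullmatch(dump_filename)
    if match is None:
        raise ValueError(f"unexpected dump file name: {dump_filename}")
    return f"{match['lang']}.wikipedia.{match['date']}.txt.bz2"


print(corpus_name("xyzwiki-20230120-pages-articles.xml.bz2"))
# prints: xyz.wikipedia.20230120.txt.bz2
```

The result can then be passed to `os.rename("wiki.txt.bz2", ...)` after the extraction has finished.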

See also