WikiExtractor

From Apertium
Latest revision as of 02:47, 10 March 2018

Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.

You can find the script here:

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

You find a Wikipedia dump here:

https://dumps.wikimedia.org/backup-index.html

Navigate to the page of the language you are interested in; there will be a link called <language code>wiki.

You want the file whose name ends in -pages-articles.xml.bz2, e.g. euwiki-20160701-pages-articles.xml.bz2.

You'll need to download the file; you can use wget or curl:

$ wget <url>
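If you are scripting this step, the download URL can be assembled from the language code and the dump date. This is a minimal sketch assuming the standard Wikimedia layout https://dumps.wikimedia.org/<wiki>/<date>/<file>; the language code and date below are just the euwiki example from this page.

```python
# Sketch: build the URL of a -pages-articles dump from a language code and
# a dump date, assuming the standard dumps.wikimedia.org directory layout.
def dump_url(lang_code: str, date: str) -> str:
    wiki = f"{lang_code}wiki"                       # e.g. "euwiki"
    filename = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{filename}"

print(dump_url("eu", "20160701"))
```

Check the dump's index page first: the date component must match an actual dump run for that wiki.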

And you run it like this:

$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2

It will print a lot of output (the article titles) and write a file called wiki.txt. This is your corpus.
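A quick sanity check of the resulting corpus is to count its lines and words. A minimal sketch, assuming the wiki.txt output name above (the two-line sample written here is made up, just so the snippet is self-contained):

```python
# Sketch: count lines and words in the extracted corpus file.
from pathlib import Path

# Stand-in for the real wiki.txt produced by WikiExtractor.py.
Path("wiki.txt").write_text("First sentence.\nSecond sentence.\n", encoding="utf-8")

lines = Path("wiki.txt").read_text(encoding="utf-8").splitlines()
words = sum(len(line.split()) for line in lines)
print(f"{len(lines)} lines, {words} words")  # prints "2 lines, 4 words"
```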