WikiExtractor

''Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see [[Migrating tools to GitHub]].''

'''WikiExtractor''' is a script for extracting a text corpus from Wikipedia dumps.

You can find the script here:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
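
For example, you can fetch it directly with <code>wget</code> (which is also used below):

<pre>
$ wget https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
</pre>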

You can find a Wikipedia dump here:
https://dumps.wikimedia.org/backup-index.html

Navigate to the page of the language you are interested in; there will be a link called <code><language code>wiki</code>.

You want the file that ends in <code>-pages-articles.xml.bz2</code>, e.g. <code>euwiki-20160701-pages-articles.xml.bz2</code>.

You'll need to download the file; you can use <code>wget</code>, <code>curl</code>, or similar:

<pre>
$ wget <url>
</pre>
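
For the Basque example above, the URL should look something like the following; check the dump page for your language for the exact link, since the date part depends on which dump you pick:

<pre>
$ wget https://dumps.wikimedia.org/euwiki/20160701/euwiki-20160701-pages-articles.xml.bz2
</pre>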

And you run it like this:

<pre>
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2
</pre>

It will print a lot of output (the article titles) and write a file called <code>wiki.txt</code>. This is your corpus.
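
To sanity-check the corpus, standard shell tools are enough (nothing here is specific to WikiExtractor):

<pre>
$ head wiki.txt
$ wc -w wiki.txt
</pre>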