WikiExtractor
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.
You can find the script here:
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
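For example, you can fetch the script directly with wget (any equivalent download tool works just as well):

$ wget https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py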
You can find Wikipedia dumps here:
https://dumps.wikimedia.org/backup-index.html
Navigate to the page of the language you are interested in; there will be a link called <language code>wiki.
You want the file that ends in -pages-articles.xml.bz2, e.g. euwiki-20160701-pages-articles.xml.bz2.
You'll need to download the file; you can use wget, curl, or a similar tool:
$ wget <url>
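For the Basque example above, the full URL would look something like the following (the exact path depends on the dump date shown on the index page, so treat this as an illustration only):

$ wget https://dumps.wikimedia.org/euwiki/20160701/euwiki-20160701-pages-articles.xml.bz2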
And you run it like this:
$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2
It will print a lot of output (the article titles) and write a file called wiki.txt. This is your corpus.
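As a quick sanity check (assuming the run finished without errors), you can look at the size and the first few lines of the resulting corpus:

$ wc -l wiki.txt
$ head wiki.txt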