
WikiExtractor

Revision as of 14:37, 9 July 2016

WikiExtractor is a script for extracting a text corpus from Wikipedia dumps.

You can find the script here:

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

You can find Wikipedia dumps here:

https://dumps.wikimedia.org/backup-index.html

Navigate to the page for the language you are interested in; there will be a link called <language code>wiki.

You want the file that ends in -pages-articles.xml.bz2, e.g. euwiki-20160701-pages-articles.xml.bz2.
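The dump filename above follows a regular pattern, so the download URL can be built from the language code and dump date. A minimal sketch, assuming the usual dumps.wikimedia.org directory layout (the URL pattern is an assumption, not taken from this page):

```python
# Hypothetical sketch: build the dump download URL for Basque ("eu")
# and a given dump date. The URL layout below is an assumption about
# how dumps.wikimedia.org organises its files.
lang = "eu"
date = "20160701"
filename = "{0}wiki-{1}-pages-articles.xml.bz2".format(lang, date)
url = "https://dumps.wikimedia.org/{0}wiki/{1}/{2}".format(lang, date, filename)
print(url)
```

You can then fetch the printed URL with your downloader of choice (e.g. wget).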

Run it like this:

$ python3 WikiExtractor.py --infn euwiki-20160701-pages-articles.xml.bz2

It will print a lot of output (the article titles) and write a file called wiki.txt. This is your corpus.
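Once wiki.txt exists, a quick way to check that the extraction worked is to count its lines and tokens. A small sketch (the corpus_stats helper is illustrative, not part of WikiExtractor):

```python
def corpus_stats(path):
    """Count lines and whitespace-separated tokens in a text corpus."""
    lines = 0
    tokens = 0
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            lines += 1
            tokens += len(line.split())
    return lines, tokens

# After running WikiExtractor:
#   corpus_stats("wiki.txt")
```

A near-empty result usually means the wrong dump file was used (e.g. a -stub- or -meta- dump instead of -pages-articles).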
