Difference between revisions of "Wikipedia dumps"

From Apertium
(+ wikiextractor)


* [[Wikipedia Extractor]] – a python script that tries to remove all formatting
* [https://github.com/attardi/wikiextractor wikiextractor] – another python script that removes all formatting (with different options), inserting XML tags to mark where each article begins and ends
* [https://gist.github.com/unhammer/3372222878580d1e4c6f mwdump-to-pandoc] – shell wrapper around [http://pandoc.org/ pandoc] (see the usage.sh below the script for how to use)
* [[Calculating_coverage#More_involved_scripts]] – an ugly shell script that does the job
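
Once one of the tools above has stripped the formatting, the per-article XML marks still need to be removed to get a plain corpus. The following is a minimal sketch of that step, assuming the marks look like <code>&lt;doc id="..." url="..." title="..."&gt; ... &lt;/doc&gt;</code>; that marker format, and the helper names <code>iter_articles</code> and <code>to_plaintext</code>, are assumptions made for illustration, so check them against the output of the tool you actually run.

```python
import re
from typing import Iterator, Tuple

# ASSUMPTION: each article is wrapped as
#   <doc id="..." url="..." title="..."> ... </doc>
# Adjust the patterns if your extractor writes different marks.
DOC_RE = re.compile(r'<doc\b([^>]*)>\n?(.*?)</doc>', re.DOTALL)
TITLE_RE = re.compile(r'title="([^"]*)"')


def iter_articles(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) pairs for every <doc>...</doc> block in text."""
    for match in DOC_RE.finditer(text):
        attrs, body = match.group(1), match.group(2)
        title_match = TITLE_RE.search(attrs)
        title = title_match.group(1) if title_match else ""
        yield title, body.strip()


def to_plaintext(text: str) -> str:
    """Drop the article marks and join the bodies with blank lines."""
    return "\n\n".join(body for _, body in iter_articles(text))


if __name__ == "__main__":
    # Hypothetical sample input, not real dump content.
    sample = (
        '<doc id="1" url="http://example.org/1" title="Foo">\n'
        "Foo\n\nFoo is a thing.\n</doc>\n"
    )
    print(to_plaintext(sample))
```

Keeping the titles available through <code>iter_articles</code> makes it easy to filter out unwanted pages before the text goes into a corpus.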

Revision as of 07:41, 15 May 2018

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from


There are several tools for turning dumps into useful plaintext, e.g.