Difference between revisions of "Wikipedia dumps"

From Apertium
(+ wikiextractor)
Line 12:
* [[Calculating_coverage#More_involved_scripts]] – an ugly shell script that does the job
* [http://wp2txt.rubyforge.org/ wp2txt] – some ruby thing (does this work?)
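For a rough idea of what such tools do, here is a minimal Python sketch, not a replacement for any of them. It assumes the dump is in the standard MediaWiki XML export format (<code>page</code> elements containing <code>title</code> and <code>text</code>); the names <code>iter_pages</code>, <code>crude_strip</code> and <code>SAMPLE</code> are made up for this illustration. The markup stripping is deliberately crude: it handles only non-nested templates, simple links and bold/italic quotes, and ignores tables, references and nested templates.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix that export files put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page; dumps are large


def crude_strip(wikitext):
    """Very rough markup removal (assumption: no nested templates)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # {{template}} -> nothing
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold/italic quote runs
    return text.strip()


# Tiny hand-written sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apertium</title>
    <revision>
      <text>'''Apertium''' is a [[machine translation|MT]] platform.{{stub}}</text>
    </revision>
  </page>
</mediawiki>"""

for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, "=>", crude_strip(page_text))
```

For a real compressed dump you would pass something like <code>bz2.open("dump.xml.bz2", "rb")</code> (hypothetical file name) instead of the <code>io.BytesIO</code> sample. For anything beyond a quick look, one of the dedicated tools listed above is likely the better choice.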

There are also dumps of the articles translated with the Content Translation tool (which uses Apertium and other MT engines under the hood):

* https://www.mediawiki.org/wiki/Content_translation/Published_translations


[[Category:Resources]]

Revision as of 17:42, 6 August 2018

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from


There are several tools for turning dumps into useful plaintext, e.g.

There are also dumps of the articles translated with the Content Translation tool (which uses Apertium and other MT engines under the hood):