Wikipedia dumps

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)
You can download them from https://dumps.wikimedia.org/
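For a single wiki, a minimal Python sketch of fetching the latest pages-articles dump (this assumes the usual dumps.wikimedia.org file layout and naming; the language code is just an example):

<pre>
#!/usr/bin/env python3
# Fetch the latest pages-articles dump for one wiki.
# Assumes the usual dumps.wikimedia.org layout; change LANG as needed.
import shutil
import urllib.request

LANG = "ca"  # wiki language code ("ca" is just an example)
url = ("https://dumps.wikimedia.org/%swiki/latest/%swiki-latest-pages-articles.xml.bz2"
       % (LANG, LANG))
out = "%swiki-latest-pages-articles.xml.bz2" % LANG

with urllib.request.urlopen(url) as resp, open(out, "wb") as f:
    shutil.copyfileobj(resp, f)  # stream to disk without loading the whole dump in memory
print("saved", out)
</pre>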
There are several tools for turning dumps into useful plaintext (a rough sketch of the general idea follows the list), e.g.
* Wikipedia Extractor – a python script that tries to remove all formatting
* wikiextractor – another python script that removes all formatting (with different options), adding XML marks just to show where every single article begins and ends
* mwdump-to-pandoc – a shell wrapper around pandoc (see the usage.sh below the script for how to use it)
* [[Calculating_coverage#More_involved_scripts]] – an ugly shell script that does the job
* [http://wp2txt.rubyforge.org/ wp2txt] – some ruby thing (does this work?)
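As a very rough illustration of what these extractors do (only an illustration: the XML namespace string varies between dump versions, and the regexes below ignore multi-line templates, tables, references, etc.), here is a standard-library Python sketch that streams pages out of a pages-articles dump and strips the most common markup:

<pre>
#!/usr/bin/env python3
# Crude dump-to-plaintext sketch: stream pages out of a .xml.bz2 dump
# and strip the most common wiki markup.  Real extractors (see the list above)
# handle templates, tables and references far more thoroughly.
import bz2
import re
import sys
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace differs between dump versions

def plaintext(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)             # simple, non-nested templates
    text = re.sub(r"\[\[[^|\]]*\|([^\]]*)\]\]", r"\1", text)   # [[target|label]] -> label
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)            # [[target]] -> target
    text = re.sub(r"'{2,}", "", text)                          # bold / italics quotes
    text = re.sub(r"<[^>]+>", "", text)                        # stray html/xml tags
    return text

def pages(path):
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                body = elem.findtext("./%srevision/%stext" % (NS, NS)) or ""
                yield title, plaintext(body)
                elem.clear()  # keep memory bounded on big dumps

if __name__ == "__main__":
    for title, text in pages(sys.argv[1]):
        print("= %s =" % title)
        print(text)
</pre>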
There are also dumps of the articles translated with the Content Translation tool (which uses Apertium (and other MT engines) under the hood):
* https://www.mediawiki.org/wiki/Content_translation/Published_translations
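Assuming TMX exports are among the files linked from that page (this is an assumption; the exact formats on offer may differ), a standard-library Python sketch for pulling out the aligned segment pairs could look like this:

<pre>
#!/usr/bin/env python3
# Sketch: read aligned segments out of a (possibly gzipped) TMX file,
# e.g. one of the Content Translation published-translations exports.
# That such TMX exports exist is an assumption -- check the link above.
import gzip
import sys
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # TMX marks languages with xml:lang on <tuv>

def segments(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        tree = ET.parse(f)
    for tu in tree.iter("tu"):            # one translation unit per aligned segment
        pair = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            pair[lang] = tuv.findtext("seg") or ""
        yield pair

if __name__ == "__main__":
    for pair in segments(sys.argv[1]):
        print(pair)
</pre>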
[[Category:Resources]]