Difference between revisions of "Wikipedia dumps"
Revision as of 16:41, 26 September 2016
Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)
You download them from
There are several tools for turning dumps into useful plaintext, e.g.
- Wikipedia Extractor – a Python script that tries to remove all formatting
- mwdump-to-pandoc – shell wrapper around pandoc (see the usage.sh below the script for how to use)
- Calculating_coverage#More_involved_scripts – an ugly shell script that does the job
- wp2txt – a Ruby tool (does this work?)
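
If none of the tools above fits, the first step they all share (pulling the raw wikitext of each page out of the dump) can be sketched with the Python standard library alone. This is a minimal sketch, not how any of the listed tools actually work: it assumes the dump is a bz2-compressed file in the standard MediaWiki XML export format, and the function names (local_name, iter_pages, iter_dump) are made up for illustration. It does not remove wiki markup; the output still needs cleaning by one of the tools above.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # An empty <text/> element has elem.text == None
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            # Drop the finished page's children to limit memory use
            elem.clear()


def iter_dump(path):
    """Open a bz2-compressed dump at `path` (hypothetical) and iterate its pages."""
    with bz2.open(path, "rb") as f:
        yield from iter_pages(f)
```

Calling iter_dump on a downloaded dump file and writing each text to a file gives a raw corpus, one page at a time, without loading the whole dump into memory.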