Difference between revisions of "Wikipedia dumps"
		
		
		
		
		
		
Revision as of 13:25, 30 January 2016
Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)
You download them from
There are several tools for turning dumps into useful plaintext, e.g.
- Wikipedia Extractor – a Python script that tries to remove all formatting
- mwdump-to-pandoc (https://gist.github.com/unhammer/3372222878580d1e4c6f) – a shell wrapper around pandoc (http://pandoc.org/); see the usage.sh below the script for how to use it
- wp2txt (http://wp2txt.rubyforge.org/) – a Ruby tool (does this work?)
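
If you just want to see what is inside a dump before picking one of the tools above, a minimal Python sketch like the following can stream through the pages. It assumes the usual MediaWiki XML export layout (page elements holding a title and a revision with a text element); the helper names iter_pages, local_name and open_dump are made up for this illustration, and the output is still raw wikitext, so removing the markup remains the job of the tools listed above.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield (title, wikitext) for each <page> element in a dump stream.

    Assumption: one revision per page (as in the "pages-articles" style of
    dump). With full-history dumps this keeps only the last revision seen.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


def open_dump(path):
    """Open a bz2-compressed dump for streaming (path is whatever you downloaded)."""
    return bz2.open(path, "rb")


# Tiny in-memory stand-in for a real dump, just to show the shape of the output.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[page]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, len(wikitext))
```

With a real dump you would pass open_dump("some-dump.xml.bz2") (a placeholder file name) to iter_pages instead of the in-memory sample; because the file is read as a stream, it never has to be fully decompressed on disk.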

