Difference between revisions of "Wikipedia dumps"

From Apertium
(Created page with "Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair are useful for Wikipedia's Content Translation tool :-) ...")
 
Line 9:

* [[Wikipedia Extractor]] – a python script that tries to remove all formatting

* [https://gist.github.com/unhammer/3372222878580d1e4c6f mwdump-to-pandoc] – shell wrapper around [http://pandoc.org/ pandoc] (see the usage.sh below the script for how to use)

+ * [http://wp2txt.rubyforge.org/ wp2txt] – some ruby thing (does this work?)

[[Category:Resources]]

Revision as of 13:25, 30 January 2016

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from


There are several tools for turning dumps into useful plaintext, e.g.