Difference between revisions of "Wikipedia dumps"

From Apertium
Line 4:
  * http://dumps.wikimedia.org/backup-index.html
+ ==Tools to turn dumps into plaintext==
  
  There are several tools for turning dumps into useful plaintext, e.g.
Line 13:
  * [http://wp2txt.rubyforge.org/ wp2txt] – some ruby thing (does this work?)
+ ==Content Translation dumps==
- There are also dumps of the articles translated with the Content Translation tool (which uses apertium (and other MT engines) under the hood):
+ There are also dumps of the articles translated with the Content Translation tool, which uses Apertium (and other MT engines) under the hood:
  * https://www.mediawiki.org/wiki/Content_translation/Published_translations
+
+ Turn tmx into a "SOURCE\tMT\tGOLD" tsv:
+ <pre>
+ $ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz \
+ | xmlstarlet sel -R -t -m '//tu[tuv/prop/text()="mt"]' -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t' -c 'tuv[./prop/text()="mt"]/seg/text()' -o $'\t' -c 'tuv[./prop/text()="user"]/seg/text()' -n
+ </pre>
  [[Category:Resources]]

Revision as of 13:46, 15 November 2018

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from:

* http://dumps.wikimedia.org/backup-index.html

Tools to turn dumps into plaintext

There are several tools for turning dumps into useful plaintext, e.g.

* wp2txt (http://wp2txt.rubyforge.org/), a Ruby tool (not confirmed to work)

Content Translation dumps

There are also dumps of the articles translated with the Content Translation tool, which uses Apertium (and other MT engines) under the hood:

* https://www.mediawiki.org/wiki/Content_translation/Published_translations

To turn the TMX file into a "SOURCE\tMT\tGOLD" TSV (one tab-separated line per translation unit that has an MT segment):

$ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz \
  | xmlstarlet sel -R -t \
      -m '//tu[tuv/prop/text()="mt"]' \
      -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t' \
      -c 'tuv[./prop/text()="mt"]/seg/text()' -o $'\t' \
      -c 'tuv[./prop/text()="user"]/seg/text()' -n
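
Once the TSV exists, a natural next step is to look only at the segments where the human translator changed the MT output. The following is a minimal sketch, assuming the three-column SOURCE, MT, GOLD layout produced by the command above. The function name `changed_rows` and the sample rows are made up for illustration and are not part of the original page.

```shell
# Hypothetical helper: read a SOURCE<TAB>MT<TAB>GOLD file on stdin and
# print only the rows where the MT text differs from the GOLD text.
changed_rows() {
    # Appending "" forces awk to compare the fields as strings, so that
    # numeric-looking segments such as "1.0" and "1" are not treated as equal.
    # Rows that do not have exactly three fields are skipped.
    awk -F'\t' 'NF == 3 && ($2 "") != ($3 "")'
}

# Made-up sample rows standing in for the real TSV.
printf 'en katt\tein katt\tein katt\nen hund\tein hund\tein bikkje\n' | changed_rows
```

With the sample input only the second row is printed, because its MT and GOLD columns differ. On real data the function could be fed the saved output of the xmlstarlet pipeline instead of the `printf` sample.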