Difference between revisions of "Wikipedia dumps"

Revision as of 13:56, 15 November 2018

Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You can download them from:

http://dumps.wikimedia.org/backup-index.html
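
As a rough sketch, a single wiki's dump can also be fetched from the command line. This assumes the usual <lang>wiki/latest/ layout on dumps.wikimedia.org and the pages-articles file. dump_url and fetch_dump are hypothetical helper names, and nn is only an example language code.

```shell
# Sketch only: assumes the <lang>wiki/latest/ layout on dumps.wikimedia.org.
# dump_url and fetch_dump are hypothetical helper names, not part of any tool.
dump_url () {
    # $1 = wiki language code, e.g. nn
    printf 'https://dumps.wikimedia.org/%swiki/latest/%swiki-latest-pages-articles.xml.bz2\n' "$1" "$1"
}

fetch_dump () {
    # -c resumes a partial download, which helps with large dumps
    wget -c "$(dump_url "$1")"
}

# Usage (nothing is downloaded until you call it):
#   fetch_dump nn
```

If the file is missing for your wiki, browse the index page above and pick the exact file name from there.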

Tools to turn dumps into plaintext

There are several tools for turning dumps into useful plaintext, e.g.

Wikipedia Extractor – a Python script that tries to remove all formatting
wikiextractor – another Python script that removes all formatting (with different options), adding XML marks only to show where each article begins and ends
mwdump-to-pandoc – shell wrapper around pandoc (see the usage.sh below the script for how to use)
Calculating_coverage#More_involved_scripts – an ugly shell script that does the job
wp2txt – some ruby thing (does this work?)
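
If you use wikiextractor, the XML marks around each article can be dropped afterwards to get bare text. A minimal sketch, assuming the marks are <doc ...> and </doc> lines on their own (check your output first). strip_doc_marks is a hypothetical helper name and the path in the usage line is only an example.

```shell
# Sketch only: assumes each article is wrapped in lines starting with
# "<doc " and "</doc>". strip_doc_marks is a hypothetical helper name.
strip_doc_marks () {
    # Drop the opening and closing article marks, keep everything else.
    grep -v -E '^</?doc[ >]' "$@"
}

# Usage (example path):
#   strip_doc_marks extracted/AA/wiki_00 > plain.txt
```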

Content Translation dumps

There are also dumps of the articles translated with the Content Translation tool, which uses Apertium (and other MT engines) under the hood:

https://www.mediawiki.org/wiki/Content_translation/Published_translations

Turn tmx into a "SOURCE\tMT\tGOLD" tsv:

$ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz          \
  | xmlstarlet  sel -t -m '//tu[tuv/prop/text()="mt"]'      \
       -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t' \
       -c 'tuv[./prop/text()="mt"]/seg/text()'     -o $'\t' \
       -c 'tuv[./prop/text()="user"]/seg/text()'   -n       \
  > nb2nn.tsv
# Now view the word diff between MT and GOLD with:
$ diff -U0 <(cut -f2 nb2nn.tsv) <(cut -f3 nb2nn.tsv) | dwdiff -c | less
# and find the original if you need it in `cut -f1 nb2nn.tsv`
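
To look only at the segments that the translator actually changed, the tsv can be filtered on column 2 differing from column 3. A minimal sketch, assuming one segment per line with no tabs or newlines inside a segment. changed_rows is a hypothetical helper name.

```shell
# Sketch only: changed_rows is a hypothetical helper name.
changed_rows () {
    # Print rows where MT (column 2) differs from GOLD (column 3).
    # Appending "" forces a string comparison, so "1.0" and "1" count as different.
    awk -F '\t' '($2 "") != ($3 "")' "$@"
}

# Usage:
#   changed_rows nb2nn.tsv | wc -l     # how many segments were post-edited
#   changed_rows nb2nn.tsv | cut -f1   # their source segments
```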
