Wikipedia dumps


Wikipedia dumps are useful for quickly getting a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from

Tools to turn dumps into plaintext

There are several tools for turning dumps into useful plaintext, e.g.

Content Translation dumps

There are also dumps of the articles translated with the Content Translation tool, which uses Apertium (and other MT engines) under the hood:


To turn a TMX file into a SOURCE\tMT\tGOLD tab-separated text file, install xmlstarlet (sudo apt install xmlstarlet) and do:

$ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz          \
  | xmlstarlet  sel -t -m '//tu[tuv/prop/text()="mt"]'      \
       -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t' \
       -c 'tuv[./prop/text()="mt"]/seg/text()'     -o $'\t' \
       -c 'tuv[./prop/text()="user"]/seg/text()'   -n       \
  > nb2nn.tsv
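
Before diffing, it can be worth checking that every row really has three tab-separated fields. This is only a sketch: it assumes that a segment containing a literal tab or newline is what would break the layout, and tsv_bad_rows is a made-up helper name, not part of any tool mentioned here.

```shell
# Hypothetical helper: for every row that does not have exactly three
# tab-separated fields, print its line number and its field count.
tsv_bad_rows() {
  awk -F'\t' 'NF != 3 { printf "%d\t%d\n", NR, NF }' "$1"
}
```

Run it as tsv_bad_rows nb2nn.tsv; no output means every row has the expected SOURCE, MT and GOLD columns.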

Now view the word diff between MT and GOLD with:

$ diff -U0 <(cut -f2 nb2nn.tsv) <(cut -f3 nb2nn.tsv) | dwdiff --diff-input -c | less

The original source text, if you need it, is in the first column: cut -f1 nb2nn.tsv
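
For a rough idea of how often the MT output was kept as-is, you can count the rows where the MT and GOLD columns are identical. This is a minimal sketch, assuming the SOURCE\tMT\tGOLD layout produced above; mt_unchanged is a hypothetical helper name.

```shell
# Hypothetical helper: print "unchanged<TAB>total" for a SOURCE\tMT\tGOLD file.
# Appending "" forces awk to compare the fields as strings, so that
# number-like segments such as "1.0" and "1" are not treated as equal.
mt_unchanged() {
  awk -F'\t' '($2 "") == ($3 "") { same++ } END { printf "%d\t%d\n", same, NR }' "$1"
}
```

Run it as mt_unchanged nb2nn.tsv. The first number is the count of segments the user left untouched, the second is the total number of rows.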