Wikipedia dumps

From Apertium
Revision as of 07:40, 10 April 2019 by Ilnar.salimzyan

Wikipedia dumps are a quick way to get a corpus. They are also the best corpora for making your language pair useful for Wikipedia's Content Translation tool :-)

You download them from

Tools to turn dumps into plaintext

There are several tools for turning dumps into useful plaintext, e.g.

Content Translation dumps

There are also dumps of the articles translated with the Content Translation tool, which uses Apertium (and other MT engines) under the hood:

To turn a TMX file into a tab-separated SOURCE\tMT\tGOLD text file (GOLD being the segment marked "user" in the TMX), install xmlstarlet (sudo apt install xmlstarlet) and run:

$ zcat ~/Nedlastingar/cx-corpora.nb2nn.text.tmx.gz          \
  | xmlstarlet  sel -t -m '//tu[tuv/prop/text()="mt"]'      \
       -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t' \
       -c 'tuv[./prop/text()="mt"]/seg/text()'     -o $'\t' \
       -c 'tuv[./prop/text()="user"]/seg/text()'   -n       \
  > nb2nn.tsv
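
If a segment itself contains a tab or a line break, the resulting rows will not line up. A minimal sanity check, assuming the three-column layout described above (check_three_columns is a hypothetical helper name, not part of any Apertium tool), could look like this:

```shell
# Hypothetical helper: report every row that does not have exactly
# three tab-separated fields (SOURCE, MT, GOLD).
# Returns non-zero if at least one such row is found.
check_three_columns() {
  awk -F'\t' '
    NF != 3 { bad++; printf "line %d has %d field(s)\n", NR, NF }
    END     { exit (bad > 0) }
  ' "$1"
}
```

Run it as check_three_columns nb2nn.tsv; no output means every row has three fields.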

Now view the word diff between MT and GOLD with:

$ diff -U0 <(cut -f2 nb2nn.tsv) <(cut -f3 nb2nn.tsv) | dwdiff --diff-input -c | less

The original source text, if you need it, is in the first column: cut -f1 nb2nn.tsv.
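
For a rough idea of how much post-editing the MT output needed, you can count the rows where MT and GOLD are identical. This is only a sketch under the assumption that each row holds one segment in the SOURCE, MT, GOLD order used above; count_unchanged is a hypothetical helper name.

```shell
# Hypothetical helper: count rows where the MT output (column 2)
# is identical to the GOLD text (column 3).
count_unchanged() {
  awk -F'\t' '
    # Appending "" forces a string comparison, so "1.0" and "1" differ.
    NF >= 3 { total++; if (($2 "") == ($3 "")) same++ }
    END     { printf "%d of %d segments unchanged\n", same, total }
  ' "$1"
}
```

Run it as count_unchanged nb2nn.tsv. Rows with fewer than three fields are skipped rather than counted.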

For some language pairs there is no dedicated file, and you have to get the _2CODE or _2_ file instead, e.g. sv2da is in https://dumps.wikimedia.org/other/contenttranslation/20180810/cx-corpora._2da.text.tmx.gz and da2sv is in https://dumps.wikimedia.org/other/contenttranslation/20180810/cx-corpora._2_.text.tmx.gz. These files mix several language pairs, so filter on the languages you want:

$ zcat ~/Nedlastingar/cx-corpora._2_.text.tmx.gz                             \
 | xmlstarlet sel -t                                                         \
    -m '//tu[@srclang="da" and tuv/prop/text()="mt" and tuv/@xml:lang="sv"]' \
    -c 'tuv[./prop/text()="source"]/seg/text()' -o $'\t'                     \
    -c 'tuv[./prop/text()="mt"]/seg/text()' -o $'\t'                         \
    -c 'tuv[./prop/text()="user"]/seg/text()' -n                             \
    > da2sv.tsv
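
To see which source languages a combined file contains before filtering, something like the following may help. It is a rough sketch that greps the raw XML instead of parsing it, and it assumes that each tu start tag sits on one line with srclang written in double quotes; count_source_languages is a hypothetical helper name.

```shell
# Hypothetical helper: count translation units per source language
# in a gzipped TMX file, most frequent first.
# Only <tu ...> start tags are matched, so a srclang attribute on
# other elements (e.g. the TMX header) is not counted.
count_source_languages() {
  gzip -dc "$1" \
    | grep -o '<tu [^>]*srclang="[^"]*"' \
    | sed 's/.*srclang="\([^"]*\)".*/\1/' \
    | sort | uniq -c | sort -rn
}
```

Run it as count_source_languages ~/Nedlastingar/cx-corpora._2_.text.tmx.gz; each output line is a count followed by a language code. gzip -dc is used here as a portable equivalent of zcat.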