Corpus test

From Apertium
Revision as of 20:16, 23 July 2021 by Popcorndude (talk | contribs)


Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful if you want to change a rule or word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.

Everything listed below can also be done automatically using Apertium-regtest.

Creation of a corpus

Before you start, you need a corpus. Look at apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt to unpack it) to get an idea of how it should look. To prepare your own:

  • Use grep to remove all lines in the original corpus containing # or @, as these symbols are used in Apertium for marking errors in the bilingual dictionary and transfer.
    • e.g. grep -v '[@#]' original-corpus > clean-corpus
  • Use the command nl -s '. ' to number the lines in the corpus.
    • e.g. nl -s '. ' clean-corpus > clean-numbered-corpus
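
The two steps above can be combined into a single pipeline. The following is a minimal sketch; the function name make_corpus is hypothetical (it is not part of Apertium), and it assumes a raw corpus with one sentence per line:

```shell
#!/bin/bash
# Hypothetical helper: clean and number a raw corpus in one step.
# Usage: make_corpus original-corpus > clean-numbered-corpus
make_corpus() {
    # Drop every line containing @ or #, then number the remaining
    # lines in the "N. text" form used by the examples on this page.
    grep -v '[@#]' "$1" | nl -s '. '
}
```

Note that nl leaves empty lines unnumbered by default, so a corpus with blank lines will keep them without a number.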

testcorpus script

Installation and invocation

Copy the testcorpus script from apertium-eo-en and change the language names in it.

To start, run the script with bash on a Unix machine.

Lines containing @ or # will be shown. These indicate .dix problems, many of which could also be found by the testvoc method.

Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry starts with the line number, followed by the original text, then a line beginning with < holding the translation from last time, and finally a line beginning with > holding the translation from this time:

--- 1924 ---
  1924. In Japan there is an input system allowing you to type kanji.
<   1924.       En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
>   1924.       En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.

--- 1937 ---
  1937. However, such apparent simplifications can perversely make a script more complicated.
<   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
>   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.

Simple corpus diff

You don't necessarily need a script. Just type

 make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt 

to make the 'original translation'. Then change your .dix files and run the command again, writing to a new file:

 make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &

The & sign makes the process run in the background, so you can examine the differences in translation output before all the sentences are translated. The commands below produce such a diff and show it along with the original text:

  diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt && 
  for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do 
    echo  --- $i ---; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt; 
  done | less

The first command creates a simple diff, while the for loop goes through each change, and tries to match the line from the original corpus up with the line that had the change.
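
The same steps can be wrapped in a function so that the file names are not hard-coded. This is only a sketch: the name corpus_diff is hypothetical, and it assumes all three files carry line numbers in the format produced by nl -s '. ' (a right-aligned number of at most six characters, followed by a period).

```shell
#!/bin/bash
# Hypothetical wrapper around the diff-and-loop commands.
# Usage: corpus_diff source-corpus old-translation new-translation
corpus_diff() {
    local src=$1 old=$2 new=$3 tmp i
    tmp=$(mktemp)
    # Keep only the changed lines (those starting with < or >).
    diff -w "$old" "$new" | grep '^[<>]' > "$tmp"
    # Columns 3-8 of each diff line hold the line number written by nl.
    for i in $(cut -c3-8 "$tmp" | sort -un); do
        echo "--- $i ---"
        grep "^ *$i\." "$src"     # the source sentence
        grep "^. *$i\." "$tmp"    # old (<) and new (>) translations
    done
    rm -f "$tmp"
}
```

It could then be called as corpus_diff corpa/en.crp.txt origina_traduko.txt nova_traduko.txt | less.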

Take it further: word diffs

dwdiff (sudo apt-get install dwdiff on Ubuntu, sudo pacman -S dwdiff on Arch Linux) is a program that takes diff input and finds word-changes, so that instead of

< Fruit flies enjoy a banana
> Fruit flies like a banana

you get

Fruit flies [-enjoy-] {+like+} a banana

Coupled with colour output (the -c option, available in newer versions), huge corpus diffs become a lot more readable.

If there are not many changes, dwdiff by itself will still display the unchanged lines. Since it can read output from diff itself in the "unified diff" format, one of the best ways of using dwdiff in this case is:

$ diff -U1 origina_traduko.txt nova_traduko.txt | dwdiff -c --diff-input

The -U1 gives unified diff output with only one line of context before and after each change (try -U0, -U10, etc.), the -c gives you nice colours, and --diff-input makes dwdiff treat its input (here stdin) as diff output instead of expecting two files to compare[1].
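
If dwdiff is not installed on every machine you test on, the command can be wrapped so that it degrades to a plain unified diff. A minimal sketch follows; the function name corpus_wdiff is hypothetical, and -c is left out so that the output can be piped into other tools:

```shell
#!/bin/bash
# Hypothetical helper: word diff if dwdiff is available, line diff otherwise.
# Usage: corpus_wdiff old-translation new-translation
corpus_wdiff() {
    if command -v dwdiff >/dev/null 2>&1; then
        # dwdiff reads the unified diff from stdin and marks changed words
        diff -U1 "$1" "$2" | dwdiff --diff-input
    else
        # Fallback: plain unified diff with one line of context
        diff -U1 "$1" "$2"
    fi
}
```

Both diff and dwdiff return a non-zero exit status when the files differ, so the function does too.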

Seeing the forest in all the trees

The following command gives you a hit parade (frequency list) of word-based changes in the translation output:

$ diff -U0 origina_traduko.txt nova_traduko.txt | dwdiff --diff-input |grep -v '^@' |sed 's/.*\[-//'|sed 's/+}.*//'  | sort|uniq -c|sort -nr

This lets you deal with the highest-frequency changes first.
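
To see what the grep and sed stages do, the tail of the pipeline can be tried on hand-written lines in dwdiff's output format. This is a sketch: the function name count_changes is hypothetical, and the sample lines in the usage comment are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read dwdiff output on stdin, print a frequency list.
count_changes() {
    grep -v '^@' |          # drop the @@ hunk headers
        sed 's/.*\[-//' |   # strip everything up to the last "[-"
        sed 's/+}.*//' |    # strip everything from the first "+}" onwards
        sort | uniq -c | sort -nr
}

# Usage:
#   printf '%s\n' 'Fruit flies [-enjoy-] {+like+} a banana' \
#                 'Time flies [-enjoy-] {+like+} an arrow' | count_changes
```

Each change is reduced to a key such as enjoy-] {+like, so identical word substitutions are counted together. A line containing several changes contributes only one key, and lines with a pure insertion or pure deletion are only partly trimmed.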

Helper script: diffing while translating

When you run two corpus translations in the background, you often want to diff them as they run. However, you'll get a bunch of extra lines at the bottom of your diff from whichever translation task has gotten furthest. Save the following script to a file to get a diff that is limited to the length of the shorter file:

#!/bin/bash
if [ $# -lt 2 ]; then echo "Usage: $0 file1 file2 [additional options to diff]"; exit 1; fi

# M is the number of lines in the shorter of the two files
A=$(wc -l < "$1"); B=$(wc -l < "$2")
M=$(( A < B ? A : B ))

# "${@:3}" means all args after first and second
diff "${@:3}" <(head -n "$M" "$1") <(head -n "$M" "$2")

Call it "mindiff" or something, make it executable and put it somewhere in your PATH, and you can keep running mindiff origina_traduko.txt nova_traduko.txt |tail to check on new differences as they come in. Or you can do

mindiff origina_traduko.txt nova_traduko.txt -U0|tail |dwdiff --diff-input

as shown above.


  1. I use a wrapper that turns -c on by default when printing to a terminal, and turns --diff-input on automatically when reading from a pipe.
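
A wrapper of that kind might look roughly like the following. This is a guess at the behaviour described, not the wrapper referred to in the note; the function name dw and the DWDIFF override variable are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: choose dwdiff options based on stdin and stdout.
dw() {
    local opts=()
    # stdout is a terminal: switch colour output on
    [ -t 1 ] && opts+=(-c)
    # stdin is a pipe: assume it carries unified diff output
    [ -p /dev/stdin ] && opts+=(--diff-input)
    # DWDIFF (an assumption of this sketch) lets you substitute another binary
    "${DWDIFF:-dwdiff}" "${opts[@]}" "$@"
}
```

With this in your shell startup file, diff -U1 origina_traduko.txt nova_traduko.txt | dw would behave like the full command shown earlier.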