Corpus test

From Apertium
Revision as of 10:58, 5 January 2012 by Unhammer (talk | contribs) (since there may be other scripts too)

Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful if you want to change a rule or word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.

Creation of a corpus

Before you start, you need a corpus. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt) to get an idea of how it should look. When preparing your own:

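  • Grep out (remove) all lines containing # or @. These symbols then only show up in the output when the translator puts them there, which helps you find problems in the bidix (@) and the target-language monodix (#).
  • Pipe the text through nl -s '. ' to get the right line numbers.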
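
The two steps above can be combined into one pipeline. This is only a minimal sketch: the function name make_corpus and the input file name raw.txt are made up for illustration, and the actual enwiki corpus may have been prepared differently.

```shell
# Hypothetical helper: read raw text on stdin, write a numbered corpus on stdout.
make_corpus() {
    # Drop every line that already contains '#' or '@', then number the
    # remaining lines in the "  1924. text" style used by the example corpus.
    grep -v '[#@]' | nl -s '. '
}
```

It could be used as make_corpus < raw.txt > corpa/en.crp.txt. Note that nl leaves empty lines unnumbered by default.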

testcorpus script

Installation and invocation

Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it to match your own language pair.

To start, type bash regression-tests.sh on a Unix box.

Output

Lines containing @ and # are shown. These indicate problems in the .dix files, many of which could also be found with the testvoc method.
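
The same lines can also be pulled out of any translated file by hand. A minimal sketch, assuming the translation has been saved to a file; the function name dix_problems is hypothetical and is not part of the testcorpus script:

```shell
# Hypothetical helper: print the translated lines that contain '@' or '#'.
# $1 = file holding the translated corpus.
dix_problems() {
    grep '[@#]' "$1"
}
```

For example, dix_problems nova_traduko.txt | less lists the affected sentences, and piping into wc -l instead gives a rough count to compare before and after a change.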

Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry gives the line number, then the original text, then a line beginning with < holding the translation from last time, and finally a line beginning with > holding the translation from this time:

--- 1924 ---
  1924. In Japan there is an input system allowing you to type kanji.
<   1924.       En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
>   1924.       En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.

--- 1937 ---
  1937. However, such apparent simplifications can perversely make a script more complicated.
<   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
>   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.

Doing without a script

You don't necessarily need a script. Just type

 make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt 

to make the 'original translation' (origina_traduko.txt). Then change your .dix files and run the translation again, this time writing to nova_traduko.txt:

 make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &

The & sign makes the process run in the background. This means you can examine the differences before all the sentences are translated, with:

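  # Keep only the "< old" and "> new" lines of the diff. Then, for each changed
  # line number (columns 3-8, as written by nl), show the source sentence
  # followed by both translations.
  diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt &&
  for i in $(cut -c3-8 /tmp/crpdiff.txt | sort -un); do
    echo "--- $i ---"; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt
  done | less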
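
While the new translation is still being written, a plain diff also reports every sentence that has not been translated yet as a difference. One way around this is to compare only the lines that are already complete in the new file. The following is a minimal sketch under that assumption; the function name partial_diff is hypothetical.

```shell
# Hypothetical helper: diff only the part of the corpus translated so far.
# $1 = finished old translation, $2 = new translation still being written.
partial_diff() {
    # Number of complete (newline-terminated) lines in the new file.
    n=$(( $(wc -l < "$2") ))
    tmp=$(mktemp)
    # Snapshot the complete lines of the new file, dropping a half-written one.
    head -n "$n" "$2" > "$tmp"
    # Compare against the same number of lines from the old translation.
    head -n "$n" "$1" | diff -w - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Calling partial_diff origina_traduko.txt nova_traduko.txt prints an ordinary diff, so its output can be filtered in the same way as above.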

Take it further: word diffs

dwdiff (sudo apt-get install dwdiff on Ubuntu, sudo pacman -S dwdiff on Arch Linux) is a program that finds word-level changes, so that instead of

1c1
< Fruit flies enjoy a banana
---
> Fruit flies like a banana

you get

Fruit flies [-enjoy-] {+like+} a banana

Coupled with colour output (the -c option, available in newer versions), huge corpus diffs become a lot more readable.

Even if there are not many changes, dwdiff by itself still displays all the unchanged lines. Since it can read output from diff itself in the "unified diff" format, one of the best ways of using dwdiff in this case is:

$ diff -U1 corpus.yesterday.out corpus.today.out | dwdiff -c --diff-input

The -U1 option gives unified diff output with only one line of context before and after each change (try -U0, -U10, etc.), -c gives you nice colours, and --diff-input makes dwdiff read the diff from stdin instead of expecting two files[1].

Notes

  1. I use this wrapper that turns -c on by default if printing to a terminal, and turns --diff-input on automatically if reading from a pipe
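
A minimal sketch of such a wrapper, written as a shell function. This is not the author's script: the names dwdiff_auto and DWDIFF are made up for illustration, and the pipe check relies on /dev/stdin being available, as it is on Linux.

```shell
# Hypothetical wrapper around dwdiff, e.g. defined in ~/.bashrc.
dwdiff_auto() {
    # Colour output only when stdout is a terminal.
    if [ -t 1 ]; then
        set -- -c "$@"
    fi
    # Treat stdin as unified diff output when it is a pipe.
    if [ -p /dev/stdin ]; then
        set -- --diff-input "$@"
    fi
    # DWDIFF (an assumed variable) lets you point at another binary.
    "${DWDIFF:-dwdiff}" "$@"
}
```

With this in place, diff -U1 corpus.yesterday.out corpus.today.out | dwdiff_auto behaves like the command above, while dwdiff_auto old.txt new.txt still compares two files directly.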