Corpus test

From Apertium
Revision as of 12:18, 7 December 2009 by Jacob Nordfalk (talk | contribs)

Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful when you want to change a rule or a word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.

Creation of a corpus

Before you start, you first need a corpus. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt) to get an idea of how it should look:

- Grep out all lines with # and @. This will help you find problems in the bidix (@) and the target-language monodix (#).
- Pipe through nl -s '. ' to get the right line numbers.
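
The numbering step can be sketched as a small shell function. This is only an illustration: the function name number_corpus and the sample sentences are assumptions, not part of apertium-eo-en. It relies on nl's default number format (right-aligned, 6 characters wide), which matches the line prefixes shown in the diff output on this page.

```shell
# Hypothetical helper: number each line of a plain-text corpus read from stdin.
# nl right-aligns the number in 6 columns by default; -s '. ' sets the
# separator printed between the number and the text.
number_corpus() {
  nl -s '. '
}

# Usage with two made-up sentences, one per line:
printf '%s\n' 'In Japan there is an input system.' 'This is a test.' | number_corpus
```

Note that nl skips empty lines by default, so a corpus with one sentence per line and no blank lines keeps its numbering aligned with the source.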

Installation and invocation

Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it to match your language pair.

To start, type bash regression-tests.sh on a Unix box.

Output

Lines containing @ and # (indicating .dix problems, many of which could also be found by the testvoc method) will be shown.
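
If you want to see only those lines yourself, a plain grep over the translated output should be enough. The sketch below is a minimal illustration; the function name dix_problem_lines and the sample output lines are made up for the example.

```shell
# Hypothetical helper: print only translated lines that contain @ or #,
# i.e. the lines that point at bidix (@) or target monodix (#) problems.
dix_problem_lines() {
  grep '[@#]'
}

# Usage with three made-up lines of translator output:
printf '%s\n' \
  '     1. En Japanio estas @input sistemo.' \
  '     2. La kato dormas.' \
  '     3. Li volas #skribi.' | dix_problem_lines
```

Only lines 1 and 3 of the sample are printed, because line 2 contains neither marker.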

Most importantly, a list of differences will end up in testcorpus_en-eo.txt. Each entry gives the line number, then the original text, then a line beginning with < with the translation from last time, and finally a line beginning with > with the translation from this time:

--- 1924 ---
  1924. In Japan there is an input system allowing you to type kanji.
<   1924.       En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
>   1924.       En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.

--- 1937 ---
  1937. However, such apparent simplifications can perversely make a script more complicated.
<   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
>   1937.       Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.


Doing without a script

You don't necessarily need a script. Just type

 make && cat corpa/en.crp.txt | time apertium -d . en-eo > origina_traduko.txt 

to make the 'original translation'. Then change your .dixes and invoke again

 make && cat corpa/en.crp.txt | time apertium -d . en-eo > nova_traduko.txt &

The & sign makes the process run in the background. This means you can examine the differences before all the sentences are translated, with:

 diff -w origina_traduko.txt nova_traduko.txt | grep '[<>]' > /tmp/crpdiff.txt &&
 for i in $(cut -c3-8 /tmp/crpdiff.txt | sort -un); do
   echo "--- $i ---"
   grep "^ *$i\." corpa/en.crp.txt
   grep "^. *$i\." /tmp/crpdiff.txt
 done | less
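
To try the same steps without running a full translation, the command above can be wrapped in a function and fed a tiny hand-made corpus. This is a sketch under assumptions: the function name corpus_diff, the temporary files, and the sample sentences are all hypothetical, and the number extraction assumes the corpus was numbered with nl -s '. ' as described earlier.

```shell
# Hypothetical wrapper around the steps above.
# $1 = numbered source corpus, $2 = old translation, $3 = new translation
corpus_diff() {
  tmp=$(mktemp) || return 1
  # Keep only the diff lines that carry text: "< old" and "> new".
  diff -w "$2" "$3" | grep '^[<>]' > "$tmp"
  # Characters 3-8 of each kept line hold the 6-character number written by nl.
  for i in $(cut -c3-8 "$tmp" | sort -un); do
    echo "--- $i ---"
    grep "^ *$i\." "$1"      # original sentence
    grep "^. *$i\." "$tmp"   # old (<) and new (>) translation
  done
  rm -f "$tmp"
}

# Usage with made-up data: only sentence 2 differs between the two runs.
dir=$(mktemp -d)
printf '%s\n' 'The cat sleeps.' 'The dog runs.'   | nl -s '. ' > "$dir/src.txt"
printf '%s\n' 'La kato dormas.' 'La hundo kuri.'  | nl -s '. ' > "$dir/old.txt"
printf '%s\n' 'La kato dormas.' 'La hundo kuras.' | nl -s '. ' > "$dir/new.txt"
corpus_diff "$dir/src.txt" "$dir/old.txt" "$dir/new.txt"
```

The report for the sample has a single entry, for line 2, in the same layout as the testcorpus_en-eo.txt excerpt: a header line, the original sentence, the old translation, and the new translation.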