Difference between revisions of "Corpus test"
Revision as of 11:16, 5 January 2012
Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful if you want to change a rule or word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.
Creation of a corpus
Before you start, you need a corpus. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt) to get an idea of how it should look:
- Grep out all lines with # and @ - this will help you find problems in bidix (@) and target language monodix (#).
- Pipe through nl -s '. ' to get the right line numbers.
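The two steps above can be sketched as a small pipeline. This is only an illustration under one reading of the first step, namely that source lines containing a literal # or @ are dropped so that these characters in the translation output can only come from the translator. The input sentences and the temporary file names are made up for the example.

```shell
# Hypothetical raw input; in practice this would be your own text file.
raw=$(mktemp)
printf '%s\n' 'First sentence.' 'Mail me @ home.' \
              'Second sentence.' 'Issue #3 is open.' > "$raw"

# Drop lines containing # or @, then number the remaining lines.
# nl -s '. ' puts ". " between the line number and the text.
corpus=$(mktemp)
grep -v '[#@]' "$raw" | nl -s '. ' > "$corpus"

cat "$corpus"
```

The numbered lines start with a right-aligned number followed by ". ", which is the shape that the grep patterns in the diff commands further down rely on.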
testcorpus script
Installation and invocation
Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it.
To start, type bash regression-tests.sh on a Unix box.
Output
Lines containing @ and # are shown; these indicate .dix problems, many of which could also be found with the testvoc method.
Most importantly, testcorpus_en-eo.txt will contain a list of differences. Each entry starts with the line number, followed by the original text, then a line beginning with < giving the translation from last time, and finally a line beginning with > giving the translation from this time:
--- 1924 ---
1924. In Japan there is an input system allowing you to type kanji.
< 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
> 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
--- 1937 ---
1937. However, such apparent simplifications can perversely make a script more complicated.
< 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
> 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.
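The first kind of output, lines containing @ or #, can also be produced by hand with grep. The following is a minimal sketch, not the script's actual implementation; the three translated lines are invented sample data standing in for real translator output.

```shell
# Invented sample of translator output; @ and # mark dictionary problems.
out=$(mktemp)
printf '%s\n' '     1. La kato dormas.' \
              '     2. Mi vidas @house.' \
              '     3. Li #iri hejmen.' > "$out"

# -n prefixes each hit with its line number in the file.
problems=$(grep -n '[@#]' "$out")
printf '%s\n' "$problems"

# Separate counts hint at where to look first: bidix (@) or monodix (#).
# grep -c exits non-zero when the count is 0, hence the "|| true".
bidix=$(grep -c '@' "$out" || true)
monodix=$(grep -c '#' "$out" || true)
echo "bidix problems: $bidix, monodix problems: $monodix"
```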
Simple corpus diff
You don't necessarily need a script. Just type
make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt
to make the 'original translation'. Then change your .dixes, and run the translation again:
make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &
The & sign makes the process run in the background, so you can examine the differences in translation output before all the sentences are translated. The commands below produce such a diff and show it along with the original text:
diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt && for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do echo --- $i ---; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt; done | less
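To make the one-liner easier to follow, here is the same logic spread over several lines and run on toy data. The three files are invented stand-ins for corpa/en.crp.txt and the two translation files, and the pager is left out so that the result can be inspected directly.

```shell
# Toy stand-ins for the corpus and the two translation outputs.
dir=$(mktemp -d)
printf '%s\n' '     1. The cat sleeps.' '     2. I see a house.' > "$dir/en.crp.txt"
printf '%s\n' '     1. La kato dormas.' '     2. Mi vidas domo.'  > "$dir/origina_traduko.txt"
printf '%s\n' '     1. La kato dormas.' '     2. Mi vidas domon.' > "$dir/nova_traduko.txt"

# Keep only the changed lines ("< old" and "> new") from the diff.
diff -w "$dir/origina_traduko.txt" "$dir/nova_traduko.txt" \
  | grep '^[<>]' > "$dir/crpdiff.txt"

# Characters 3-8 of each kept line hold the number field written by nl.
# For each distinct number, print the source sentence and both translations.
report=$(
  for i in $(cut -c3-8 "$dir/crpdiff.txt" | sort -un); do
    echo "--- $i ---"
    grep "^ *$i\." "$dir/en.crp.txt"
    grep "^. *$i\." "$dir/crpdiff.txt"
  done
)
printf '%s\n' "$report"
```

Only sentence 2 differs between the two toy translations, so the report has a single entry: the header, the English source line, and the old and new translations.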
Take it further: word diffs
dwdiff (sudo apt-get install dwdiff on Ubuntu, sudo pacman -S dwdiff on Arch Linux) is a program that takes diff input and finds word-level changes, so that instead of
1c1
< Fruit flies enjoy a banana
---
> Fruit flies like a banana
you get
Fruit flies [-enjoy-] {+like+} a banana
Coupled with colour output (the -c option, available in newer versions), huge corpus diffs become a lot more readable.
If there are not many changes, dwdiff by itself will still display the unchanged lines. Since it can read output from diff itself in the "unified diff" format, one of the best ways of using dwdiff in this case is:
$ diff -U1 origina_traduko.txt nova_traduko.txt | dwdiff -c --diff-input
The -U1 gives unified diff output with only one line of context before and after the change (try -U0, -U10, etc.), while the -c ensures you have nice colours, and --diff-input makes dwdiff interpret its input as unified diff output, read here from stdin, instead of expecting two files to compare[1].
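The effect of the context setting can be checked with plain diff before dwdiff is involved at all. The sketch below uses two invented five-line files and counts the context lines (those starting with a space) that -U0 and -U1 produce.

```shell
dir=$(mktemp -d)
printf '%s\n' 'one' 'two' 'three' 'four' 'five' > "$dir/old.txt"
printf '%s\n' 'one' 'two' 'THREE' 'four' 'five' > "$dir/new.txt"

# diff exits with status 1 when the files differ, hence the "|| true".
u0=$(diff -U0 "$dir/old.txt" "$dir/new.txt" || true)
u1=$(diff -U1 "$dir/old.txt" "$dir/new.txt" || true)

# In unified format, unchanged context lines start with a single space.
ctx0=$(printf '%s\n' "$u0" | grep -c '^ ' || true)
ctx1=$(printf '%s\n' "$u1" | grep -c '^ ' || true)
echo "context lines with -U0: $ctx0, with -U1: $ctx1"
```

With -U0 only the changed line is shown; with -U1 the lines "two" and "four" appear around it as context.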
Seeing the forest in all the trees
The following command gives you a hit parade of word-based changes in translation output, that is, each distinct change together with the number of times it occurs:
$ diff -U0 origina_traduko.txt nova_traduko.txt | dwdiff --diff-input |grep -v '^@' |sed 's/.*\[-//'|sed 's/+}.*//' | sort|uniq -c|sort -n
Because sort -n sorts in ascending order, the most frequent changes end up at the bottom of the list, so you can deal with the high-frequency changes first.
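To see what the filtering stages of that pipeline do, they can be run on a few hand-written lines. The sample below is an assumption: it only imitates the [-old-] {+new+} notation shown earlier, plus hunk headers starting with @, and is not captured from a real dwdiff run.

```shell
# Hand-written sample imitating dwdiff output on a unified diff.
sample=$(mktemp)
printf '%s\n' '@@ -1 +1 @@' 'Fruit flies [-enjoy-] {+like+} a banana' \
              '@@ -5 +5 @@' 'Bees [-enjoy-] {+like+} a flower' \
              '@@ -9 +9 @@' 'The [-cat-] {+dog+} sleeps' > "$sample"

# 1. drop hunk headers, 2. cut everything up to the last "[-",
# 3. cut everything from "+}" onwards, 4. count identical changes.
hits=$(grep -v '^@' "$sample" \
  | sed 's/.*\[-//' \
  | sed 's/+}.*//' \
  | sort | uniq -c | sort -n)
printf '%s\n' "$hits"
```

Each remaining line has the form "count old-] {+new", and the change from "enjoy" to "like", which occurs twice in the sample, is listed last.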
Helper script: diffing while translating
When you run two corpus translations in the background, you often want to diff them as they run. However, you'll get a bunch of extra lines at the bottom of your diff from whichever translation task has got furthest. Save the following script to a file to get a diff that's limited to the length of the shorter file:
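#!/bin/bash
# Diff two files, looking only at as many lines as the shorter one has.
if [ $# -lt 2 ]; then
    echo "Usage: $0 file1 file2 [additional options to diff]" >&2
    exit 1
fi
# M is the smaller of the two line counts
# (shell arithmetic is used here instead of the external calc program)
A=$(wc -l < "$1")
B=$(wc -l < "$2")
M=$(( A < B ? A : B ))
# "${@:3}" means all args after the first and second
diff "${@:3}" <(head -n "$M" "$1") <(head -n "$M" "$2")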
Call it "mindiff" or something, and you can keep doing mindiff origina_traduko.txt nova_traduko.txt |tail to check on new differences as they come. Or you can do mindiff origina_traduko.txt nova_traduko.txt -U0|tail |dwdiff --diff-input as shown above.