Corpus test
Revision as of 00:20, 15 January 2012
Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful when you change a rule or a word and want an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.
Creation of a corpus
Before you start, you need a corpus. Look at apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt) to get an idea of how it should look:
- Use grep to remove all lines in the original corpus containing # or @. Apertium uses these symbols to mark errors in the bilingual dictionary and transfer.
  - e.g. grep -v '[@#]' original-corpus > clean-corpus
- Use the command nl -s '. ' to number the lines in the corpus.
  - e.g. nl -s '. ' clean-corpus > clean-numbered-corpus
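The two steps above can be combined into one pass. The following is only a sketch: prepare_corpus is a hypothetical helper name, not something shipped with Apertium, and the file names are placeholders.

```shell
#!/bin/bash
# prepare_corpus RAW OUT
# Drop every line containing @ or #, then number the remaining lines
# as "N. text" so they can be matched up again after translation.
# (prepare_corpus is a hypothetical name used only for illustration.)
prepare_corpus() {
    grep -v '[@#]' "$1" | nl -s '. ' > "$2"
}
```

For example, prepare_corpus original-corpus clean-numbered-corpus. Note that nl does not number empty lines by default; add -b a if the corpus contains blank lines that should be counted.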
testcorpus script
Installation and invocation
Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it.
To start, type bash regression-tests.sh on a Unix box.
Output
Lines containing @ and # will be shown. These indicate .dix problems, many of which could also be found by the testvoc method.
Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry gives the line number, then the original text, then a line beginning with < holding the translation from last time, and finally a line beginning with > holding the translation from this time:
--- 1924 ---
1924. In Japan there is an input system allowing you to type kanji.
< 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
> 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
--- 1937 ---
1937. However, such apparent simplifications can perversely make a script more complicated.
< 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
> 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.
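The check for @ and # lines can also be run by hand on any translated file. This is a minimal sketch, not the code from testcorpus_en-eo.sh; show_marked is a hypothetical name and the argument is whatever file holds your translation output.

```shell
#!/bin/bash
# show_marked FILE
# Print, with line numbers, every translated line that still contains
# an @ or # marker. (show_marked is a hypothetical helper name.)
show_marked() {
    grep -n '[@#]' "$1"
}
```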
Simple corpus diff
You don't necessarily need a script. Just type
make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt
to make the 'original translation'. Then change your .dixes and translate again, this time into a new file:
make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &
The & sign makes the process run in the background, so you can examine the differences in translation output before all the sentences are translated. The commands below produce such a diff and show it along with the original text:
diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt &&
for i in $(cut -c3-8 /tmp/crpdiff.txt | sort -un); do
    echo "--- $i ---"
    grep "^ *$i\." corpa/en.crp.txt
    grep "^. *$i\." /tmp/crpdiff.txt
done | less
The first command creates a simple diff. The for loop then goes through each changed line number and matches the line from the original corpus with the lines that changed.
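The same idea can be wrapped in a function so the file names are not hard-coded. This is a sketch under stated assumptions: corpus_diff is a hypothetical name, all three files are assumed to carry the "N. " numbering produced by nl -s '. ', and the line number is extracted with sed rather than by fixed character columns.

```shell
#!/bin/bash
# corpus_diff SRC OLD NEW
# For every line number whose translation changed, print a header,
# the source sentence, the old translation (<) and the new one (>).
# (corpus_diff is a hypothetical helper name, not an Apertium tool.)
corpus_diff() {
    local src=$1 old=$2 new=$3 changes n
    # Keep only the "<" and ">" lines of a whitespace-insensitive diff.
    changes=$(diff -w "$old" "$new" | grep '^[<>]')
    # Extract the leading line number of each changed line, once each.
    for n in $(printf '%s\n' "$changes" |
               sed -n 's/^[<>] *\([0-9][0-9]*\)\..*/\1/p' | sort -un); do
        echo "--- $n ---"
        grep "^ *$n\." "$src"
        printf '%s\n' "$changes" | grep "^[<>] *$n\."
    done
}
```

For example, corpus_diff corpa/en.crp.txt origina_traduko.txt nova_traduko.txt | less.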
Take it further: word diffs
dwdiff (sudo apt-get install dwdiff on Ubuntu, sudo pacman -S dwdiff on Arch Linux) is a program that takes diff input and finds word changes, so that instead of
1c1
< Fruit flies enjoy a banana
---
> Fruit flies like a banana
you get
Fruit flies [-enjoy-] {+like+} a banana
Coupled with colour output (the -c option, available in newer versions), huge corpus diffs become a lot more readable.
If there are not many changes, dwdiff by itself will still display the unchanged lines. Since it can read output from diff itself in the "unified diff" format, one of the best ways of using dwdiff in this case is:
$ diff -U1 origina_traduko.txt nova_traduko.txt | dwdiff -c --diff-input
The -U1 gives unified diff output with only one line of context before and after each change (try -U0, -U10, etc.), the -c gives you nice colours, and --diff-input makes dwdiff read a unified diff from stdin instead of expecting two files[1].
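If you switch between machines where dwdiff may or may not be installed, the command above can be wrapped so that it degrades to a plain unified diff. This is only a sketch, and worddiff is a hypothetical name chosen for illustration.

```shell
#!/bin/bash
# worddiff OLD NEW
# Show a coloured word-level diff when dwdiff is installed, and fall
# back to a plain unified diff otherwise.
# (worddiff is a hypothetical helper name.)
worddiff() {
    if command -v dwdiff >/dev/null 2>&1; then
        diff -U1 "$1" "$2" | dwdiff -c --diff-input
    else
        diff -U1 "$1" "$2"
    fi
}
```

For example, worddiff origina_traduko.txt nova_traduko.txt | less -R (the -R lets less pass the colour codes through).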
Seeing the forest in all the trees
The following command will give you a hitparade (frequency list) of word-based changes in translation output:
$ diff -U0 origina_traduko.txt nova_traduko.txt | dwdiff --diff-input |grep -v '^@' |sed 's/.*\[-//'|sed 's/+}.*//' | sort|uniq -c|sort -n
Thus you can begin with any very high-frequency changes first.
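To see what each stage of that pipeline does, it can help to run the filtering part on its own against a few lines of saved dwdiff output. The sketch below assumes the [-old-] {+new+} marker format shown earlier; change_hitparade is a hypothetical name.

```shell
#!/bin/bash
# change_hitparade: read dwdiff output on stdin and print each distinct
# word change with its count, least frequent first.
# (change_hitparade is a hypothetical helper name.)
change_hitparade() {
    grep -v '^@' |           # drop the @@ hunk headers
        sed 's/.*\[-//' |    # cut everything up to and including "[-"
        sed 's/+}.*//' |     # cut "+}" and everything after it
        sort | uniq -c | sort -n
}
```

Each output line then has the form "count old-] {+new", for example "2 enjoy-] {+like" when the change from "enjoy" to "like" occurs twice.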
Helper script: diffing while translating
When you run two corpus translations in the background, you often want to diff them as they run. However, you'll get a bunch of extra lines at the bottom of your diff from whichever translation task has progressed the furthest. Save the following script to a file to get a diff that is limited to the length of the shorter file:
#!/bin/bash
if [ $# -lt 2 ]; then
    echo "Usage: $0 file1 file2 [additional options to diff]" >&2
    exit 1
fi
# M is the line count of the shorter file (plain shell arithmetic, no external calc needed)
A=$(wc -l < "$1"); B=$(wc -l < "$2")
M=$(( A < B ? A : B ))
# "${@:3}" means all args after the first and second
diff "${@:3}" <(head -n "$M" "$1") <(head -n "$M" "$2")
Call it "mindiff" or something and make it executable. You can then keep running
mindiff origina_traduko.txt nova_traduko.txt | tail
to check on new differences as they come. Or you can run
mindiff origina_traduko.txt nova_traduko.txt -U0 | tail | dwdiff --diff-input
as shown above.