Difference between revisions of "Corpus test"
Revision as of 12:19, 7 December 2009
Corpus testing means translating a whole corpus and comparing the result with the output from the last time the corpus was translated. This is very useful if you want to change a rule or a word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to check that your translator works and that your changes haven't broken anything.
Creation of a corpus
Before you start, you need a corpus. Look at apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt to decompress it) to get an idea of how it should look:
- Remove (grep out) all lines containing # or @. In the translated output these characters mark problems in the bidix (@) and the target-language monodix (#), so they should not already be present in the source text.
- Pipe the text through nl -s '. ' to prefix each line with its line number.
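The two steps above can be sketched as a small shell function. This is only an illustration: the function name make_corpus and the input file name raw.txt are assumptions, not part of apertium-eo-en.

```shell
# Hypothetical helper: turn a raw text file (one sentence per line)
# into a numbered corpus on standard output.
make_corpus() {
    # Drop every line containing # or @, then number the remaining lines
    # with ". " as the separator between number and text.
    grep -v '[#@]' "$1" | nl -s '. '
}
```

It could then be used as make_corpus raw.txt > corpa/en.crp.txt. Note that nl leaves empty lines unnumbered by default.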
Installation and invocation
Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it.
To start, type bash regression-tests.sh on a Unix box.
Output
Lines containing @ and # (indicating .dix problems, many of which could also be found by the testvoc method) will be shown.
Most importantly, testcorpus_en-eo.txt will end up containing a list of differences. Each entry gives the line number, then the original text, then a line beginning with < holding the translation from last time, and finally a line beginning with > holding the translation from this time:
--- 1924 ---
1924. In Japan there is an input system allowing you to type kanji.
< 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
> 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
--- 1937 ---
1937. However, such apparent simplifications can perversely make a script more complicated.
< 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
> 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.
Doing without a script
You don't necessarily need a script. Just type
make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt
to make the 'original translation'. Then change your .dix files and run the command again, this time writing to a new file:
make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &
The & sign makes the process run in the background. This means you can examine the differences before all the sentences have been translated, with:
diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt && for i in $(cut -c3-8 /tmp/crpdiff.txt | sort -un); do echo "--- $i ---"; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt; done | less
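For readability, the same comparison can be sketched as a shell function. This is a minimal sketch, not the project's script: the name show_corpus_diff and its three arguments are assumptions, and it relies on the corpus having been numbered with nl -s '. ' so that each line starts with a six-column line number.

```shell
# Hypothetical helper: show_corpus_diff ORIGINAL NEW CORPUS
show_corpus_diff() {
    orig=$1; new=$2; corpus=$3
    changes=$(mktemp)
    # Keep only the changed lines from diff ("< old" and "> new").
    diff -w "$orig" "$new" | grep '^[<>]' > "$changes"
    # Characters 3-8 of each diff line hold the line number written by nl.
    for i in $(cut -c3-8 "$changes" | sort -un); do
        echo "--- $i ---"
        grep "^ *$i\." "$corpus"      # the source sentence
        grep "^. *$i\." "$changes"    # old (<) and new (>) translation
    done
    rm -f "$changes"
}
```

It could be called as show_corpus_diff origina_traduko.txt nova_traduko.txt corpa/en.crp.txt | less. grep is run without -r here so that it reads the piped diff output rather than searching a directory.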