{{TOCD}}

'''Corpus testing''' is the practice of testing (translating) a whole corpus and comparing the result with the last time the corpus was translated. This is very useful if you want to change a rule or a word and want to get an overview of the consequences on real-life text. Along with [[testvoc]] and [[regression testing]] it is a good way to check that your translator works and that your changes haven't broken anything.
==Creation of a corpus==

Before you start, you first need a [[corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run <code>bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt</code>) to get an idea of how it should look:

* Grep out all lines containing # and @; these will help you find problems in the bidix (@) and the target-language monodix (#).
* Pipe the text through <code>nl -s '. '</code> to get the right line numbers (see the sketch below).
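A minimal sketch of these two steps, assuming you start from a plain-text file with one sentence per line (my_sentences.txt is just a placeholder name, and nova_traduko.txt stands for the translated output produced in the sections below; the # and @ lines are grepped from that output):

<pre>
# Number each sentence so that each line starts with "     1. ", "     2. ", and so on.
nl -s '. ' my_sentences.txt > corpa/en.crp.txt

# After a translation run, pull out the lines with @ (bidix gaps)
# and # (target-language monodix problems) for inspection.
grep '[@#]' nova_traduko.txt
</pre>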
==Installation and invocation==

Copy <code>testcorpus_en-eo.sh</code> from apertium-eo-en and change the language-pair names in it.

To start, type <code>bash testcorpus_en-eo.sh</code> on a Unix box.
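If you would rather see the idea than the script itself, the following is only a rough sketch of what a corpus-test script does, not the actual contents of <code>testcorpus_en-eo.sh</code>; the file names follow the "Doing without a script" section below:

<pre>
#!/bin/bash
# Sketch only, not the real testcorpus_en-eo.sh: translate the corpus,
# report dictionary problems, and diff against the previous run.

make                                              # rebuild the language pair
apertium -d . en-eo < corpa/en.crp.txt > nova_traduko.txt

# Lines with @ point at bidix gaps, lines with # at target-language
# monodix problems.
grep '[@#]' nova_traduko.txt

# Store the differences, then keep the new translation as the
# reference for the next run.
diff -w origina_traduko.txt nova_traduko.txt > testcorpus_en-eo.txt
mv nova_traduko.txt origina_traduko.txt
</pre>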
==Output==

Lines containing @ and # (indicating a .dix problem; many of these can also be found with the [[testvoc]] method) will be shown.

Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry gives the line number first, then the original text, then a line beginning with < containing the translation from last time, and finally a line beginning with > containing the translation from this time:

<pre>
--- 1924 ---
1924. In Japan there is an input system allowing you to type kanji.
< 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
> 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
--- 1937 ---
1937. However, such apparent simplifications can perversely make a script more complicated.
< 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
> 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.
</pre>
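To retest one sentence from the list on its own, you can pull it out of the corpus by its number and push it through the translator again (1924 is just the number from the example above):

<pre>
grep '^ *1924\.' corpa/en.crp.txt | apertium -d . en-eo
</pre>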
==Doing without a script==

You don't necessarily need a script. Just type

<pre>
make && cat corpa/en.crp.txt | time apertium -d . en-eo > origina_traduko.txt
</pre>
to produce the 'original translation'. Then change your .dix files and run it again:

<pre>
make && cat corpa/en.crp.txt | time apertium -d . en-eo > nova_traduko.txt &
</pre>

The & sign makes the process run in the background, so you can examine the differences before all the sentences have been translated, with:
<pre>
diff -w origina_traduko.txt nova_traduko.txt | grep '[<>]' > /tmp/crpdiff.txt && for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do echo --- $i ---; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt; done | less
</pre>
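The same pipeline, written out as a commented sketch so that each step is easier to follow:

<pre>
#!/bin/bash
# Step-by-step version of the one-liner above.

# 1. Keep only the changed lines (< old translation, > new translation).
diff -w origina_traduko.txt nova_traduko.txt | grep '[<>]' > /tmp/crpdiff.txt

# 2. For every sentence number that changed, print a header, the
#    original sentence from the corpus, and both translations.
for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do
    echo --- $i ---
    grep "^ *$i\." corpa/en.crp.txt
    grep "^. *$i\." /tmp/crpdiff.txt
done | less
</pre>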
[[Category:Development]]
[[Category:Terminology]]