{{TOCD}}

'''Corpus testing''' is the practice of testing (translating) a whole corpus and comparing the result with the last time the corpus was translated. This is very useful if you want to change a rule or a word and want to get an overview of the consequences on real-life text. Along with [[testvoc]] and [[regression testing]] it is a good way to check that your translator works and that your changes haven't broken anything.
==Creation of a corpus==

Before you start, you first need a [[corpus]]. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run <code>bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt</code>) to get an idea of how it should look:

* Grep out all lines containing # and @; these will help you find problems in the bidix (@) and the target-language monodix (#).
* Pipe the text through <code>nl -s '. '</code> to get the right line numbers (see the sketch below).
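A minimal sketch of these two steps, assuming you start from a plain-text file with one sentence per line (my_sentences.txt is just a placeholder name, and nova_traduko.txt stands for the translated output produced in the sections below; the # and @ lines are grepped from that output):

<pre>
# Number each sentence so that each line starts with "     1. ", "     2. ", and so on.
nl -s '. ' my_sentences.txt > corpa/en.crp.txt

# After a translation run, pull out the lines with @ (bidix gaps)
# and # (target-language monodix problems) for inspection.
grep '[@#]' nova_traduko.txt
</pre>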
==Installation and invocation==

Copy <code>testcorpus_en-eo.sh</code> from apertium-eo-en and change the language-pair names in it.

To start, type <code>bash testcorpus_en-eo.sh</code> on a Unix box.
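If you would rather see the idea than the script itself, the following is only a rough sketch of what a corpus-test script does, not the actual contents of <code>testcorpus_en-eo.sh</code>; the file names follow the "Doing without a script" section below:

<pre>
#!/bin/bash
# Sketch only, not the real testcorpus_en-eo.sh: translate the corpus,
# report dictionary problems, and diff against the previous run.

make                                              # rebuild the language pair
apertium -d . en-eo < corpa/en.crp.txt > nova_traduko.txt

# Lines with @ point at bidix gaps, lines with # at target-language
# monodix problems.
grep '[@#]' nova_traduko.txt

# Store the differences, then keep the new translation as the
# reference for the next run.
diff -w origina_traduko.txt nova_traduko.txt > testcorpus_en-eo.txt
mv nova_traduko.txt origina_traduko.txt
</pre>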
==Output==

Lines containing @ and # (indicating a .dix problem; many of these can also be found with the [[testvoc]] method) will be shown.

Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry gives the line number first, then the original text, then a line beginning with < containing the translation from last time, and finally a line beginning with > containing the translation from this time:

<pre>
--- 1924 ---
1924. In Japan there is an input system allowing you to type kanji.
< 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
> 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
--- 1937 ---
1937. However, such apparent simplifications can perversely make a script more complicated.
< 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
> 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.
</pre>
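To retest one sentence from the list on its own, you can pull it out of the corpus by its number and push it through the translator again (1924 is just the number from the example above):

<pre>
grep '^ *1924\.' corpa/en.crp.txt | apertium -d . en-eo
</pre>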
==Doing without a script==

You don't necessarily need a script. Just type

<pre>
make && cat corpa/en.crp.txt | time apertium -d . en-eo > origina_traduko.txt
</pre>
to produce the 'original translation'. Then change your .dix files and run it again:

<pre>
make && cat corpa/en.crp.txt | time apertium -d . en-eo > nova_traduko.txt &
</pre>

The & sign makes the process run in the background, so you can examine the differences before all the sentences have been translated, with:
<pre>
diff -w origina_traduko.txt nova_traduko.txt | grep '[<>]' > /tmp/crpdiff.txt && for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do echo --- $i ---; grep "^ *$i\." corpa/en.crp.txt; grep "^. *$i\." /tmp/crpdiff.txt; done | less
</pre>
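The same pipeline, written out as a commented sketch so that each step is easier to follow:

<pre>
#!/bin/bash
# Step-by-step version of the one-liner above.

# 1. Keep only the changed lines (< old translation, > new translation).
diff -w origina_traduko.txt nova_traduko.txt | grep '[<>]' > /tmp/crpdiff.txt

# 2. For every sentence number that changed, print a header, the
#    original sentence from the corpus, and both translations.
for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do
    echo --- $i ---
    grep "^ *$i\." corpa/en.crp.txt
    grep "^. *$i\." /tmp/crpdiff.txt
done | less
</pre>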
[[Category:Development]]
[[Category:Terminology]]