Difference between revisions of "Corpus test"
Latest revision as of 20:16, 23 July 2021
Corpus testing means translating a whole corpus and comparing the result to the output from the last time the corpus was translated. This is very useful if you want to change a rule or word and get an overview of the consequences on real-life text. Along with testvoc and regression testing, it is a good way to test that your translator works and that your changes haven't broken anything.
Everything listed below can also be done automatically using Apertium-regtest.
Creation of a corpus
Before you start you first need a corpus. Look in apertium-eo-en/corpa/enwiki.crp.txt.bz2 (run bunzip2 -c enwiki.crp.txt.bz2 > en.crp.txt) to get an idea of how it should look:
- Use grep to remove all lines in the original corpus containing # and @, as these symbols are used in Apertium for marking errors in the bilingual dictionary and transfer.
  - e.g. grep -v '[@#]' original-corpus > clean-corpus
- Use the command nl -s '. ' to number the lines in the corpus.
  - e.g. nl -s '. ' clean-corpus > clean-numbered-corpus
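
The two steps above can also be chained into a single pipeline. The following is only a sketch: the function name prepare_corpus is made up for illustration and is not part of Apertium, and the file names are the placeholders used in the list above.

```shell
#!/bin/bash
# prepare_corpus is a hypothetical helper name, not an Apertium tool.
# It reads a raw corpus (one sentence per line) on stdin and writes the
# cleaned, numbered corpus to stdout.
prepare_corpus() {
    # Drop lines that already contain Apertium's error markers (@ and #),
    # then prefix each remaining line with "<number>. ".
    grep -v '[@#]' | nl -s '. '
}

# Example usage with the placeholder file names from the list above:
# prepare_corpus < original-corpus > clean-numbered-corpus
```

Note that nl, by default, only numbers non-empty lines, so blank lines in the corpus stay unnumbered.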
testcorpus script
Installation and invocation
Copy testcorpus_en-eo.sh from apertium-eo-en and change the names in it.
To start, type bash regression-tests.sh on a Unix box.
Output
Lines containing @ and # (indicating a .dix problem, many of which could also be found by the testvoc method) will be shown.
Most importantly, a list of differences ends up in testcorpus_en-eo.txt. Each entry gives the line number, then the original text, then a line beginning with < holding the translation from last time, and finally a line beginning with > holding the translation from this time:

    --- 1924 ---
    1924. In Japan there is an input system allowing you to type kanji.
    < 1924. En Japanio estas kontribuaĵan sistemon permesanta vi tajpi *kanji.
    > 1924. En Japanio estas kontribuaĵa sistemo permesanta vin tajpi *kanji.
    --- 1937 ---
    1937. However, such apparent simplifications can perversely make a script more complicated.
    < 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribo pli komplika.
    > 1937. Tamen, tiaj evidentaj simpligoj povas *perversely fari skribon pli komplika.

Simple corpus diff
You don't necessarily need a script. Just type

    make && cat corpa/en.crp.txt | apertium -d . en-eo > origina_traduko.txt

to make the 'original translation'. Then change your .dixes, and invoke again:

    make && cat corpa/en.crp.txt | apertium -d . en-eo > nova_traduko.txt &

The & sign makes the process run in the background, so you can examine the differences in translation output before all the sentences are translated. The commands below produce such a diff and show it along with the original text:

    diff -w origina_traduko.txt nova_traduko.txt | grep '^[<>]' > /tmp/crpdiff.txt && for i in `cut -c3-8 /tmp/crpdiff.txt | sort -un`; do
      echo --- $i ---
      grep "^ *$i\." corpa/en.crp.txt
      grep "^. *$i\." /tmp/crpdiff.txt
    done | less

The first command creates a simple diff, while the for loop goes through each change, and tries to match the line from the original corpus up with the line that had the change.
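
To see what the loop does without running a real translator, the same pipeline can be tried on toy data. This is a sketch only: the file contents and the function name corpus_diff are invented for illustration, the "translations" are fake strings, and the pager is left out so the output goes straight to the terminal.

```shell
#!/bin/bash
# Toy demonstration of the diff-and-lookup loop. All file contents and the
# name corpus_diff are made up; only the pipeline comes from the text above.
tmp=$(mktemp -d)

# A numbered source corpus in the "nl -s '. '" layout.
printf 'Fruit flies like a banana.\nTime flies.\nI like fruit.\n' \
    | nl -s '. ' > "$tmp/en.crp.txt"

# Two pretend translation runs: only sentence 2 differs between them.
printf 'Frukto A.\nTempo B.\nMi C.\n' | nl -s '. ' > "$tmp/origina_traduko.txt"
printf 'Frukto A.\nTempo X.\nMi C.\n' | nl -s '. ' > "$tmp/nova_traduko.txt"

corpus_diff() {
    # $1 = source corpus, $2 = old translation, $3 = new translation
    # Keep only the changed lines ("<" = old, ">" = new).
    diff -w "$2" "$3" | grep '^[<>]' > "$tmp/crpdiff.txt"
    # Columns 3-8 hold the line number written by nl (6 characters wide).
    for i in $(cut -c3-8 "$tmp/crpdiff.txt" | sort -un); do
        echo "--- $i ---"
        grep "^ *$i\." "$1"                  # the source sentence
        grep "^. *$i\." "$tmp/crpdiff.txt"   # old and new translation
    done
}

corpus_diff "$tmp/en.crp.txt" "$tmp/origina_traduko.txt" "$tmp/nova_traduko.txt"
```

Running it should print a single block for sentence 2: the source sentence followed by the old and the new translation.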
Take it further: word diffs
dwdiff (sudo apt-get install dwdiff on Ubuntu, sudo pacman -S dwdiff on Arch Linux) is a program that takes diff input and finds word changes, so that instead of

    1c1
    < Fruit flies enjoy a banana
    ---
    > Fruit flies like a banana

you get

    Fruit flies [-enjoy-] {+like+} a banana

Coupled with colour output (the -c option, available in newer versions), huge corpus diffs become a lot more readable.
If there are not many changes, dwdiff by itself will still display the unchanged lines. Since it can read output from diff itself in the "unified diff" format, one of the best ways of using dwdiff in this case is:

    $ diff -U1 origina_traduko.txt nova_traduko.txt | dwdiff -c --diff-input

The -U1 gives unified diff output with only one line of context before and after the change (try -U0, -U10, etc.), while the -c ensures you have nice colours, and --diff-input makes dwdiff read from stdin instead of expecting two files[1].
Seeing the forest in all the trees
The following command will give you a hitparade (frequency list) of word-based changes in translation output:
$ diff -U0 origina_traduko.txt nova_traduko.txt | dwdiff --diff-input |grep -v '^@' |sed 's/.*\[-//'|sed 's/+}.*//' | sort|uniq -c|sort -nr
This lets you deal with the highest-frequency changes first.
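
The counting half of that pipeline can be tried on its own, without dwdiff installed, by feeding it a few hand-written lines in dwdiff's [-old-] {+new+} notation. This is a sketch: the sample lines and the function name count_changes are invented for illustration.

```shell
#!/bin/bash
# count_changes is a hypothetical name; the commands inside are the tail end
# of the pipeline above. It expects dwdiff-style lines on stdin.
count_changes() {
    grep -v '^@' |          # skip unified-diff hunk headers
    sed 's/.*\[-//' |       # cut everything up to the deleted word
    sed 's/+}.*//' |        # cut everything after the inserted word
    sort | uniq -c |        # count identical old/new pairs
    sort -nr                # most frequent change first
}

# Hand-written sample input, standing in for real dwdiff output.
printf '%s\n' \
    '@@ -1 +1 @@' \
    'Fruit flies [-enjoy-] {+like+} a banana' \
    'They [-enjoy-] {+like+} it' \
    'A [-big-] {+large+} dog' | count_changes
```

Each output line is a count followed by the old and new word, so the "enjoy" to "like" change, which occurs twice in the sample, comes out on top.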
Helper script: diffing while translating
When you run two corpus translations in the background, you often want to diff them as they run. However, you'll get a bunch of extra lines at the bottom of your diff from whichever translation task has got furthest. Save the following script to a file to get a diff that is limited to the length of the shorter file:

    #!/bin/bash
    if [ $# -lt 2 ]; then
        echo "Usage: $0 file1 file2 [additional options to diff]"
        exit 1
    fi
    # Line count of the shorter of the two files
    # (plain bash arithmetic, so the external calc program is not needed)
    M1=$(wc -l < "$1")
    M2=$(wc -l < "$2")
    M=$(( M1 < M2 ? M1 : M2 ))
    # "${@:3}" means all args after first and second
    diff "${@:3}" <(head -n "$M" "$1") <(head -n "$M" "$2")

Call it "mindiff" or something, and you can keep doing mindiff origina_traduko.txt nova_traduko.txt |tail to check on new differences as they come. Or you can do

    mindiff origina_traduko.txt nova_traduko.txt -U0|tail |dwdiff --diff-input

as shown above.