Difference between revisions of "Testvoc"

From Apertium

Revision as of 08:21, 9 April 2010

A testvoc is literally a test of vocabulary. At its most basic, it expands a source-language (SL) dictionary and runs every possible analysed lexical form through all the translation stages, to check that each possible input yields a sensible translation in the target language (TL), free of # and @ symbols.

Example scripts for testvoc of single lexical units

The following is a very simple script illustrating testvoc for 1-stage transfer. The tee command saves the output of transfer, which contains both the words (strictly, lexical units) that passed through transfer successfully and those that got an @ prepended. The last file holds the output of generation, which contains both the words that were generated successfully and those that got a # prepended (anything with an @ will also get a #):

MONODIX=apertium-nn-nb.nn.dix
T1X=apertium-nn-nb.nn-nb.t1x
BIDIXBIN=nn-nb.autobil.bin
GENERATORBIN=nn-nb.autogen.bin
ALPHABET="ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖ" # from $MONODIX

lt-expand ${MONODIX} | grep -e ':<:' -e "[$ALPHABET]:[$ALPHABET]" |\
sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' |  sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g' |\
apertium-transfer ${T1X} ${T1X}.bin ${BIDIXBIN} | tee after-transfer.txt |\
lt-proc ${GENERATORBIN} > after-generation.txt
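
To make the sed and cut stage of the script above easier to follow, here is a minimal sketch that isolates it in a function and runs it on two sample lines. The function name to_lexical_units and the sample entries are assumptions made for illustration; they are not taken from the nn-nb dictionary. The input imitates the two line shapes that the grep stage lets through, "surface:lemma<tags>" and "surface:<:lemma<tags>".

```shell
# Hypothetical helper wrapping the sed/cut stage of the script above.
# Reads lt-expand style lines on stdin and writes one lexical unit per
# line, each followed by a sentence-boundary unit.
to_lexical_units() {
    sed 's/:<:/%/g' | sed 's/:/%/g' |   # turn both separators into '%'
        cut -f2 -d'%' |                 # keep the analysis side only
        sed 's/^/^/' |                  # open the lexical unit with '^'
        sed 's/$/$ ^.<sent><clb>$/'     # close it and append the full stop
}

# Invented sample input, for illustration only.
printf '%s\n' 'hus:hus<n><nt><sg><ind>' 'huset:<:hus<n><nt><sg><def>' |
    to_lexical_units
# prints:
# ^hus<n><nt><sg><ind>$ ^.<sent><clb>$
# ^hus<n><nt><sg><def>$ ^.<sent><clb>$
```

The surface form is discarded because testvoc feeds analyses, not surface text, into transfer.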


The following is a real-life inconsistency.sh script from apertium-br-fr that expands the dictionary of Breton and passes it through the translator:

TMPDIR=/tmp

lt-expand ../apertium-br-fr.br.dix | grep -v '<prn><enc>' | grep -e ':<:' -e '\w:\w' |\
        sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' | sed 's/^/^/g' | sed 's/$/$ ^.<sent>$/g' |\
        tee $TMPDIR/tmp_testvoc1.txt |\
        apertium-pretransfer |\
        apertium-transfer ../apertium-br-fr.br-fr.t1x ../br-fr.t1x.bin ../br-fr.autobil.bin |\
        apertium-interchunk ../apertium-br-fr.br-fr.t2x ../br-fr.t2x.bin |\
        apertium-postchunk ../apertium-br-fr.br-fr.t3x ../br-fr.t3x.bin |\
        tee $TMPDIR/tmp_testvoc2.txt |\
        lt-proc -d ../br-fr.autogen.bin > $TMPDIR/tmp_testvoc3.txt

paste -d _ $TMPDIR/tmp_testvoc1.txt $TMPDIR/tmp_testvoc2.txt $TMPDIR/tmp_testvoc3.txt |\
 sed 's/\^.<sent>\$//g' | sed 's/_/   --------->  /g'
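
The three-column report printed by the paste command can be long, so it may help to keep only the lines that signal a problem. The following is a sketch under stated assumptions: the helper name show_failures is hypothetical, the sample lines are invented, and the exact shape of real output may differ. It treats the arrow inserted by the final sed as the column separator and keeps the lines whose last column contains a # or an @.

```shell
# Hypothetical helper: read the three-column report on stdin and print
# only the lines whose last column (generation output) contains '#' or '@'.
show_failures() {
    awk -F ' *---------> *' '$NF ~ /[#@]/'
}

# Invented report lines, for illustration only.
report() {
    printf '%s\n' \
        '^ki<n><m><sg>$   --------->  ^chien<n><m><sg>$   --------->  chien' \
        '^kazh<n><m><pl>$   --------->  ^@kazh<n><m><pl>$   --------->  #@kazh<n><m><pl>'
}

report | show_failures           # prints only the second line
report | show_failures | wc -l   # counts the failing lines
```

In practice the output of inconsistency.sh would be piped into show_failures in place of the sample report.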