Difference between revisions of "Testvoc"
Revision as of 13:25, 4 October 2010
A testvoc is literally a test of vocabulary. At the most basic level, it expands a source-language (sl) dictionary and runs each possible analysed lexical form through all the translation stages, to check that every possible input yields a sensible translation in the target language (tl), without any # or @ symbols.
However, as transfer rules may introduce errors that are not visible when translating single lexical units, a release-quality language pair also needs testvoc on phrases consisting of several lexical units. Often one can find a lot of the errors by running a large corpus (with all @, / or # symbols removed) through the translator, with debug symbols on, and grepping for [@#/].
It would be nice, however, to have a script that testvoc'ed all possible transfer rule runs (without having to run all possible combinations of lexical units, which would take forever). One problem is that transfer rules can refer not only to tags but also to lemmas, and that multi-stage transfer means you have to test fairly long sequences.
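The corpus check described above could be sketched roughly as follows. The two function names are hypothetical helpers, not part of Apertium, and the mode name nn-nb in the usage comment is only a placeholder for whatever mode your pair provides with debug symbols left in the output.

```shell
# Hypothetical helpers; only tr and grep are required.

# Remove characters that would later be confused with debug symbols.
clean_corpus() {
  tr -d '@#/'
}

# Print "line-number:line" for every output line that contains a debug symbol.
find_marks() {
  grep -n '[@#/]'
}

# Usage (assumes a compiled pair in the current directory and a mode called nn-nb):
#   clean_corpus < corpus.txt | apertium -d . nn-nb | find_marks
```

Stripping the symbols first means that anything find_marks reports was introduced by the translator and not by the corpus itself.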
Example scripts for testvoc of single lexical units
The following is a very simple script illustrating testvoc for 1-stage transfer. The tee command saves the output from transfer, which includes lexical units that passed through transfer successfully and lexical units that got an @ prepended. The last file is the output from generation, which includes words that were generated successfully and words that have a # prepended (anything with an @ will also get a #):
<pre>
MONODIX=apertium-nn-nb.nn.dix
T1X=apertium-nn-nb.nn-nb.t1x
BIDIXBIN=nn-nb.autobil.bin
GENERATORBIN=nn-nb.autogen.bin
ALPHABET="ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖ" # from $MONODIX

# Double quotes, so that the shell expands $ALPHABET inside the bracket expressions.
# lt-proc -d: generation with debug marks (lt-proc without a mode flag does analysis).
lt-expand ${MONODIX} | grep -e ':<:' -e "[$ALPHABET]:[$ALPHABET]" |\
sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' | sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g' |\
apertium-transfer ${T1X} ${T1X}.bin ${BIDIXBIN} | tee after-transfer.txt |\
lt-proc -d ${GENERATORBIN} > after-generation.txt
</pre>
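To make the grep/sed/cut steps of the script above more concrete, here is a minimal sketch that wraps them in a function and feeds it a few hand-written lines in the surface:analysis format printed by lt-expand. The function name to_lexical_units is hypothetical, the sample entries are invented for illustration, and the alphabet test is simplified to [[:alnum:]] so that the block runs without a dictionary.

```shell
# Hypothetical wrapper around the filter steps of the script above.
# Reads lt-expand output on stdin and writes one lexical unit per line,
# each followed by a sentence-final full stop.
to_lexical_units() {
  grep -e ':<:' -e '[[:alnum:]]:[[:alnum:]]' |  # entries marked :>: are dropped here
    sed 's/:<:/%/g' | sed 's/:/%/g' |           # unify the separator to %
    cut -f2 -d'%' |                             # keep the analysis side only
    sed 's/^/^/' | sed 's/$/$ ^.<sent><clb>$/'  # wrap as ^...$ and append the full stop
}

# Invented sample entries, for illustration only.
printf '%s\n' \
  'hus:hus<n><nt><sg><ind>' \
  'huset:<:hus<n><nt><sg><def>' \
  'husi:>:hus<n><nt><sg><def>' | to_lexical_units
# prints:
# ^hus<n><nt><sg><ind>$ ^.<sent><clb>$
# ^hus<n><nt><sg><def>$ ^.<sent><clb>$
```

The appended full stop gives each lexical unit its own sentence, which is what the original script's final sed step does as well.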
The following is a real-life inconsistency.sh script from apertium-br-fr that expands the dictionary of Breton and passes it through the translator:
<pre>
TMPDIR=/tmp

lt-expand ../apertium-br-fr.br.dix | grep -v '<prn><enc>' | grep -e ':<:' -e '\w:\w' |\
sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' | sed 's/^/^/g' | sed 's/$/$ ^.<sent>$/g' |\
tee $TMPDIR/tmp_testvoc1.txt |\
apertium-pretransfer |\
apertium-transfer ../apertium-br-fr.br-fr.t1x ../br-fr.t1x.bin ../br-fr.autobil.bin |\
apertium-interchunk ../apertium-br-fr.br-fr.t2x ../br-fr.t2x.bin |\
apertium-postchunk ../apertium-br-fr.br-fr.t3x ../br-fr.t3x.bin |\
tee $TMPDIR/tmp_testvoc2.txt |\
lt-proc -d ../br-fr.autogen.bin > $TMPDIR/tmp_testvoc3.txt

paste -d _ $TMPDIR/tmp_testvoc1.txt $TMPDIR/tmp_testvoc2.txt $TMPDIR/tmp_testvoc3.txt |\
sed 's/\^.<sent>\$//g' | sed 's/_/ ---------> /g'
</pre>
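The script above prints every entry, good or bad. One possible way to narrow that down, sketched here under the assumption that the three columns are separated by " ---------> " exactly as the final sed produces them, is to keep only the rows whose last column still carries a mark. The function name failed_entries is hypothetical and the sample rows are invented.

```shell
# Hypothetical helper: keep only rows whose last column (the generated
# form) still contains an @ or # mark.
failed_entries() {
  awk -F ' ---------> ' '$NF ~ /[@#]/'
}

# Invented sample rows in the three-column format printed by the script.
printf '%s\n' \
  '^ti<n><m><sg>$ ---------> ^maison<n><f><sg>$ ---------> maison' \
  '^foo<n><m><sg>$ ---------> ^@foo<n><m><sg>$ ---------> #foo' |
  failed_entries
# prints only the second row
```

Only the last column is tested because a # can legitimately occur inside a source lexical unit (for example in multiword lemmas), so matching on the whole row could report entries that were translated correctly.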
Corpus testvoc
One way of looking for @'s in a corpus is:
<pre>
$ cat corpus.txt | nl -ba | sed 's/^ *\([0-9][0-9]*\)/<a \1\/>/'| apertium-deshtml | apertium -f none -d . sme-nob-interchunk1 |grep '\^@'
</pre>
This numbers every line in corpus.txt (nl -ba also counts empty lines, which plain nl skips, so the numbers match the file), then puts that number in a fake HTML tag, which deshtml turns into a superblank. So if we now see
<pre>
<a 276\/> ]^Conj<@CVP><cnjcoo>{^men<cnjcoo>$}$ ^nom<SN><@SUBJ→><nt><pl><ind><nom><unc>{^folk<n><nt><pl><5>$}$ ^verb<SV><@+FMAINV><Ind><pret><p3><pl><m>{^begynne<vblex><pret>$}$ ^part<part>{^å<part>$}$ ^verb<SV><inf><loc-for><m>{^@ballat<V><inf>$}$
...
</pre>
we can get the original line like this:
<pre>
$ head -n 276 corpus.txt |tail -n1
</pre>
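When many lines are flagged, looking them up one by one gets tedious. The following is a minimal sketch of how the lookup might be automated; both function names are hypothetical and not part of Apertium. It assumes the tag appears in the output as <a N\/> (or <a N/>), as in the example above, and it numbers lines with nl -ba because plain nl leaves empty lines unnumbered, which would make the numbers drift away from what head -n expects.

```shell
# Hypothetical helpers; names are not part of Apertium.

# Prefix every line (including empty ones, hence -ba) with <a N/>.
tag_line_numbers() {
  nl -ba | sed 's/^ *\([0-9][0-9]*\)/<a \1\/>/'
}

# Read the grep output on stdin, pull N out of each <a N\/> tag and
# print "N<TAB>original line" from the corpus file given as $1.
original_lines() {
  corpus=$1
  sed -E -n 's,.*<a ([0-9]+)\\?/>.*,\1,p' | sort -nu |
    while read -r n; do
      printf '%s\t' "$n"
      sed -n "${n}p" "$corpus"
    done
}

# Usage (mode name taken from the command above):
#   tag_line_numbers < corpus.txt | apertium-deshtml |
#     apertium -f none -d . sme-nob-interchunk1 | grep '\^@' |
#     original_lines corpus.txt
```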