Testvoc
Revision as of 08:22, 7 October 2014
A testvoc is literally a test of vocabulary. At the most basic level, it simply expands a source-language (sl) dictionary and runs each possible analysed lexical form through all the translation stages, checking that every possible input yields a sensible target-language (tl) translation without # or @ symbols.
However, as transfer rules may introduce errors that are not visible when translating single lexical units, a release-quality language pair also needs testvoc on phrases consisting of several lexical units. Often one can find a lot of the errors by running a large corpus (with all @, / or # symbols removed) through the translator, with debug symbols on, and grepping for [@#/].
It would be nice, however, to have a script that testvoc'ed all possible transfer rule runs (without having to run all possible combinations of lexical units, which would take forever). One problem is that transfer rules can refer not only to tags but also to lemmas, and that multi-stage transfer means you have to test fairly long sequences.
Example scripts for 1-LU testvoc
The following is a very simple script illustrating testvoc for 1-stage transfer. The tee command saves the output from transfer, which includes words (actually lexical units) that passed successfully through transfer as well as words that got an @ prepended. The last file is the output from generation, which includes words that were successfully generated and words that have a # prepended (anything with an @ will also get a #):
MONODIX=apertium-nn-nb.nn.dix
T1X=apertium-nn-nb.nn-nb.t1x
BIDIXBIN=nn-nb.autobil.bin
GENERATORBIN=nn-nb.autogen.bin
ALPHABET="ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖ" # from $MONODIX

lt-expand ${MONODIX} | grep -e ':<:' -e '[$ALPHABET]:[$ALPHABET]' |\
  sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' | sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g' |\
  apertium-transfer ${T1X} ${T1X}.bin ${BIDIXBIN} | tee after-transfer.txt |\
  lt-proc ${GENERATORBIN} > after-generation.txt
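To see what the sed/cut chain in the script above actually does, here is a minimal illustration on a single made-up line of lt-expand output (the entry `hus:hus<n><nt><sg><ind>` is hypothetical, not from the real nn dictionary): the analysis side is kept, wrapped in ^...$, and a sentence-end lexical unit is appended so transfer sees a complete input.

```shell
# One hypothetical line of lt-expand output: surface form, colon, analysis.
expanded='hus:hus<n><nt><sg><ind>'

# Same chain as the script above: turn the separator into %, keep field 2
# (the analysis), wrap it in ^...$ and append a sentence-end LU.
stream=$(printf '%s\n' "$expanded" |
  sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' |
  sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g')

printf '%s\n' "$stream"
# ^hus<n><nt><sg><ind>$ ^.<sent><clb>$
```

This is exactly the Apertium stream format that apertium-transfer expects on its input.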
The following is a real-life inconsistency.sh script from apertium-br-fr that expands the dictionary of Breton and passes it through the translator:
TMPDIR=/tmp

lt-expand ../apertium-br-fr.br.dix | grep -v '<prn><enc>' | grep -e ':<:' -e '\w:\w' |\
  sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' | sed 's/^/^/g' | sed 's/$/$ ^.<sent>$/g' |\
  tee $TMPDIR/tmp_testvoc1.txt |\
  apertium-pretransfer |\
  apertium-transfer ../apertium-br-fr.br-fr.t1x ../br-fr.t1x.bin ../br-fr.autobil.bin |\
  apertium-interchunk ../apertium-br-fr.br-fr.t2x ../br-fr.t2x.bin |\
  apertium-postchunk ../apertium-br-fr.br-fr.t3x ../br-fr.t3x.bin |\
  tee $TMPDIR/tmp_testvoc2.txt |\
  lt-proc -d ../br-fr.autogen.bin > $TMPDIR/tmp_testvoc3.txt

paste -d _ $TMPDIR/tmp_testvoc1.txt $TMPDIR/tmp_testvoc2.txt $TMPDIR/tmp_testvoc3.txt |\
  sed 's/\^.<sent>\$//g' | sed 's/_/ ---------> /g'
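The final paste step is what makes the output readable: it lines up the three snapshots saved by tee (input to transfer, output of transfer, output of generation) side by side. A minimal sketch with three made-up one-line files (the contents are hypothetical, not real br-fr output):

```shell
# Three hypothetical one-line snapshots of the stages saved by tee:
printf '^ti<num>$\n'  > /tmp/tv1.txt   # before transfer
printf '^dix<num>$\n' > /tmp/tv2.txt   # after transfer
printf 'dix\n'        > /tmp/tv3.txt   # after generation

# paste joins corresponding lines with _, which sed turns into arrows,
# giving one "analysis ---------> transfer ---------> surface" line per LU.
joined=$(paste -d _ /tmp/tv1.txt /tmp/tv2.txt /tmp/tv3.txt | sed 's/_/ ---------> /g')
printf '%s\n' "$joined"
# ^ti<num>$ ---------> ^dix<num>$ ---------> dix
```

Since each stage writes one line per input line, the columns stay aligned and any @ or # can be traced back to the analysis that caused it.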
HFST
The Tatar-Bashkir language pair has a testvoc script for use with HFST.
lt-trim testvoc
When using lt-trim, there's no need to testvoc the analyser→bidix step (the @'s), since the analyser will only contain what the bidix contains.
However, you still need to look for #'s and /'s with:
- Corpus testvoc, to ensure your transfer rules are correct (see #Corpus testvoc below), and
- Generation testvoc, to ensure all the forms that are in both analyser and bidix also exist in your generator.
Since the analyser dix file can now be much larger than the trimmed analyser, the above testvoc script will give false hits. That is, a command like

 lt-expand ana.dix | lt-proc -b bidix.bin | apertium-transfer -b foo.t1x foo.t1x.bin | lt-proc -d gen.bin

will give lots of @'s that won't appear when running the real pipeline.
One solution is to add a "grep -v @" into the pipeline just after the "lt-proc -b" step.
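The effect of that filter is simply to drop every line where bidix lookup failed before the stream reaches generation. A minimal sketch with two hypothetical lexical units (the entries are invented for illustration):

```shell
# Two hypothetical lines after the lt-proc -b step: one fully translated,
# one where bidix lookup failed and the LU got an @ prepended.
filtered=$(printf '^hus<n><nt><sg><ind>$\n^@berre<adv>$\n' | grep -v @)
printf '%s\n' "$filtered"
# ^hus<n><nt><sg><ind>$
```

Note that grep -v works line by line, so this assumes one lexical unit per line, which is what lt-expand-based testvoc scripts produce.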
Another solution, if you have HFST installed, is to replace lt-expand ana.dix in your testvoc script with this sequence:
$ lt-print nno-nob.automorf.bin |sed 's/ /@_SPACE_@/g;s/ε/@0@/g' | hfst-txt2fst | hfst-fst2strings -c1
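The sed step in that pipeline only rewrites symbols into the conventions that HFST's text format expects: literal spaces become @_SPACE_@ and the epsilon symbol becomes @0@. A minimal illustration on a single made-up AT&T-style transition line (lt-print's output is tab-separated; the states and symbols here are invented):

```shell
# A made-up transition line as lt-print might output it,
# with an epsilon on the output side (fields are tab-separated).
att=$(printf '0\t1\tk\tε')

# Rewrite into HFST text-format conventions:
converted=$(printf '%s\n' "$att" | sed 's/ /@_SPACE_@/g;s/ε/@0@/g')
printf '%s\n' "$converted"
# 0	1	k	@0@
```

After this conversion, hfst-txt2fst can read the transducer and hfst-fst2strings -c1 enumerates its strings, playing the role lt-expand played before.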
Corpus testvoc
Typically, corpus testvoc consists of running a big corpus through your translator and grepping for @'s, /'s or #'s. You can use a command like the one below: it first deletes debug symbols from the input (so you don't get false hits), then runs it through your translator (the "dgen" mode runs the generation step using lt-proc -d, which shows the full analysis when a word is not in the generator), and finally greps for debug symbols (showing some context on either side just to make sure you see the symbol):
xzcat corpora/nno.xz | tr -d '#@/' | apertium -d . nno-nob-dgen | grep '.\{0,6\}[#@/].\{0,6\}'
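The two framing steps of that pipeline can be illustrated in isolation (the sample corpus lines here are made up): tr -d strips any debug symbols already present in the raw corpus, and the grep pattern keeps only lines where the translator itself produced one, with up to six characters of context on each side.

```shell
# Strip pre-existing debug symbols so they cannot cause false hits:
cleaned=$(printf 'pre#existing @symbols /here\n' | tr -d '#@/')
printf '%s\n' "$cleaned"
# preexisting symbols here

# Keep only lines containing a debug symbol, with some context around it:
hits=$(printf 'heilt fin linje\nei linje med @feil her\n' |
  grep '.\{0,6\}[#@/].\{0,6\}')
printf '%s\n' "$hits"
# ei linje med @feil her
```

On a terminal, grep will also colour-highlight the matched span, which is why the context in the pattern is useful.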
However, sometimes you want to get back to the original corpus line that gave a certain @ or #. This is one way of looking for @'s in a corpus while still being able to easily find the original line:
$ cat corpus.txt | apertium-destxt | nl | apertium -f none -d . sme-nob-interchunk1 |grep '\^@'
nl will number each line in corpus.txt, inside the superblank that is at each line-end. So if we now see
276 ]^part<part>{^å<part>$}$ ^verb<SV><inf><loc-for><m>{^@ballat<V><inf>$}$ ...
we can get the original line like this:
$ head -n 276 corpus.txt |tail -n1
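The head/tail combination works for any line number: taking the first N lines and keeping only the last of them yields exactly line N. A tiny self-contained sketch (the corpus file and its contents are made up):

```shell
# A three-line stand-in for corpus.txt:
printf 'line one\nline two\nline three\n' > /tmp/corpus-demo.txt

# Recover line 2, the way the command above recovers line 276:
line=$(head -n 2 /tmp/corpus-demo.txt | tail -n 1)
printf '%s\n' "$line"
# line two
```

This avoids loading the whole corpus, since head stops reading after N lines.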