Testvoc

From Apertium

Latest revision as of 22:30, 18 January 2021

In French: Test de vocabulaire

A testvoc is literally a test of vocabulary. At the most basic level, it just expands an sl dictionary and runs every analysed lexical form the analyser can produce through all the translation stages, checking that each possible input yields a sensible translation in the tl, without # or @ symbols.

However, as transfer rules may introduce errors that are not visible when translating single lexical units, a release-quality language pair also needs testvoc on phrases consisting of several lexical units. Often one can find a lot of the errors by running a large corpus (with all @, / or # symbols removed) through the translator, with debug symbols on, and grepping for [@#/].

It would, however, be nice to have a script that testvoc'ed all possible transfer-rule runs (without having to run all possible combinations of lexical units, which would take forever). One problem is that transfer rules can refer not only to tags but also to lemmas, and that multi-stage transfer means you have to test fairly long sequences.

Trimmed testvoc

Most new Apertium pairs use automatically trimmed analysers from monolingual dependencies, e.g. with lt-trim if the analyser is lttoolbox-based. When using lt-trim, there's no need to testvoc the analyser→bidix step (the '@'-marks), since the analyser will only contain what the bidix contains.

However, you still need to look for #'s and /'s with

  • Corpus testvoc to ensure your transfer rules are correct (see #Corpus testvoc below), and
  • Generation testvoc to ensure all the forms that are in both analyser and bidix also exist in your generator (see next section for real-life script).


Since the analyser dix file can be much larger than the trimmed analyser, testvoc scripts that don't take that into account will give false hits. That is, a command like lt-expand complete-analyser.dix | lt-proc -b bidix.bin | apertium-transfer -b foo.t1x foo.t1x.bin | lt-proc -d gen.bin will give lots of @'s that won't appear when running the real pipeline. The script in #Generation testvoc with lttoolbox analyser ignores any @'s and assumes lt-trim just works.

Generation testvoc

Generation testvoc with lttoolbox analyser

The script generation.sh in https://github.com/apertium/apertium-swe-dan/blob/master/dev/testvoc/generation.sh should work with any pipeline that uses lttoolbox on the analysis side.

It tests that anything the analyser can produce will go through to generation without '/' or '#'-marks (that is, there is one and only one form generated for anything the analyser can produce).

It doesn't test that the bidix contains everything the analyser has – we assume your Makefile uses lt-trim for that (all recent pairs with monolingual dependencies do).

It also only tests single words separated by periods – any generation problem that crops up with more context (typically due to transfer rules) will require a #Corpus testvoc. But it's a nice and fairly quick way to catch most of your dictionary consistency issues.

HFST-based testvoc of lttoolbox analyser

Another way to testvoc a trimmed analyser, if you have HFST installed, is to replace lt-expand ana.dix in a simple testvoc pipeline with this sequence:

lt-print trimmed-analyser.bin |sed 's/ /@_SPACE_@/g' | hfst-txt2fst -e ε | hfst-project -p lower | hfst-fst2strings -c0

(The -c0 says to never follow cycles; you can also follow them at most once with -c1 etc., but this can take a while depending on how many <re>'s you use.)

If we call that command "expand", then the full testvoc pipeline would be something like

expand | sed 's/^/^/;s/$/$/' | apertium-pretransfer | apertium-transfer …bin …t1x | lt-proc -d …autogen.bin

which may be a more "complete" testvoc.

Running https://github.com/apertium/apertium-swe-dan/blob/master/dev/testvoc/generation.sh with --hfst as the first argument will make it use this method.

Generation testvoc with HFST analyser

The Tatar-Bashkir language pair has a testvoc script for use with HFST; see https://github.com/apertium/apertium-tat-bak/blob/master/dev/inconsistency.sh which contains e.g.

hfst-fst2strings ../.deps/ba.LR-debug.hfst | sort -u |  sed 's/:/%/g' | cut -f1 -d'%' |  sed 's/^/^/g' | sed 's/$/$ ^.<sent>$/g' | tee $TMPDIR/tmp_testvoc1.txt |
        apertium-pretransfer|
        apertium-transfer ../apertium-tt-ba.ba-tt.t1x  ../ba-tt.t1x.bin  ../ba-tt.autobil.bin |
        apertium-transfer -n ../apertium-tt-ba.ba-tt.t2x  ../ba-tt.t2x.bin  | tee $TMPDIR/tmp_testvoc2.txt |
        hfst-proc -d ../ba-tt.autogen.hfst > $TMPDIR/tmp_testvoc3.txt
paste -d _ $TMPDIR/tmp_testvoc1.txt $TMPDIR/tmp_testvoc2.txt $TMPDIR/tmp_testvoc3.txt | sed 's/\^.<sent>\$//g' | sed 's/_/   --------->  /g'
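
The paste line at the end just stitches the three tee'd files into a three-column before/after view. A toy illustration of that display trick, with made-up one-line stand-ins for the tmp files (real testvoc runs hold one lexical unit per line):

```shell
# Made-up stand-ins for the three saved stages: input to transfer,
# output of transfer, and output of generation.
printf '^hus<n><nt><sg><ind>$\n' > /tmp/demo1.txt
printf '^hus<n><nt><sg><ind>$\n' > /tmp/demo2.txt
printf 'hus\n'                   > /tmp/demo3.txt

# Join the files column-wise on '_', then turn each '_' into an arrow:
paste -d _ /tmp/demo1.txt /tmp/demo2.txt /tmp/demo3.txt |
  sed 's/_/ ---------> /g'
# ^hus<n><nt><sg><ind>$ ---------> ^hus<n><nt><sg><ind>$ ---------> hus
```

Any line whose right-hand column still carries a # (or shows several forms separated by /) is then easy to spot, together with the exact input that caused it.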

Testvoc "lite"

A somewhat more up-to-date approach is used in apertium-kaz-kir/testvoc/lite (https://github.com/apertium/apertium-kaz-kir/tree/master/testvoc/lite), based on apertium-kaz/tests/morphotactics (https://github.com/apertium/apertium-kaz/tree/master/tests/morphotactics). Also see apertium-uzb-kaa/testvoc/lite (https://github.com/apertium/apertium-uzb-kaa/tree/master/testvoc/lite).

Words in bidix but not in analyser

The script bidix-unknowns.sh in https://github.com/apertium/apertium-swe-dan/blob/master/dev/testvoc/ will look for entries in bidix that your analyser would never produce. It should work with any pipeline that uses lttoolbox on the analysis side.

This is useful for making sure all your hard bidix work is actually useful. It may find lemmas that are completely missing from the analyser, or that simply have the wrong gender-tag or similar.
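
The underlying idea can be sketched with standard tools: turn the analyser's output and the bidix's source side into sorted lists, then take the set difference with comm. The lists below are made-up stand-ins for real expansion output; this is only an illustration of the principle, not what bidix-unknowns.sh literally does:

```shell
# Made-up sorted lists: what the analyser can produce, and what the
# source side of the bidix expects.
printf 'bil<n>\nhus<n>\n'  > /tmp/ana-side.txt
printf 'hus<n>\nkatt<n>\n' > /tmp/bidix-side.txt

# comm -13 suppresses lines unique to the first file and lines common
# to both, leaving only bidix entries the analyser would never produce:
comm -13 /tmp/ana-side.txt /tmp/bidix-side.txt
# katt<n>
```

Here katt<n> is a bidix entry that can never fire, e.g. because the lemma is missing from the analyser or carries a different tag there.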


Corpus testvoc

Typically, corpus testvoc consists of running a big corpus through your translator and grepping for @'s, /'s or #'s. You can use a command like the one below to first delete debug symbols from the input (so you don't get false hits), run it through your translator (the "dgen" mode runs the generation step using lt-proc -d, which shows the full analysis when a word is not in the generator), and then grep for debug symbols (matching some context on either side just to make sure you see the symbol):

xzcat corpora/nno.xz | tr -d '#@/' | apertium -d . nno-nob-dgen | grep '.\{0,6\}[#@/].\{0,6\}'
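
The final grep matches any debug symbol with up to six characters of context on either side. Adding -o (not used in the command above, which prints whole lines) shows just the matched window; here on a made-up line:

```shell
# A made-up piece of translator output with a leftover @-mark:
printf 'dei nye @hus paa berget\n' | grep -o '.\{0,6\}[#@/].\{0,6\}'
# i nye @hus pa
```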


However, sometimes you want to get to the original line in the corpus that gave that @ or #.

This is one way of looking for @'s in a corpus while still being able to easily find the original line:

$ cat corpus.txt | apertium-destxt | nl | apertium -f none -d . sme-nob-interchunk1 |grep '\^@' 

nl will number each line in corpus.txt, inside the superblank that is at each line-end. So if we now see

   276  ]^part<part>{^å<part>$}$  ^verb<SV><inf><loc-for><m>{^@ballat<V><inf>$}$
...

we can get the original line like this:

$ sed -n '276p' corpus.txt


Testvoc without trimming

The following is a very simple script illustrating testvoc for 1-stage transfer. The tee command saves the output from transfer, which includes words (actually lexical units) that passed successfully through transfer and words that got an @ prepended. The last file is output from generation, which includes words that were successfully generated, and words that have a # prepended (anything with an @ will also get a #):

MONODIX=apertium-nn-nb.nn.dix
T1X=apertium-nn-nb.nn-nb.t1x
BIDIXBIN=nn-nb.autobil.bin
GENERATORBIN=nn-nb.autogen.bin
ALPHABET="ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅabcdefghijklmnopqrstuvwxyzæøåcqwxzCQWXZéèêóòâôÉÊÈÓÔÒÂáàÁÀäÄöÖ" # from $MONODIX

lt-expand ${MONODIX} | grep -e ':<:' -e "[$ALPHABET]:[$ALPHABET]" |\
sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' |  sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g' |\
apertium-transfer ${T1X} ${T1X}.bin ${BIDIXBIN} | tee after-transfer.txt |\
lt-proc ${GENERATORBIN} > after-generation.txt
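
The grep/sed/cut chain above just reshapes lt-expand output (surface:analysis pairs, with :<: marking direction-restricted entries) into Apertium stream format. A toy run of only that text-munging part, with made-up entries and a literal character class standing in for $ALPHABET:

```shell
# Two made-up lt-expand lines: a plain surface:analysis pair and a
# direction-restricted one.  Replacing ':<:' first means the analysis
# always ends up in field 2 after the general ':' substitution.
printf 'huset:hus<n><nt><sg><def>\nhus:<:hus<n><nt><sg><ind>\n' |
grep -e ':<:' -e '[A-Za-z]:[A-Za-z]' |
sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' |
sed 's/^/^/g' | sed 's/$/$ ^.<sent><clb>$/g'
# ^hus<n><nt><sg><def>$ ^.<sent><clb>$
# ^hus<n><nt><sg><ind>$ ^.<sent><clb>$
```

Each analysis is wrapped in ^…$ and followed by a sentence-final full stop, so transfer sees one tiny "sentence" per dictionary entry.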


The following is a real-life inconsistency.sh script from apertium-br-fr that expands the Breton dictionary and passes it through the translator:

TMPDIR=/tmp

lt-expand ../apertium-br-fr.br.dix | grep -v '<prn><enc>' | grep -e ':<:' -e '\w:\w' |\
 sed 's/:<:/%/g' | sed 's/:/%/g' | cut -f2 -d'%' |  sed 's/^/^/g' | sed 's/$/$ ^.<sent>$/g' |\
 tee $TMPDIR/tmp_testvoc1.txt |\
        apertium-pretransfer|\
        apertium-transfer ../apertium-br-fr.br-fr.t1x  ../br-fr.t1x.bin  ../br-fr.autobil.bin |\
        apertium-interchunk ../apertium-br-fr.br-fr.t2x  ../br-fr.t2x.bin |\
        apertium-postchunk ../apertium-br-fr.br-fr.t3x  ../br-fr.t3x.bin  |\
        tee $TMPDIR/tmp_testvoc2.txt |\
        lt-proc -d ../br-fr.autogen.bin > $TMPDIR/tmp_testvoc3.txt

paste -d _ $TMPDIR/tmp_testvoc1.txt $TMPDIR/tmp_testvoc2.txt $TMPDIR/tmp_testvoc3.txt |\
 sed 's/\^.<sent>\$//g' | sed 's/_/   --------->  /g'



See also

  • Automatically trimming a monodix
  • Why we trim
  • Finding errors in dictionaries