Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).

[[Calculer la couverture|En français]]

==Simple bidix-trimmed coverage testing==

First install apertium-cleanstream:

 svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
 cd apertium-cleanstream
 make
 sudo cp apertium-cleanstream /usr/local/bin

'''Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see [[Migrating tools to GitHub]].'''

Then save this as coverage.sh:

 #!/bin/bash
 # Usage: cat corpus.txt | bash coverage.sh asm-eng-biltrans  (run from the pair directory; needs 'calc')
 mode=$1
 outfile=/tmp/$mode.clean
 apertium -d . $mode | apertium-cleanstream -n > $outfile
 # analyser-unknown tokens are marked /*, bidix-unknown tokens /@
 total=$(grep -c '^\^' $outfile)
 unknown=$(grep -c '/\*' $outfile)
 bidix_unknown=$(grep -c '/@' $outfile)
 known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
 echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
 echo "Top unknown words:"
 grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head

And run it like:

 cat asm.corpus | bash coverage.sh asm-eng-biltrans

(The bidix-unknown count should always be 0 if your pair uses [[lt-trim|automatic analyser trimming]].)

==TODO: paradigm-coverage (less naïve)==

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos. So if the analysed corpus contains

<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>

then the output would be

<pre>
3 mus<n><f>
2 muse<vblex>
</pre>

From this we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).

We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:

<pre>
0.33 mus<n><f>
0.75 muse<vblex>
</pre>
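
Nothing implements this yet. As a starting point, here is a minimal, untested sketch of the basic binning step, assuming one analysed token per line in the format shown above. It keys on the lemma plus the first tag only; keeping more tags (to get bins like the mus<n><f> of the example) or dividing by the number of forms in the pardef would be straightforward extensions.

<pre>
#!/bin/bash
# Hypothetical helper, not an existing Apertium tool: reads analysed tokens like
#   musa/mus<n><f><sg><def>/muse<vblex><past>
# one per line on stdin, and prints for each lemma+mainpos bin the number of
# tokens that have at least one analysis in that bin.
awk -F'/' '
{
    split("", seen)                 # count each bin at most once per token
    for (i = 2; i <= NF; i++) {     # $1 is the surface form, $2..$NF are analyses
        an = $i
        sub(/^\*/, "", an)          # drop the "unknown word" star, if any
        if (match(an, /^[^<]+<[^>]+>/)) {
            bin = substr(an, RSTART, RLENGTH)   # lemma plus first tag, e.g. "mus<n>"
            if (!(bin in seen)) { seen[bin] = 1; count[bin]++ }
        }
    }
}
END { for (bin in count) print count[bin], bin }
' | sort -nr
</pre>
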
==Faster coverage testing with frequency lists==

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:

make-freqlist.sh:
<pre>
#!/bin/bash

if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
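
The result is an ordinary 'sort | uniq -c | sort -nr' frequency list, that is, each line holds a count followed by a token, e.g. (made-up numbers):

<pre>
  10234 og
   8712 i
    193 fjord
</pre>
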
And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:
<pre>
#!/bin/bash

set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
/[/][*@]/ {
    unknown+=$2
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
    next
}
{
    known+=$2
}
END {
    total=known+unknown
    known_pct=100*known/total
    unk_pct=100*unknown/total
    print known_pct" % known of total "total" tokens"
}'
</pre>

Usage:

 $ chmod +x make-freqlist.sh freqlist-coverage.sh
 $ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
 $ < nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph

==coverage.py==

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat.

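For orientation, here is a rough shell sketch of the same idea. This is not coverage.py's actual command line; the dump URL, the crude markup stripping and the nno-nob-biltrans mode are illustrative assumptions only.

<pre>
# Illustrative sketch: download a Wikipedia dump, decompress it, crudely strip
# markup and feed it to the coverage.sh from above, run from a pair directory
# (here assumed to be apertium-nno-nob).
curl -Ls https://dumps.wikimedia.org/nnwiki/latest/nnwiki-latest-pages-articles.xml.bz2 |
  bzcat |
  sed 's/<[^>]*>/ /g' |
  bash coverage.sh nno-nob-biltrans
</pre>
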
'''Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see [[Migrating tools to GitHub]].'''

== See also ==

* [[Wikipedia dumps]]
* [[Cleanstream]]

[[Category:Documentation]]
[[Category:Documentation in English]]