Calculating coverage

==More involved scripts==
 
Often it's nice to clean up Wikipedia markup and other fluff before coverage testing.

Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).
 
 
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
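For example (a small illustration, not part of the scripts below): GNU sed accepts \n in the replacement, while the portable form uses a backslash followed by an actual newline:

<pre>
# GNU sed only: \n in the replacement becomes a newline
echo 'a>b' | sed 's/>/>\n/g'
# portable (works with BSD/macOS sed too): escape an actual newline
echo 'a>b' | sed 's/>/>\
/g'
</pre>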
 
 
wikicat.sh:
 
<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all inter-language wiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>
 
 
count-tokenized.sh:
 
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc $1 | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
 
 
To count all the tokens in a wiki dump:

<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
 
To count the tokens that have at least one analysis (naïve coverage):

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
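Dividing this number by the total token count above gives the naïve coverage. For example, a small sketch reusing the two commands, with perl doing the arithmetic (as corpus-stat.sh does below):

<pre>
$ total=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l)
$ known=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l)
$ perl -e 'printf "Naive coverage: %.1f %%\n", 100*$ARGV[0]/$ARGV[1]' $known $total
</pre>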
 
To find the top unknown tokens (the sed strips blanks; put both a tab and a space inside the brackets):

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ ]*//g' |\
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
 
 
=== Script ready to run ===
 
 
corpus-stat.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"

# If you don't have calc, change the above line to:
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
 
Sample output:
 
<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
191 ^Apollo/*Apollo$
104 ^Aramaic/*Aramaic$
91 ^Alberta/*Alberta$
81 ^de/*de$
80 ^Abu/*Abu$
63 ^Bakr/*Bakr$
62 ^Agassi/*Agassi$
59 ^Carnegie/*Carnegie$
58 ^Agrippina/*Agrippina$
58 ^Achilles/*Achilles$
56 ^Adelaide/*Adelaide$
</pre>
 
   
 


==Simple bidix-trimmed coverage testing==

First install apertium-cleanstream:

<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>

Then save this as coverage.sh:

<pre>
#!/bin/bash
mode=$1
outfile=/tmp/$mode.clean
apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head
</pre>

And run it like:

<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)

==TODO: paradigm-coverage (less naïve)==

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains

<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>

then the output has

<pre>
3 mus<n><f>
2 muse<vblex>
</pre>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix). A rough sketch of this binning step is given below.

We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun pardef only 3, then the above output should be even more skewed:

<pre>
0.33 mus<n><f>
0.75 muse<vblex>
</pre>
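A minimal sketch of such a binning script (my own illustration, not an existing Apertium tool): it expects one token per line in the surface/analysis/analysis format shown above and bins on the lemma plus its first tag; extend the matched prefix if you also want the gender tag, as in the example output. The pardef-weighting idea is not included.

<pre>
#!/bin/bash
# paradigm-freq.sh (hypothetical helper): sum token frequencies per lemma + first tag.
# Expects one token per line, surface/analysis/analysis..., as in the example above.
awk -F'/' '
{
    split("", seen)                         # count each lemma+tag at most once per token
    for (i = 2; i <= NF; i++) {
        if (match($i, /^[^<*]+<[^>]+>/)) {  # lemma plus first tag, e.g. muse<vblex>
            bin = substr($i, RSTART, RLENGTH)
            if (!(bin in seen)) { seen[bin] = 1; count[bin]++ }
        }
    }
}
END { for (bin in count) print count[bin], bin }
' | sort -nr
</pre>

On the three example lines above this prints "3 mus<n>" and "2 muse<vblex>".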

==Faster coverage testing with frequency lists==

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:

make-freqlist.sh:

<pre>
#!/bin/bash

if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
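For instance, the tokenisation step just turns whitespace and punctuation into newlines and drops the empty lines (a quick illustration):

<pre>
$ echo 'Hei, verda!' | tr '[:space:][:punct:]' '\n' | grep .
Hei
verda
</pre>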

And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:

<pre>
#!/bin/bash

set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

# Wrap the frequency count in <apertium-notrans> tags so that apertium does not try
# to analyse it, and append " ." to give each line a sentence end. The awk below then
# splits on those tags and the sentence marker, treating $2 as the frequency and $3 as
# the analysed token.
sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
/[/][*@]/ {
    unknown+=$2
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
    next
}
{
    known+=$2
}
END {
    total=known+unknown
    known_pct=100*known/total
    unk_pct=100*unknown/total
    print known_pct" % known of total "total" tokens"
}'
</pre>

Usage:

<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ <nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>

==coverage.py==

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat (?)

==See also==