Calculating coverage
==More involved scripts==

Often it's nice to clean up Wikipedia fluff etc. before coverage testing.

Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).

(Mac OS X `sed` doesn't allow \n in replacements, so I just use an actual (escaped) newline below.)
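If your sed is GNU sed, \n does work directly in the replacement text, so the first two sed calls of the script below could equally be written as one call (a sketch, assuming GNU sed):

<pre>
# GNU sed accepts \n in the replacement:
bzcat "$@" | sed 's/>/>\n/g; s/</\n</g'
</pre>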
wikicat.sh:

<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>
count-tokenized.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc $1 | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
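If Perl is available, the two seds can also be collapsed into a single substitution, since Perl takes \n directly in the replacement (a sketch, not part of the original script):

<pre>
apertium-destxt | lt-proc $1 | apertium-retxt |\
perl -pe 's/\$[^^]*\^/\$\n^/g'
</pre>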
To find all tokens from a wiki dump:

<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
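Naïve coverage is then simply the second count divided by the first. A small sketch that prints the percentage directly, reusing the commands above:

<pre>
$ total=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l)
$ known=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l)
$ awk -v k="$known" -v t="$total" 'BEGIN { printf "naive coverage: %.1f %%\n", 100*k/t }'
</pre>

(The corpus-stat.sh script further down wraps the same calculation up with nicer output.)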
To find the top unknown tokens:

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |\
  sed 's/[ \t]*//g' |\
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>

(The sed expression strips spaces and tabs; with non-GNU sed, put a literal tab inside the brackets instead of \t.)
===Script ready to run===

corpus-stat.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
# If you don't have calc, change the above line to:
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
Sample output:

<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>
==Simple bidix-trimmed coverage testing==

First install apertium-cleanstream:

<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>
Then save this as coverage.sh:

<pre>
#!/bin/bash
mode=$1
outfile=/tmp/$mode.clean
apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head
</pre>
And run it like:

<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
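For reference: tokens unknown to the analyser carry a * in the stream (like ^Apollo/*Apollo$ in the sample output in the section above), while tokens the analyser knows but the bidix doesn't are marked with @ on the target side, roughly like this (hypothetical words):

<pre>
^foo/*foo$
^house<n><sg>/@house<n><sg>$
</pre>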
==TODO: paradigm-coverage (less naïve)==

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains

<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>

then the output has

<pre>
3 mus<n><f>
2 muse<vblex>
</pre>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).
We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:

<pre>
1 mus<n><f>
0.33 muse<vblex>
</pre>
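A rough sketch of the binning step (a hypothetical helper, not an existing Apertium tool; it reads analysed lines like those above on stdin and, for simplicity, keeps only the lemma and the first tag, so the gender tag on the nouns in the example would need tagset-specific handling):

<pre>
#!/bin/sh
# paradigm-bins.sh (hypothetical sketch)
# For each token, count every distinct lemma+mainpos it has an analysis for.
awk -F'/' '
{
  split("", seen)                        # forget bins seen on the previous token
  for (i = 2; i <= NF; i++) {            # field 1 is the surface form
    if (match($i, /^[^<]+<[^>]+>/)) {    # lemma plus first tag
      key = substr($i, RSTART, RLENGTH)
      if (!(key in seen)) { seen[key] = 1; bin[key]++ }
    }
  }
}
END { for (b in bin) print bin[b], b }
' | sort -nr
</pre>

The weighted variant would additionally divide each count by the number of unique forms in the corresponding pardef, which requires reading the pardef inventory from the dictionary and is left out here.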
==Faster coverage testing with frequency lists==

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:

make-freqlist.sh:

<pre>
#!/bin/bash
if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
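The resulting frequency list has one "count, space, token" line per type, roughly like this (made-up numbers):

<pre>
  21847 og
  15726 i
   9153 det
</pre>

This is the format the next script expects on stdin.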
And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:

<pre>
#!/bin/bash
set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' \
    | apertium -f html-noent "$@" \
    | awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
        /[/][*@]/ {
            unknown+=$2
            if(!printed) print "Top unknown tokens:"
            if(++printed<10) print $2,$3
            next
        }
        { known+=$2 }
        END {
            total=known+unknown
            known_pct=100*known/total
            unk_pct=100*unknown/total
            print known_pct" % known of total "total" tokens"
        }'
</pre>
Usage:

<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ <nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>
==coverage.py==

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat (?)