Calculating coverage
==More involved scripts==
Often it's nice to clean up Wikipedia fluff (markup, links, tables) before coverage testing.
Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).
(Mac OS X `sed' doesn't allow \n in replacements, so these scripts just use an actual, escaped newline.)
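If you have GNU sed, you can write the newline directly in the replacement instead, e.g.:
<pre>
sed 's/>/>\n/g'    # GNU sed only; BSD/Mac OS X sed needs the literal escaped newline
</pre>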
wikicat.sh:
<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt
# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
# replace full stops with spaces
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ 	]*[A-ZÆØÅ]' # Your alphabet here
</pre>
count-tokenized.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Tokenise the corpus, one token per line (count the words by piping to wc -l):
apertium-destxt | lt-proc "$1" | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
To count all tokens from a wiki dump:
<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To count the tokens that have at least one analysis (the numerator of naïve coverage):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
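Naïve coverage is then just the second count divided by the first. A quick way to get the percentage in one go (a sketch along the same lines, using the same file and analyser as above):
<pre>
$ total=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l)
$ known=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -cv '\/\*')
$ echo "scale=1; 100 * $known / $total" | bc
</pre>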
To find the top unknown tokens (the sed strips spaces and tabs):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ 	]*//g' |\
   grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
=== Script ready to run ===
corpus-stat.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
#CMD="cat corpa/en.crp.txt"
CMD="cat"
F=/tmp/corpus-stat-res.txt
# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F
NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"
# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"
# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"
#If you don't have calc, change the above line to:
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)
# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
Sample output:
<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>
==Simple bidix-trimmed coverage testing==
First install apertium-cleanstream:
<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>
Then save this as coverage.sh:
<pre>
#!/bin/bash
mode=$1
outfile=/tmp/$mode.clean
apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head
</pre>
And run it like:
<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>
(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
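For reference, a token that the analyser knows but the bidix doesn't gets marked with @ in the biltrans stream, so the lines counted by grep -c '/@' look roughly like this (word and tags illustrative):
<pre>
^ord<n><m><sg><ind>/@ord<n><m><sg><ind>$
</pre>
Analyser-unknown tokens are the usual ^foo/*foo$ lines, counted by grep -c '/\*'.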
==TODO: paradigm-coverage (less naïve)==
On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains
<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>
then output has
<pre>
3 mus<n><f>
2 muse<vblex>
</pre>
and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).
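Nothing here is implemented yet, but a rough sketch of the binning step might look like the following (it bins on lemma plus just the first tag, assumes one surface form with its analyses per line as in the example above, and uses analysed.txt as a stand-in for your analysed corpus):
<pre>
awk -F/ '
{
  split("", seen)               # reset the per-form duplicate check
  for (i = 2; i <= NF; i++) {   # $1 is the surface form, the rest are analyses
    bin = $i
    sub(/>.*/, ">", bin)        # keep only the lemma and the first tag
    if (!(bin in seen)) { seen[bin] = 1; count[bin]++ }
  }
}
END { for (b in count) print count[b], b }
' analysed.txt | sort -nr
</pre>
On the three example lines this prints "3 mus<n>" and "2 muse<vblex>"; weighting by pardef size would then be one extra division per bin.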
We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:
<pre>
0.33 mus<n><f>
0.75 muse<vblex>
</pre>
==Faster coverage testing with frequency lists==
If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:
make-freqlist.sh:
<pre>
#!/bin/bash
if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi
tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
And this script runs your analyser, summing up the frequencies:
freqlist-coverage.sh:
<pre>
#!/bin/bash
set -e -u
if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi
sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
/[/][*@]/ {
    unknown+=$2
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
    next
}
{
    known+=$2
}
END {
    total=known+unknown
    known_pct=100*known/total
    unk_pct=100*unknown/total
    print known_pct" % known of total "total" tokens"
}'
</pre>
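To see what the sed is setting up, here is roughly what one frequency-list line looks like before and after it (the word is illustrative):
<pre>
   1234 hus
<apertium-notrans>1234</apertium-notrans>hus .
</pre>
The frequency is wrapped in an apertium-notrans element (presumably so the analyser passes it through untouched) and a full stop is appended as a sentence delimiter; the awk then splits each analysed line on those tags and on the final full-stop analysis, so $2 is the frequency and $3 is the word's analyses.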
 
Usage:
<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ < nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>
==coverage.py==
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat (?)

