Difference between revisions of "Calculating coverage"
Line 10: | Line 10: | ||
Then save this as coverage.sh: |
Then save this as coverage.sh: |
||
#!/bin/bash |
|||
set -e -u |
|||
mode=$1 |
|||
outfile=/tmp/$mode.clean |
outfile=/tmp/$mode.clean |
||
apertium -d . $mode | apertium-cleanstream -n > $outfile |
apertium -d . $mode | apertium-cleanstream -n > $outfile |
Revision as of 09:10, 2 June 2014
Contents
Simple bidix-trimmed coverage testing
First install apertium-cleanstream:
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream cd apertium-cleanstream make sudo cp apertium-cleanstream /usr/local/bin
Then save this as coverage.sh:
#!/bin/bash set -e -u mode=$1 outfile=/tmp/$mode.clean apertium -d . $mode | apertium-cleanstream -n > $outfile total=$(grep -c '^\^' $outfile) unknown=$(grep -c '/\*' $outfile) bidix_unknown=$(grep -c '/@' $outfile) known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)") echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)" echo "Top unknown words:" sort $outfile | uniq -c | sort -nr | head
And run it like
cat asm.corpus | bash coverage.sh asm-eng-biltrans
(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
coverage.py
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat (?)
More involved scripts
Often it's nice to clean up wikipedia fluff etc. for coverage testing.
Notes on calculating coverage from wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
wikicat.sh:
#!/bin/sh # Clean up wikitext for running through apertium-destxt # awk prints full lines, make sure each html element has one bzcat "$@" | sed 's/>/>\ /g' | sed 's/</\ </g' |\ # want only stuff between <text...> and </text> awk ' /<text.*>/,/<\/text>/ { print $0 } ' |\ sed 's/\./ /g' |\ # Drop all transwiki links sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\ # wiki markup, retain bar and fie from [[foo|bar]] [[fie]] sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\ # wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo] sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\ # remove entities sed 's/&[^;]*;/ /g' |\ # and put space instead of punctuation sed 's/[;:?,]/ /g' |\ # Keep only lines starting with a capital letter, removing tables with style info etc. grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
count-tokenized.sh:
#!/bin/sh # http://wiki.apertium.org/wiki/Asturian#Calculating_coverage # Calculate the number of tokenised words in the corpus: apertium-destxt | lt-proc $1 |apertium-retxt |\ # for some reason putting the newline in directly doesn't work, so two seds sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\ ^/g'
To find all tokens from a wiki dump:
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
To find all tokens with at least one analysis (naïve coverage):
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
To find the top unknown tokens:
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ ]*//g' |\ # tab or space grep '\/\*' | sort -f | uniq -c | sort -gr | head
Script ready to run
corpus-stat.sh
#!/bin/sh # http://wiki.apertium.org/wiki/Asturian#Calculating_coverage # Example use: # zcat corpa/en.crp.txt.gz | sh corpus-stat.sh #CMD="cat corpa/en.crp.txt" CMD="cat" F=/tmp/corpus-stat-res.txt # Calculate the number of tokenised words in the corpus: # for some reason putting the newline in directly doesn't work, so two seds $CMD | apertium-destxt | lt-proc en-eo.automorf.bin |apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\ ^/g' > $F NUMWORDS=`cat $F | wc -l` echo "Number of tokenised words in the corpus: $NUMWORDS" # Calculate the number of words that are not unknown NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l` echo "Number of known words in the corpus: $NUMKNOWNWORDS" # Calculate the coverage COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"` echo "Coverage: $COVERAGE %" #If you don't have calc, change the above line to: #COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS) # Show the top-ten unknown words. echo "Top unknown words in the corpus:" cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
Sample output:
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh Number of tokenised words in the corpus: 478187 Number of known words in the corpus: 450255 Coverage: 94.2 % Top unknown words in the corpus: 191 ^Apollo/*Apollo$ 104 ^Aramaic/*Aramaic$ 91 ^Alberta/*Alberta$ 81 ^de/*de$ 80 ^Abu/*Abu$ 63 ^Bakr/*Bakr$ 62 ^Agassi/*Agassi$ 59 ^Carnegie/*Carnegie$ 58 ^Agrippina/*Agrippina$ 58 ^Achilles/*Achilles$ 56 ^Adelaide/*Adelaide$