Calculating coverage

Simple bidix-trimmed coverage testing

First install apertium-cleanstream:

svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
sudo cp apertium-cleanstream /usr/local/bin

Then save the following as bidix-trimmed-coverage.sh and run it from the language pair directory:

cat asm.corpus | apertium -d . asm-eng-biltrans | apertium-cleanstream -n > asm-eng-biltrans.clean
total=$(grep -c '^\^' asm-eng-biltrans.clean)
unknown=$(grep -c '/\*' asm-eng-biltrans.clean)
bidix_unknown=$(grep -c '/@' asm-eng-biltrans.clean)
known_percent=$(calc -p  "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
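
To see what the three counts pick up, here is a minimal sketch that runs the same grep patterns over a small hand-written sample instead of real apertium output. The sample lines and the `sample` temporary file are assumptions made for illustration, and awk stands in for calc so that the sketch needs no extra tools.

```shell
# Hand-written sample of cleaned biltrans output, one lexical unit per line
# (illustrative only, not produced by a real language pair)
sample=$(mktemp)
cat > "$sample" <<'EOF'
^house<n><sg>/casa<n><f><sg>$
^*blorp/*blorp$
^cat<n><sg>/@cat<n><sg>$
^run<vblex><pres>/correr<vblex><pres>$
EOF

total=$(grep -c '^\^' "$sample")          # every lexical unit
unknown=$(grep -c '/\*' "$sample")        # unknown to the analyser
bidix_unknown=$(grep -c '/@' "$sample")   # analysed, but missing from the bidix

# awk replaces calc here; same formula as in the script above
known_percent=$(awk -v t="$total" -v u="$unknown" -v b="$bidix_unknown" \
  'BEGIN { printf "%.3f", 100 * (t - u - b) / t }')
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
rm -f "$sample"
```

With this sample, two of the four units are fully known, so the script reports 50.000 %.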


Other scripts

Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).

(Mac OS X `sed' does not accept \n in replacement text, so the scripts below use a literal, backslash-escaped newline instead.)


# wikicat.sh: clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ 	]*[A-ZÆØÅ]' # Your alphabet here
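
As a quick check of the link-stripping step, the following sketch applies the same three sed commands to a single hand-written line (the line itself is made up, not taken from a dump). It only exercises the [[foo|bar]] and [[fie]] handling, not the rest of the pipeline.

```shell
# Hand-written wikitext line (an assumption for illustration)
line='See [[foo|bar]] and [[fie]] for details'

# Same three substitutions as in the script above:
# drop the link target before the bar, then the closing and opening brackets
cleaned=$(printf '%s\n' "$line" |
  sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g')

printf '%s\n' "$cleaned"
# prints: See bar and fie for details
```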


# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# count-tokenized.sh: print the tokenised words in the corpus, one per line.
# The analyser (e.g. nn-nb.automorf.bin) is given as the first argument.
apertium-destxt | lt-proc "$1" | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
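
The two sed commands can be tried without an analyser installed. The sketch below feeds them a hand-written line in lt-proc's output format (the analyses are made up for illustration) and counts the resulting lines; `analysed`, `tokens` and `count` are names chosen for this example only.

```shell
# Hand-written sample in lt-proc output format (analyses are illustrative)
analysed='^The/the<det><def><sp>$ ^cat/cat<n><sg>$ ^blorp/*blorp$'

# First sed: delete whatever sits between one unit's $ and the next unit's ^.
# Second sed: break the line between $ and ^, giving one unit per line.
tokens=$(printf '%s\n' "$analysed" |
  sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g')

printf '%s\n' "$tokens"
count=$(printf '%s\n' "$tokens" | wc -l | tr -d ' ')
echo "$count tokens"
```

Because each lexical unit ends up on its own line, `wc -l` and `grep` can then be used to count tokens, as in the commands that follow.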

To count all tokens from a wiki dump:

$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l

To count the tokens with at least one analysis (naïve coverage is this number divided by the total above):

$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l

To find the top unknown tokens:

$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ 	]*//g' | # strip tabs and spaces
   grep '\/\*' | sort -f | uniq -c | sort -gr | head
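
The ranking step can be sketched on its own with a few hand-written token lines (made up for illustration; `tokens` and `top` are names used only here). This version uses `sort -nr` rather than `sort -gr`, since `-n` is the POSIX option and is sufficient for integer counts.

```shell
# Hand-written token lines, one lexical unit per line (illustrative only)
tokens='^Apollo/*Apollo$
^the/the<det><def><sp>$
^Abu/*Abu$
^Apollo/*Apollo$
^Apollo/*Apollo$
^Abu/*Abu$
^Bakr/*Bakr$'

# Keep unknown units, group identical ones, and rank them by frequency
top=$(printf '%s\n' "$tokens" |
  grep '/\*' | sort -f | uniq -c | sort -nr | head -3)
printf '%s\n' "$top"
```

Here `^Apollo/*Apollo$` comes out on top with a count of 3, followed by `^Abu/*Abu$` with 2.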

Script ready to run


# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

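#CMD="cat corpa/en.crp.txt"
# Read the corpus from standard input, as in the example use above
CMD="cat"

# Temporary file for the tokenised corpus
# (the location is an assumption; any writable file works)
F=$(mktemp)

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > "$F"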

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown

NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

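# Calculate the coverage (percentage, rounded to one decimal place)

COVERAGE=$(calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10")
echo "Coverage: $COVERAGE %"

# If you don't have calc, change the COVERAGE line above to the following
# (note that it truncates instead of rounding):
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)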

# Show the top-ten unknown words.

echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10

Sample output:

$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 	94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$

