Calculating coverage
==More involved scripts==

Often it's nice to clean up Wikipedia fluff etc. before coverage testing.

Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).

(Mac OS X `sed` doesn't allow \n in replacements, so I just use an actual (escaped) newline below.)
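If your sed is GNU sed, \n does work directly in the replacement text, so the first two sed calls of the script below could equally be written as one call (a sketch, assuming GNU sed):

<pre>
# GNU sed accepts \n in the replacement:
bzcat "$@" | sed 's/>/>\n/g; s/</\n</g'
</pre>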
wikicat.sh:

<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>
count-tokenized.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc $1 | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
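If Perl is available, the two seds can also be collapsed into a single substitution, since Perl takes \n directly in the replacement (a sketch, not part of the original script):

<pre>
apertium-destxt | lt-proc $1 | apertium-retxt |\
perl -pe 's/\$[^^]*\^/\$\n^/g'
</pre>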
To find all tokens from a wiki dump:

<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
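Naïve coverage is then simply the second count divided by the first. A small sketch that prints the percentage directly, reusing the commands above:

<pre>
$ total=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l)
$ known=$(cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l)
$ awk -v k="$known" -v t="$total" 'BEGIN { printf "naive coverage: %.1f %%\n", 100*k/t }'
</pre>

(The corpus-stat.sh script further down wraps the same calculation up with nicer output.)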
To find the top unknown tokens:

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |\
  sed 's/[ \t]*//g' |\
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>

(The sed expression strips spaces and tabs; with non-GNU sed, put a literal tab inside the brackets instead of \t.)
===Script ready to run===

corpus-stat.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
# If you don't have calc, change the above line to:
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
Sample output:

<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>
==Simple bidix-trimmed coverage testing==

First install apertium-cleanstream:

<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>
Then save this as coverage.sh:

<pre>
#!/bin/bash
mode=$1
outfile=/tmp/$mode.clean
apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head
</pre>
And run it like:

<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
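For reference: tokens unknown to the analyser carry a * in the stream (like ^Apollo/*Apollo$ in the sample output in the section above), while tokens the analyser knows but the bidix doesn't are marked with @ on the target side, roughly like this (hypothetical words):

<pre>
^foo/*foo$
^house<n><sg>/@house<n><sg>$
</pre>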
==TODO: paradigm-coverage (less naïve)==

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains

<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>

then the output has

<pre>
3 mus<n><f>
2 muse<vblex>
</pre>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).
We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:

<pre>
1 mus<n><f>
0.33 muse<vblex>
</pre>
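A rough sketch of the binning step (a hypothetical helper, not an existing Apertium tool; it reads analysed lines like those above on stdin and, for simplicity, keeps only the lemma and the first tag, so the gender tag on the nouns in the example would need tagset-specific handling):

<pre>
#!/bin/sh
# paradigm-bins.sh (hypothetical sketch)
# For each token, count every distinct lemma+mainpos it has an analysis for.
awk -F'/' '
{
  split("", seen)                        # forget bins seen on the previous token
  for (i = 2; i <= NF; i++) {            # field 1 is the surface form
    if (match($i, /^[^<]+<[^>]+>/)) {    # lemma plus first tag
      key = substr($i, RSTART, RLENGTH)
      if (!(key in seen)) { seen[key] = 1; bin[key]++ }
    }
  }
}
END { for (b in bin) print bin[b], b }
' | sort -nr
</pre>

The weighted variant would additionally divide each count by the number of unique forms in the corresponding pardef, which requires reading the pardef inventory from the dictionary and is left out here.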
==Faster coverage testing with frequency lists==

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:

make-freqlist.sh:

<pre>
#!/bin/bash
if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
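The resulting frequency list has one "count, space, token" line per type, roughly like this (made-up numbers):

<pre>
  21847 og
  15726 i
   9153 det
</pre>

This is the format the next script expects on stdin.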
And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:

<pre>
#!/bin/bash
set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' \
    | apertium -f html-noent "$@" \
    | awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
        /[/][*@]/ {
            unknown+=$2
            if(!printed) print "Top unknown tokens:"
            if(++printed<10) print $2,$3
            next
        }
        { known+=$2 }
        END {
            total=known+unknown
            known_pct=100*known/total
            unk_pct=100*unknown/total
            print known_pct" % known of total "total" tokens"
        }'
</pre>
Usage:

<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ <nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>
==coverage.py==

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat (?)