
Calculating coverage

[[Calculer la couverture|En français]]
   
==More involved scripts==

Often it's nice to clean up Wikipedia fluff etc. before coverage testing.

Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).

(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)

wikicat.sh:

<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>

count-tokenized.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc $1 | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>

To find all tokens from a wiki dump:

<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>

To find all tokens with at least one analysis (naïve coverage):

<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>

To find the top unknown tokens:

<pre>
# the [ ] in the sed expression should contain a space and a tab
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ ]*//g' |\
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
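To turn the two counts above into a percentage, a small helper along these lines can be used (only a sketch, not part of the original toolset; the corpus and analyser names are the examples from above, and calc is used the same way as in corpus-stat.sh below):

<pre>
#!/bin/sh
# naive-coverage.sh (hypothetical helper): naive coverage = analysed tokens / all tokens
# Usage: ./naive-coverage.sh nnwiki.cleaned.txt nn-nb.automorf.bin
corpus=$1
analyser=$2
total=$(cat "$corpus" | ./count-tokenized.sh "$analyser" | wc -l)
known=$(cat "$corpus" | ./count-tokenized.sh "$analyser" | grep -v '\/\*' | wc -l)
echo "Coverage: $(calc -p "round(100*$known/$total, 3)") %"
</pre>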
=== Script ready to run ===

corpus-stat.sh:

<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"
# If you don't have calc, change the above line to:
# COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>

Sample output:

<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>

==Simple bidix-trimmed coverage testing==

First install apertium-cleanstream:

<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>

'''Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see [[Migrating tools to GitHub]].'''

Then save this as coverage.sh:

<pre>
#!/bin/bash
mode=$1
outfile=/tmp/$mode.clean
apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head
</pre>

And run it like

<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>

(The bidix-unknown count should always be 0 if your pair uses [[lt-trim|automatic analyser trimming]].)
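To see why the three grep patterns in coverage.sh work, it helps to know roughly what the cleaned stream looks like: apertium-cleanstream -n prints one token per line in the usual Apertium stream format. Schematically, with made-up forms (the exact shape depends on your pair), the three kinds of lines are:

<pre>
^dog<n><sg>/dog<n><sg>$       known everywhere: only counted in $total
^xyzzy/*xyzzy$                unknown to the analyser: matches '/\*'
^dog<n><sg>/@dog<n><sg>$      known to the analyser but missing from the bidix: matches '/@'
</pre>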

==TODO: paradigm-coverage (less naïve)==

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains

<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>

then the output would have

<pre>
3 mus<n><f>
2 muse<vblex>
</pre>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).

We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:

<pre>
0.33 mus<n><f>
0.75 muse<vblex>
</pre>
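This is still a TODO, but a minimal sketch of the binning step could look like the following (paradigm-bins.sh is a hypothetical name; it bins on the lemma plus first tag only, a simplification of the lemma+mainpos bins described above, counts each input line as one token occurrence, and leaves out the form-count weighting):

<pre>
#!/bin/bash
# paradigm-bins.sh (hypothetical sketch, not an existing tool):
# count, for each lemma + first tag, how many tokens have at least one
# analysis with that lemma+tag. Expects one analysed token per line on
# stdin, e.g. musa/mus<n><f><sg><def>/muse<vblex><past>
sed 's/^\^//; s/\$$//' |                       # tolerate ^...$ wrapping
awk -F'/' '{
    split("", seen)
    for (i = 2; i <= NF; i++) {                # each analysis of this token
        if (match($i, /^[^<]+<[^>]+>/)) {      # lemma plus first tag
            bin = substr($i, RSTART, RLENGTH)
            if (!(bin in seen)) { seen[bin] = 1; count[bin]++ }
        }
    }
}
END { for (b in count) print count[b], b }' | sort -nr
</pre>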

==Faster coverage testing with frequency lists==

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:

make-freqlist.sh:

<pre>
#!/bin/bash

if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
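The result is an ordinary "uniq -c"-style list, one word per line with its count first, e.g. (invented numbers):

<pre>
  12345 og
   9876 i
   8765 det
</pre>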

And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:

<pre>
#!/bin/bash

set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
/[/][*@]/ {
    unknown+=$2
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
    next
}
{
    known+=$2
}
END {
    total=known+unknown
    known_pct=100*known/total
    unk_pct=100*unknown/total
    print known_pct" % known of total "total" tokens"
}'
</pre>

Usage:

<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ < nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>
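The last command prints the most frequent unknown tokens as it encounters them, followed by the summary line from the awk END block; the output looks something like this (numbers invented for illustration):

<pre>
Top unknown tokens:
431 facebook
187 blogg
102 wikipedia
92.3456 % known of total 3456789 tokens
</pre>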

==coverage.py==

https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat.


'''Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see [[Migrating tools to GitHub]].'''

==See also==

* [[Wikipedia dumps]]
* [[Cleanstream]]

==External links==

* [http://wp2txt.rubyforge.org/ wp2txt]
* [https://gist.github.com/2283105 simple script to clean an apertium stream]

[[Category:Documentation]]
