Calculating coverage

From Apertium
Revision as of 19:52, 26 April 2018 by Sushain (talk | contribs)
Jump to navigation Jump to search

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.

En français

Simple bidix-trimmed coverage testing

First install apertium-cleanstream:

svn checkout
cd apertium-cleanstream
sudo cp apertium-cleanstream /usr/local/bin

Then save this as

apertium -d . $mode | apertium-cleanstream -n > $outfile
total=$(grep -c '^\^' $outfile)
unknown=$(grep -c '/\*' $outfile)
bidix_unknown=$(grep -c '/@' $outfile)
known_percent=$(calc -p  "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' $outfile | sort | uniq -c | sort -nr | head

And run it like

cat asm.corpus | bash asm-eng-biltrans

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)

TODO: paradigm-coverage (less naïve)

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains


then output has

3 mus<n><f>
2 muse<vblex>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).

We could also weight these numbers by number of unique forms in the pardef; if the verb pardef has 6 unique forms and then noun only 3, then the above output should be even more skewed:

0.33 mus<n><f>
0.75 muse<vblex>

Faster coverage testing with frequency lists

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:


if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2

tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr

And this script runs your analyser, summing up the frequencies:


set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2

sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
/[/][*@]/ {
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
    print known_pct" % known of total "total" tokens"


$ chmod +x
$ bzcat ~/corpora/nno.txt.bz2 |./ > nno.freqlist
$ <nno.freqlist ./ -d ~/apertium-svn/languages/apertium-nno/ nno-morph is a coverage script that wraps curl and bzcat (?)

See also