Calculating coverage

From Apertium
Revision as of 10:41, 12 July 2017 by Francis Tyers (talk | contribs) (→‎More involved scripts: i don't like supporting this, should be replaced with a simpler example)


Simple bidix-trimmed coverage testing

First install apertium-cleanstream:

svn checkout
cd apertium-cleanstream
sudo cp apertium-cleanstream /usr/local/bin

Then save this as a script:

#!/bin/bash
# The corpus is read from stdin; the first argument is the mode name.
mode=$1
# Temporary file for the cleaned stream (an assumption; any writable path works).
outfile=$(mktemp)
apertium -d . "$mode" | apertium-cleanstream -n > "$outfile"
total=$(grep -c '^\^' "$outfile")
unknown=$(grep -c '/\*' "$outfile")
bidix_unknown=$(grep -c '/@' "$outfile")
known_percent=$(calc -p  "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")
echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' "$outfile" | sort | uniq -c | sort -nr | head

And run it like this, with the corpus on stdin and the mode name as the argument:

cat asm.corpus | bash asm-eng-biltrans

(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
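
To see what the three counts mean without running a full pipeline, the sketch below applies the same patterns to a few hand-written lines in one-token-per-line form. The sample tokens and the function name count_coverage are made up for illustration, and awk does the arithmetic here so that calc is not needed; treat it as a sketch rather than a replacement for the script above.

```shell
#!/bin/bash
# Sketch only: count known, unknown (/*) and bidix-unknown (/@) tokens
# in a stream that has one ^...$ token per line.
# count_coverage is a hypothetical helper name, not part of Apertium.
count_coverage() {
    awk '
        /^\^/   { total++ }      # every token line starts with ^
        /\/\*/  { unknown++ }    # the analyser did not know the word
        /\/@/   { bidix++ }      # the bidix did not know the lemma
        END {
            if (total == 0) { print "no tokens"; exit }
            printf "%.1f %% known (%d unknown, %d bidix-unknown of %d)\n",
                   100 * (total - unknown - bidix) / total, unknown, bidix, total
        }'
}

# Hand-written sample lines, not real analyser output.
printf '%s\n' \
    '^hus<n>/house<n>$' \
    '^*zzyzx/*zzyzx$' \
    '^bil<n>/@bil<n>$' \
    '^raud<adj>/red<adj>$' | count_coverage
```

On this sample, two of the four tokens count as known, so the function reports 50.0 % known with 1 unknown and 1 bidix-unknown.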

TODO: paradigm-coverage (less naïve)

On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos, so if the analysed corpus contains


then output has

3 mus<n><f>
2 muse<vblex>

and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).
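
The binning step could be sketched roughly as follows. It assumes a disambiguated corpus with one ^surface/lemma<tags>$ unit per line and a single analysis each. "Main pos" is approximated as the first tag plus a gender tag if one follows directly. The function name bin_lemma_pos, the gender tag list and the sample lines are all assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: bin a disambiguated corpus by lemma + main part of speech.
# Assumes one ^surface/lemma<tags>$ unit per line, one analysis per unit.
# bin_lemma_pos is a hypothetical helper name; the gender tag list is a guess.
bin_lemma_pos() {
    # Keep the lemma, the first tag, and an immediately following gender tag.
    sed -E 's%^\^[^/]*/([^<$]*)(<[^>]*>)(<(m|f|nt|mf)>)?.*%\1\2\3%' |
    sort | uniq -c | sort -nr
}

# Hand-written sample lines, not real analyser output.
printf '%s\n' \
    '^mus/mus<n><f><sg><ind>$' \
    '^musa/mus<n><f><sg><def>$' \
    '^mus/mus<n><f><pl><ind>$' \
    '^muser/muse<vblex><pres>$' \
    '^musa/muse<vblex><pret>$' | bin_lemma_pos
```

Lines without any tag (for example unknown words) pass through the sed step unchanged, so they show up as their own bins and are easy to filter out afterwards.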

We could also weight these numbers by the number of unique forms in the pardef, dividing each count by it. If the verb pardef has 6 unique forms and the noun only 3, then the above output becomes even more skewed:

1.00 mus<n><f>
0.33 muse<vblex>

Faster coverage testing with frequency lists

If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and add together the frequencies. This script does some very stupid tokenisation and creates a frequency list:


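#!/bin/bash
# Refuse to run interactively: the corpus must be piped in.
if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi

# One token per line, then count and sort by descending frequency.
tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr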
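
To get a feel for the list format, the same pipeline can be tried on a short inline sample. In this sketch the pipeline is wrapped in a function named make_freqlist, which is a made-up name, and the sample sentence is arbitrary.

```shell
#!/bin/bash
# Sketch only: the tokenise-and-count pipeline wrapped in a function
# so it can be tried on a small sample. make_freqlist is a hypothetical name.
make_freqlist() {
    tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
}

echo 'to be, or not to be: to live' | make_freqlist
```

Each output line is a count followed by a token, most frequent first ("to" with 3, then "be" with 2). The count is padded with leading spaces by uniq -c.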

And this script runs your analyser, summing up the frequencies:


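#!/bin/bash

set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

# Wrap each count in <apertium-notrans> so it passes through untouched,
# and append " ." so that every word is followed by a sentence boundary.
sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
apertium -f html-noent "$@" |
awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
# $2 is the frequency, $3 the analysis of the word.
{ total += $2 }
/[/][*@]/ {
    unknown += $2
    if(!printed) print "Top unknown tokens:"
    if(++printed<10) print $2,$3
}
END {
    if (total > 0) known_pct = 100 * (total - unknown) / total
    print known_pct" % known of total "total" tokens"
}'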
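
The sed and awk stages can be checked without an installed language pair. The sketch below isolates them as two functions and feeds the awk stage hand-written lines in place of the apertium call. The function names wrap_counts and sum_coverage are made up, and the shape of the analysed lines is an assumption inferred from the field separator, not captured output.

```shell
#!/bin/bash
# Sketch only: the sed and awk stages with the apertium call replaced by
# hand-written lines. wrap_counts and sum_coverage are hypothetical names.
wrap_counts() {
    sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%'
}

sum_coverage() {
    awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
        { total += $2 }                # $2 is the frequency
        /\/[*@]/ { unknown += $2 }     # the line carries an unknown-word mark
        END { if (total > 0) printf "%d of %d tokens known\n", total - unknown, total }'
}

# What the analyser receives for the frequency-list line "      3 to":
printf '      3 to\n' | wrap_counts

# Assumed shape of the analyser output (hand-written, not real output):
printf '%s\n' \
    '<apertium-notrans>3</apertium-notrans>^to/to<pr>$ ^./.<sent><clb>$' \
    '<apertium-notrans>1</apertium-notrans>^zzyzx/*zzyzx$ ^./.<sent><clb>$' | sum_coverage
```

The first command prints the wrapped line `<apertium-notrans>3</apertium-notrans>to .`. The second sums the frequencies rather than the lines, so the sample counts as 3 of 4 tokens known.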


$ chmod +x
$ bzcat ~/corpora/nno.txt.bz2 |./ > nno.freqlist
$ <nno.freqlist ./ -d ~/apertium-svn/languages/apertium-nno/ nno-morph

More involved scripts

There is also a coverage script that wraps curl and bzcat (?).

See also