Measuring coverage of HFST transducer

The following script measures the coverage of an HFST transducer on a corpus and prints the 20 most frequent unknown words (the top of the "hitparade"):

#!/bin/bash

# Settings to edit: language code, analyser directory, and bzip2-compressed corpus.
LG=abc
ANALYSERDIR=/path/to/analyser
CORPUS=/path/to/corpus/corpus.txt.bz2
ANALYSER=$ANALYSERDIR/$LG.automorf.hfst

TMPCORPUS=/tmp/$LG.corpus.txt

bzcat "$CORPUS" > "$TMPCORPUS"

echo "Generating hitparade (might take a bit!)"
apertium-destxt < "$TMPCORPUS" | hfst-proc -w "$ANALYSER" | apertium-retxt | sed 's/\$\s*/\$\n/g' > "/tmp/$LG.parade.txt"

echo "TOP UNKNOWN WORDS:"

grep '\*' "/tmp/$LG.parade.txt" | sort | uniq -c | sort -rn | head -n20

TOTAL=$(wc -l < "/tmp/$LG.parade.txt")
KNOWN=$(grep -vc '\*' "/tmp/$LG.parade.txt")
UNKNOWN=$(grep -c '\*' "/tmp/$LG.parade.txt")

# calc prints its result with leading whitespace; strip it.
PERCENTAGE=$(calc "$KNOWN/$TOTAL" | sed 's/[[:space:]]//g')

echo "coverage: $KNOWN / $TOTAL ($PERCENTAGE)"
echo "remaining unknown forms: $UNKNOWN"
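
The script above relies on the calc command, which may not be installed on every system. As a possible alternative, the counting and the division can be done in a single awk pass. This is only a minimal sketch, not part of the original script: the function name coverage is hypothetical, and it assumes the parade file holds one ^surface/analysis$ unit per line with unknown words marked by *, as produced above. Unlike the script, it counts only lines containing ^, so blank lines are not counted as known tokens, and it reports a percentage instead of a ratio.

```shell
#!/bin/bash

# Hypothetical helper (not from the original script): print coverage for a
# file with one analysed token per line, where unknown tokens contain '*'.
coverage() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk '
        /\^/ {                              # only lines holding a lexical unit
            total++
            if (index($0, "*")) unknown++   # "*" marks an unknown word
        }
        END {
            known = total - unknown
            if (total == 0) {               # avoid dividing by zero
                print "coverage: 0 / 0 (n/a)"
                exit
            }
            printf "coverage: %d / %d (%.2f%%)\n", known, total, 100 * known / total
        }
    ' "$1"
}
```

It could be called as coverage "/tmp/$LG.parade.txt" once the hitparade file has been generated.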