Measuring coverage of HFST transducer
Revision as of 03:14, 6 September 2019 by Firespeaker
Here's a script that measures the coverage of an HFST transducer against a corpus and lists the 20 most frequent unknown forms (the top of the hitparade):
#!/bin/bash
# Language code and paths: adjust these for your setup.
LG=abc
ANALYSERDIR=/path/to/analyser
CORPUS=/path/to/corpus/corpus.txt.bz2
ANALYSER="$ANALYSERDIR/$LG.automorf.hfst"
TMPCORPUS="/tmp/$LG.corpus.txt"
PARADE="/tmp/$LG.parade.txt"

# Decompress the corpus to a temporary file.
bzcat "$CORPUS" > "$TMPCORPUS"

echo "Generating hitparade (might take a bit!)"
# Analyse the corpus and put one analysed token (ending in $) on each line.
apertium-destxt < "$TMPCORPUS" | hfst-proc -w "$ANALYSER" | apertium-retxt | sed 's/\$\s*/\$\n/g' > "$PARADE"

echo "TOP UNKNOWN WORDS:"
# Unknown forms are marked with *; show the 20 most frequent ones.
grep '\*' "$PARADE" | sort | uniq -c | sort -rn | head -n20

TOTAL=$(wc -l < "$PARADE")
KNOWN=$(grep -vc '\*' "$PARADE")
UNKNOWN=$(grep -c '\*' "$PARADE")
# Needs the calc utility; the sed call strips the whitespace around its output.
PERCENTAGE=$(calc "$KNOWN/$TOTAL" | sed 's/[[:space:]]//g')

echo "coverage: $KNOWN / $TOTAL ($PERCENTAGE)"
echo "remaining unknown forms: $UNKNOWN"
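If calc is not installed, the coverage figure can also be computed with awk. The following is only a minimal sketch, not part of the script above: the function name coverage_percent and the sample tokens are made up for illustration, and it assumes the same file layout the script produces (one analysed token per line, unknown forms marked with *).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print coverage as
# a percentage for a file with one analysed token per line, where unknown
# tokens contain a '*'.
coverage_percent() {
    awk '
        /\*/ { unknown++ }              # lines containing * are unknown forms
        END {
            if (NR == 0) { print "0.00"; exit }   # avoid dividing by zero
            printf "%.2f\n", 100 * (NR - unknown) / NR
        }
    ' "$1"
}

# Invented sample data in the shape the script writes to its parade file.
SAMPLE=$(mktemp)
printf '%s\n' '^the/the<det>$' '^cat/cat<n><sg>$' '^blorp/*blorp$' '^sat/sit<v><past>$' > "$SAMPLE"
coverage_percent "$SAMPLE"   # three of the four sample tokens are known
rm -f "$SAMPLE"
```

Doing the division in awk avoids the extra dependency and the whitespace clean-up that the calc output needs, and it gives a percentage rather than a fraction.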