Measuring coverage of HFST transducer

Revision as of 03:14, 6 September 2019 by Firespeaker (Talk | contribs)

Here's a script that measures the coverage of an HFST transducer over a corpus and prints the 20 most frequent unknown forms (the top of the hitparade):

#!/bin/bash

LG=abc
ANALYSERDIR=/path/to/analyser
CORPUS=/path/to/corpus/corpus.txt.bz2
ANALYSER=$ANALYSERDIR/$LG.automorf.hfst

TMPCORPUS=/tmp/$LG.corpus.txt

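# Decompress the corpus to a temporary plain-text file
bzcat "$CORPUS" > "$TMPCORPUS"

echo "Generating hitparade (might take a bit!)"
# Analyse the corpus, then put each analysed token (ending in $) on its own line
apertium-destxt < "$TMPCORPUS" | hfst-proc -w "$ANALYSER" | apertium-retxt | sed 's/\$\s*/\$\n/g' > "/tmp/$LG.parade.txt"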

echo "TOP UNKNOWN WORDS:"

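# Unknown forms are marked with *; count each distinct form and show the 20 most frequent
grep '\*' "/tmp/$LG.parade.txt" | sort | uniq -c | sort -rn | head -n20

# Line counts: all tokens, analysed tokens, and unknown tokens
TOTAL=$(wc -l < "/tmp/$LG.parade.txt")
KNOWN=$(grep -vc '\*' "/tmp/$LG.parade.txt")
UNKNOWN=$(grep -c '\*' "/tmp/$LG.parade.txt")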

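# Coverage as a fraction of 1 (needs the calc utility); strip the whitespace around calc's output
PERCENTAGE=$(calc "$KNOWN/$TOTAL" | tr -d '[:space:]')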

echo "coverage: $KNOWN / $TOTAL ($PERCENTAGE)"
echo "remaining unknown forms: $UNKNOWN"
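
The counting step at the end of the script can also be done in one pass with awk, which avoids the dependency on calc. The following is only a sketch under some assumptions: the function name coverage is made up for illustration, the sample tokens are invented rather than real analyser output, and the input is assumed to have one analysed token per line with unknown forms marked by *, as in the hitparade file above. Unlike wc -l, it skips blank lines, and it reports coverage as a percentage rather than a fraction; both are choices of this sketch, not of the original script.

```shell
#!/bin/bash

# coverage FILE: hypothetical helper that prints coverage statistics for a
# hitparade file (one analysed token per line, unknown forms contain '*').
coverage() {
    awk '
        NF == 0 { next }        # skip blank lines (a choice of this sketch)
        { total++ }             # every remaining line is one token
        /\*/ { unknown++ }      # tokens marked with * have no analysis
        END {
            if (total == 0) { print "coverage: 0 / 0 (n/a)"; exit }
            known = total - unknown
            printf "coverage: %d / %d (%.2f%%)\n", known, total, 100 * known / total
            printf "remaining unknown forms: %d\n", unknown
        }
    ' "$1"
}

# Demo on a small invented sample (not real analyser output)
SAMPLE=$(mktemp)
printf '%s\n' \
    '^cats/cat<n><pl>$' \
    '^sleep/sleep<vblex><pres>$' \
    '^xyzzy/*xyzzy$' \
    '' \
    '^./.<sent>$' > "$SAMPLE"

coverage "$SAMPLE"
# prints:
# coverage: 3 / 4 (75.00%)
# remaining unknown forms: 1

rm -f "$SAMPLE"
```

To use it with the script above, the call would be coverage /tmp/$LG.parade.txt in place of the TOTAL, KNOWN, UNKNOWN and PERCENTAGE lines.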