Calculating coverage
Revision as of 05:00, 1 May 2009 by Jacob Nordfalk (Talk | contribs)
Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline in the scripts below.)
wikicat.sh:
#!/bin/sh
# Clean up a wiki dump for running through apertium-destxt.

# awk prints full lines, so make sure each HTML element is on its own line
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |
# keep only the stuff between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
sed 's/\./ /g' |
# wiki markup: retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# remove entities
sed 's/&.*;/ /g' |
# put a space instead of punctuation
sed 's/[;:?,]/ /g' |
# keep only lines starting with a capital letter, removing tables with
# style info etc. (your alphabet here)
grep '^[[:blank:]]*[A-ZÆØÅ]'
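To see what the markup-stripping steps do without downloading a dump, the sed expressions from wikicat.sh can be tried on a single line. This is only a sketch: the function name strip_wiki_markup and the sample line are made up for illustration, and the bzcat, awk and grep stages are left out.

```shell
#!/bin/sh
# strip_wiki_markup is a hypothetical helper name; the sed expressions are
# the ones from wikicat.sh, applied to text that is already extracted.
strip_wiki_markup() {
    sed 's/\./ /g' |
    sed 's/\[\[[^|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
    sed 's/&.*;/ /g' |
    sed 's/[;:?,]/ /g'
}

# Made-up sample line, not taken from a real dump.
printf '%s\n' 'See [[foo|bar]] and [[fie]], &amp; more.' | strip_wiki_markup
```

Apart from runs of spaces, the line comes out as "See bar and fie more". Both `[^|]*` and `&.*;` are greedy, so a line with several links or entities can lose the text between them; for rough token counts that is probably acceptable.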
count-tokenized.sh:
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Print one analysed token per line, so that the number of tokenised words
# in the corpus can be counted with wc -l.
# Usage: ./count-tokenized.sh <analyser.bin> < corpus.txt
apertium-destxt | lt-proc "$1" | apertium-retxt |
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
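The two sed calls are the part that turns the analyser output into one token per line. A minimal sketch of just that step, runnable without Apertium installed, is given here; the function name one_token_per_line and the sample analyses (including their tags) are invented for illustration.

```shell
#!/bin/sh
# one_token_per_line is a hypothetical helper name wrapping the two seds
# from count-tokenized.sh.
one_token_per_line() {
    # 1st sed: drop whatever sits between the end of one unit ($) and the
    #          start of the next (^), leaving "$^"
    # 2nd sed: put an (escaped) newline between "$" and "^"
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

# Made-up analyser output: two known words and one unknown word (marked *).
printf '%s\n' '^the/the<det>$ ^cat/*cat$ ^sat/sit<vblex><past>$' |
    one_token_per_line
```

Each `^...$` unit ends up on its own line, which is why `wc -l` on the result gives the token count.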
To find all tokens from a wiki dump:
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
To find all tokens with at least one analysis (naïve coverage):
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '/\*' | wc -l
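Naïve coverage is then the second count divided by the first. A small sketch of that division using awk, which is usually available where `calc` is not; the function name coverage is an assumption, and the figures in the example call are the ones from the sample output at the end of this page.

```shell
#!/bin/sh
# coverage KNOWN TOTAL: print naive coverage as a percentage with one decimal.
# (hypothetical helper, not part of the original scripts)
coverage() {
    awk -v known="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "0.0"; exit }   # avoid dividing by zero
        printf "%.1f\n", 100 * known / total
    }'
}

coverage 450255 478187
```

With those two counts it prints 94.2, matching the sample output.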
To find the top unknown tokens:
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |
    sed 's/[[:blank:]]*//g' |    # strip tabs and spaces
    grep '/\*' | sort -f | uniq -c | sort -gr | head
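The counting part of that pipeline can be tried on a handful of tokens. This is a sketch only: top_unknown is a hypothetical name, the input tokens are made up, and `sort -nr` is swapped in for `sort -gr` because not every sort has `-g`.

```shell
#!/bin/sh
# top_unknown [N]: read one token per line, keep the unknown ones (marked
# with /*), and print the N most frequent with their counts (default 10).
top_unknown() {
    grep '/\*' | sort -f | uniq -c | sort -nr | head -n "${1:-10}"
}

# Made-up tokens in the one-per-line format produced by count-tokenized.sh.
printf '%s\n' '^Apollo/*Apollo$' '^the/the<det>$' '^Abu/*Abu$' \
    '^Apollo/*Apollo$' | top_unknown 5
```

Here the known token `^the/the<det>$` is filtered out, and `^Apollo/*Apollo$` is listed first with a count of 2.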
Script ready to run
corpus-stat.sh:
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
# Needs the external `calc' program for the coverage figure.

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt |
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > "$F"

NUMWORDS=`cat "$F" | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat "$F" | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat "$F" | grep '\*' | sort -f | uniq -c | sort -gr | head -10
Sample output:
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$