Calculating coverage
[[Category:Documentation]]
Revision as of 04:00, 1 May 2009
Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
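As a minimal illustration of that workaround (the input string here is invented), this substitution splits a line on semicolons and runs under both GNU sed and the BSD sed shipped with Mac OS X:

```shell
# Replace each ';' with an escaped literal newline -- portable, unlike
# 's/;/\n/g', which BSD sed treats as a literal 'n' in the replacement.
printf 'one;two;three\n' | sed 's/;/\
/g'
# prints:
# one
# two
# three
```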
wikicat.sh:
<pre>
#!/bin/sh
# clean up wiki for running through apertium-destxt
# awk prints full lines, make sure each html elt has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk ' /<text.*>/,/<\/text>/ { print $0 } ' |\
sed 's/\./ /g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# remove entities
sed 's/&.*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>
count-tokenized.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
apertium-destxt | lt-proc $1 | apertium-retxt |\
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
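lt-proc emits all tokens of a line as ^surface/analysis$ units on a single line; the two seds put each unit on its own line, so that wc -l counts tokens. A small sketch with a hand-written lt-proc-style line (the analyses are invented for illustration):

```shell
# First sed deletes the text between tokens ('$...^' becomes '$^'),
# second sed breaks '$^' into '$' + newline + '^': one token per line.
printf '^This/This<det>$ ^is/be<vbser>$ ^a/a<det>$\n' |
  sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' | wc -l
# prints "3" -- three tokens, three lines
```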
To find all tokens from a wiki dump:
<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
To find the top unknown tokens:
<pre>
# the sed character class should contain a space and a tab
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |\
  sed 's/[ ]*//g' | grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
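Unknown tokens are the ones whose analysis begins with '*' (e.g. ^Apollo/*Apollo$), which is exactly what the grep '\/\*' selects. A small sketch with made-up analyser output:

```shell
# Two unknown tokens (analysis starts with '*') and one known token.
printf '^Apollo/*Apollo$\n^is/be<vbser>$\n^Bakr/*Bakr$\n' |
  grep '\/\*' | wc -l
# prints "2" -- only the starred (unknown) tokens match
```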
Script ready to run
corpus-stat.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"
F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt |\
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
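The COVERAGE line depends on the non-standard calc utility; where that is not installed, the same one-decimal rounding can be done with plain awk. A sketch using the counts from the sample output below:

```shell
NUMWORDS=478187
NUMKNOWNWORDS=450255
# printf "%.1f" rounds 94.1587... to one decimal place
COVERAGE=$(awk "BEGIN { printf \"%.1f\", $NUMKNOWNWORDS / $NUMWORDS * 100 }")
echo "Coverage: $COVERAGE %"   # prints "Coverage: 94.2 %"
```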
Sample output:
<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>