Calculating coverage
Revision as of 03:59, 1 May 2009
Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
wikicat.sh:
#!/bin/sh
# clean up wiki for running through apertium-destxt

# awk prints full lines, make sure each html elt has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |
# want only stuff between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
sed 's/\./ /g' |
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
# (] is excluded so that a plain link earlier on the line is not swallowed)
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# remove entities (greedy: drops everything from the first & to the last ; on a line)
sed 's/&.*;/ /g' |
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
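The link-stripping step is the easiest one to get subtly wrong, so it helps to try it on a hand-written line. The following is only a sketch: the function name `strip_wikilinks` and the sample sentence are made up for illustration, and the first expression excludes `]` from the bracket expression so that a plain link earlier on the line is not swallowed by a piped link later on.

```shell
#!/bin/sh
# strip_wikilinks is a hypothetical helper (not part of wikicat.sh) that
# wraps the three link-related sed expressions so they can be run alone.
strip_wikilinks() {
    # 1) drop the "[[target|" part of piped links
    # 2) drop every closing "]]"
    # 3) drop every remaining opening "[["
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g'
}

# Invented sample: a plain link followed by a piped link.
printf '%s\n' 'See [[fie]] and [[foo|bar]]' | strip_wikilinks
```

For the sample line this leaves `See fie and bar`, i.e. the visible text of both links.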
count-tokenized.sh:
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Print one analysed token per line; count them with wc -l to get
# the number of tokenised words in the corpus.
apertium-destxt | lt-proc "$1" | apertium-retxt |
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
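What the two `sed` steps do can be checked without Apertium installed by feeding them a line in the analyser's `^surface/analysis$` format. This is a minimal sketch: `split_tokens` is a hypothetical name, and the sample line is invented.

```shell
#!/bin/sh
# split_tokens is a hypothetical helper holding only the two sed steps of
# count-tokenized.sh, so they can be tried without lt-proc.
split_tokens() {
    # Step 1: collapse whatever sits between the end of one token ($)
    #         and the start of the next (^) into "$^".
    # Step 2: put a literal newline between "$" and "^".
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

# Invented analyser output: one known token, one unknown (marked with *).
printf '%s\n' '^the/the<det><def>$ ^florp/*florp$' | split_tokens
```

The sample comes out as two lines, one token each, which is what makes `wc -l` and `grep` usable for counting.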
To find all tokens from a wiki dump:
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
To find all tokens with at least one analysis (naïve coverage):
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '/\*' | wc -l
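Naïve coverage is the second count divided by the first. Both counts can be taken in one pass over the token stream, which avoids running the analyser twice. The sketch below assumes one token per line on stdin, as printed by count-tokenized.sh; `coverage_pct` is a hypothetical name and the sample tokens are invented.

```shell
#!/bin/sh
# coverage_pct is a hypothetical helper: it reads one token per line and
# prints known/total as a percentage with one decimal place.
# A token counts as unknown when it contains "/*".
coverage_pct() {
    awk '
        /\/\*/ { unknown++ }
        END {
            if (NR == 0) { print "no tokens"; exit 1 }
            printf "%.1f\n", (NR - unknown) / NR * 100
        }'
}

# Invented sample: three known tokens and one unknown token.
printf '%s\n' '^a/a<det>$' '^b/b<n>$' '^c/c<vblex>$' '^d/*d$' | coverage_pct
```

With three known tokens out of four, the sample prints 75.0.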
To find the top unknown tokens:
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |
    sed 's/[[:blank:]]*//g' |   # tab or space
    grep '/\*' | sort -f | uniq -c | sort -gr | head
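Note that `sort -f` only folds case for ordering; `uniq -c` still counts `Florp` and `florp` as separate entries. If case variants should be counted together, one option is `uniq -ci`. This is a sketch under assumptions: `top_unknown` is a hypothetical name, the sample tokens are invented, and `-i` is a GNU/BSD extension to `uniq` rather than part of POSIX.

```shell
#!/bin/sh
# top_unknown is a hypothetical helper: it reads one token per line and
# prints the most frequent unknown tokens, merging case variants.
# Optional first argument: how many entries to show (default 10).
top_unknown() {
    grep '/\*' | sort -f | uniq -ci | sort -nr | head -n "${1:-10}"
}

# Invented sample: "Florp" and "florp" should be counted as one entry.
printf '%s\n' '^Florp/*Florp$' '^florp/*florp$' '^blarg/*blarg$' '^the/the<det>$' |
    top_unknown 5
```

Only the first spelling of each group is printed next to its count, so the list shows one representative per word.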
Script ready to run
corpus-stat.sh
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin |apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F
NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
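corpus-stat.sh depends on four external commands (`apertium-destxt`, `lt-proc`, `apertium-retxt` and `calc`), and `calc` in particular is not installed everywhere. A check along the following lines could be run first; it is only a sketch, and `require_tools` is a hypothetical helper name, not something the script defines.

```shell
#!/bin/sh
# require_tools is a hypothetical helper: it names every listed command
# that cannot be found on PATH and returns non-zero if any is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The four external commands that corpus-stat.sh calls.
require_tools apertium-destxt lt-proc apertium-retxt calc ||
    echo "install the tools listed above before running corpus-stat.sh" >&2
```

`command -v` is specified by POSIX, so the check itself should work in any `sh`.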