Calculating coverage
[[Category:Documentation]]
Revision as of 03:59, 1 May 2009
Notes on calculating coverage from Wikipedia dumps (based on [[Asturian#Calculating coverage]]).

(Mac OS X `sed' doesn't allow \n in replacements, so an actual (escaped) newline is used instead...)
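The portability difference can be seen on a made-up sample: GNU sed accepts \n in the replacement text, but BSD/Mac OS X sed would print a literal n there, while a backslash-escaped real newline works in both.

<pre>
# Split a line on semicolons; the replacement contains a real newline,
# escaped with a backslash, which both GNU and BSD sed accept.
printf 'a;b;c\n' | sed 's/;/\
/g'
# prints a, b and c on three separate lines
</pre>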
wikicat.sh:
<pre>
#!/bin/sh
# Clean up a Wikipedia dump for running through apertium-destxt.
# awk prints full lines, so make sure each HTML element has one:
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# We want only the text between <text...> and </text>:
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Wiki markup: retain "bar" and "fie" from [[foo|bar]] [[fie]]:
sed 's/\[\[[^|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# Remove entities:
sed 's/&.*;/ /g' |\
# ...and put a space instead of punctuation:
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ 	]*[A-ZÆØÅ]' # Your alphabet here
</pre>
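The awk range pattern used above, /&lt;text.*&gt;/,/&lt;\/text&gt;/, prints every line from one matching <text...> through the next matching </text>; a quick sketch with made-up dump input:

<pre>
# awk's /start/,/end/ range prints the matching lines and everything
# between them (the <page> wrapper lines are dropped).
printf '<page>\n<text xml:space="preserve">\nHei verda\n</text>\n</page>\n' |
  awk '/<text.*>/,/<\/text>/ { print $0 }'
</pre>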
count-tokenized.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc $1 | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
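The two seds turn the one-line stream of analyses from lt-proc into one token per line: the first deletes whatever lies between a closing $ and the next opening ^, and the second breaks each resulting $^ boundary with a newline. A sketch with a made-up two-token analysis:

<pre>
# Two analysed tokens on one line, as lt-proc emits them:
printf '^the/the<det><def><sp>$ ^cat/cat<n><sg>$\n' |
  sed 's/\$[^^]*\^/$^/g' |  # "$ ^" between tokens becomes "$^"
  sed 's/\$\^/$\
^/g'                        # then "$^" is split across a newline
# prints each ^...$ token on its own line
</pre>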
To find all tokens from a wiki dump:
<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
To find the top unknown tokens:
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |\
# tab or space
sed 's/[ 	]*//g' |\
grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
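In the last pipeline, sort | uniq -c collapses duplicate lines into counts, and sort -gr puts the most frequent first; a sketch on made-up unknown-word lines:

<pre>
# Count duplicate lines and list them by descending frequency:
# *foo (3 times) comes first, then *bar (2), then *baz (1).
printf '*foo\n*bar\n*foo\n*foo\n*bar\n*baz\n' |
  sort -f | uniq -c | sort -gr | head
</pre>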
== Script ready to run ==
corpus-stat.sh
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage
# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin | apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
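The script relies on the `calc' utility, which is not installed everywhere; if it is missing, awk can produce the same one-decimal percentage. A sketch with made-up counts standing in for $NUMWORDS and $NUMKNOWNWORDS:

<pre>
# Hypothetical counts, for illustration only:
NUMWORDS=12345
NUMKNOWNWORDS=10897
# printf "%.1f" rounds to one decimal, like round(x*1000)/10 in calc.
COVERAGE=`awk "BEGIN { printf \"%.1f\", $NUMKNOWNWORDS / $NUMWORDS * 100 }"`
echo "Coverage: $COVERAGE %"   # Coverage: 88.3 %
</pre>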

