Difference between revisions of "Calculating coverage"
* [http://wp2txt.rubyforge.org/ wp2txt]
* [https://gist.github.com/2283105 simple script to clean an apertium stream]
[[Category:Documentation]]
Revision as of 13:42, 8 May 2012
Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X sed doesn't allow \n in replacements, so the scripts below use an actual (escaped) newline instead.)
wikicat.sh:
<pre>
#!/bin/sh
# Clean up wikitext for running through apertium-destxt

# awk prints full lines, make sure each html element has one
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |\
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |\
sed 's/\./ /g' |\
# Drop all transwiki links
sed 's/\[\[\([a-z]\{2,3\}\|bat-smg\|be-x-old\|cbk-zam\|fiu-vro\|map-bms\|nds-nl\|roa-rup\|roa-tara\|simple\|zh-classical\|zh-min-nan\|zh-yue\):[^]]\+\]\]//g' |\
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |\
# wiki markup, retain `bar fie' from [http://foo bar fie] and remove [http://foo]
sed 's/\[http[^ ]*\([^]]*\)\]/\1/g' |\
# remove entities
sed 's/&[^;]*;/ /g' |\
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |\
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]' # Your alphabet here
</pre>
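To see what the link-stripping steps of wikicat.sh do, they can be tried on a single line of wikitext. This is only a sketch: the function name strip_wiki_links and the sample sentence are made up for illustration, and only the four link-related sed expressions from the script are used (the transwiki and punctuation steps are left out).

```shell
#!/bin/sh
# strip_wiki_links: hypothetical helper that applies only the link-related
# sed steps from wikicat.sh to whatever arrives on stdin.
strip_wiki_links() {
    # [[foo|bar]] -> bar]]   (drop the link target, keep the label)
    sed 's/\[\[[^]|]*|//g' |
    # remove the remaining ]] and [[ so [[fie]] -> fie
    sed 's/\]\]//g' | sed 's/\[\[//g' |
    # [http://foo bar fie] -> " bar fie", and [http://foo] -> ""
    sed 's/\[http[^ ]*\([^]]*\)\]/\1/g'
}

# Made-up sample line, not taken from a real dump.
printf '%s\n' 'See [[foo|bar]] and [[fie]] at [http://example.org the site].' |
    strip_wiki_links
```

The external-link step keeps the space that separated the URL from its label, so the result has a doubled space before "the site"; that is harmless here because the text is tokenised afterwards.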
count-tokenized.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
# $1 is the compiled analyser, e.g. nn-nb.automorf.bin
apertium-destxt | lt-proc "$1" | apertium-retxt |\
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
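The effect of the two sed calls is easier to see on a tiny hand-written stream. The sketch below wraps them in a function (split_tokens is a made-up name) and feeds it one line shaped like lt-proc output; the sample tokens and tags are invented for illustration, not real analyser output.

```shell
#!/bin/sh
# split_tokens: hypothetical wrapper around the two sed calls from
# count-tokenized.sh. It puts every ^...$ lexical unit on its own line.
split_tokens() {
    # 1) delete whatever sits between the end of one unit ($) and the
    #    start of the next (^), leaving "$^"
    sed 's/\$[^^]*\^/$^/g' |
    # 2) turn "$^" into "$", newline, "^" using an escaped literal newline,
    #    which works with both GNU and Mac OS X sed
    sed 's/\$\^/$\
^/g'
}

# Invented sample: one known and one unknown token on a single line.
printf '%s\n' '^a/a<det>$ ^dog/*dog$' | split_tokens
# prints:
# ^a/a<det>$
# ^dog/*dog$
```

With one unit per line, wc -l gives the token count and grep can pick out the unknown units marked with /*.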
To find all tokens from a wiki dump:
<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
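Naïve coverage is the second count divided by the first. A minimal sketch of doing both counts in one pass, assuming one token per line as produced by count-tokenized.sh; the function name coverage_pct and the sample tokens are made up for illustration.

```shell
#!/bin/sh
# coverage_pct: hypothetical helper. Reads one lexical unit per line on
# stdin and prints the percentage of lines without an unknown-word mark.
coverage_pct() {
    awk '
        !/\/\*/ { known++ }     # unknown units look like ^word/*word$
        END {
            if (NR == 0) { print "0.0"; exit }   # avoid dividing by zero
            printf "%.1f\n", known * 100 / NR
        }
    '
}

# Invented sample: two known and two unknown tokens.
printf '%s\n' '^the/the<det>$' '^Apollo/*Apollo$' '^is/be<vbser>$' '^de/*de$' |
    coverage_pct
# prints: 50.0
```

On a real corpus this would be used as `cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | coverage_pct`. Like the wc -l counts above, it treats every line as a token.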
To find the top unknown tokens:
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin |
  sed 's/[[:blank:]]*//g' |   # strip tabs and spaces
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
Script ready to run
corpus-stat.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Example use:
# zcat corpa/en.crp.txt.gz | sh corpus-stat.sh

#CMD="cat corpa/en.crp.txt"
CMD="cat"

F=/tmp/corpus-stat-res.txt

# Calculate the number of tokenised words in the corpus:
# for some reason putting the newline in directly doesn't work, so two seds
$CMD | apertium-destxt | lt-proc en-eo.automorf.bin |apertium-retxt | sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g' > $F

NUMWORDS=`cat $F | wc -l`
echo "Number of tokenised words in the corpus: $NUMWORDS"

# Calculate the number of words that are not unknown
NUMKNOWNWORDS=`cat $F | grep -v '\*' | wc -l`
echo "Number of known words in the corpus: $NUMKNOWNWORDS"

# Calculate the coverage
COVERAGE=`calc "round($NUMKNOWNWORDS/$NUMWORDS*1000)/10"`
echo "Coverage: $COVERAGE %"
#If you don't have calc, change the above line to:
#COVERAGE=$(perl -e 'print int($ARGV[0]/$ARGV[1]*1000)/10;' $NUMKNOWNWORDS $NUMWORDS)

# Show the top-ten unknown words.
echo "Top unknown words in the corpus:"
cat $F | grep '\*' | sort -f | uniq -c | sort -gr | head -10
</pre>
Sample output:
<pre>
$ zcat corpa/en.crp.txt.gz | sh corpus-stat.sh
Number of tokenised words in the corpus: 478187
Number of known words in the corpus: 450255
Coverage: 94.2 %
Top unknown words in the corpus:
    191 ^Apollo/*Apollo$
    104 ^Aramaic/*Aramaic$
     91 ^Alberta/*Alberta$
     81 ^de/*de$
     80 ^Abu/*Abu$
     63 ^Bakr/*Bakr$
     62 ^Agassi/*Agassi$
     59 ^Carnegie/*Carnegie$
     58 ^Agrippina/*Agrippina$
     58 ^Achilles/*Achilles$
     56 ^Adelaide/*Adelaide$
</pre>