Difference between revisions of "Calculating coverage"
m (User:Unhammer/Coverage moved to Calculating coverage)
(will be faster this way.)
Revision as of 09:46, 26 March 2009
Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X `sed' doesn't allow \n in replacements, so I just use an actual (escaped) newline...)
wikicat.sh:
<pre>
#!/bin/sh
# Clean up a wiki dump for running through apertium-destxt.

# awk prints full lines, so first make sure each HTML element is on its own line
# (the sed replacements contain an escaped literal newline).
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |
# want only stuff between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
sed 's/\./ /g' |
# wiki markup: retain bar and fie from [[foo|bar]] [[fie]]
# ([^]|]* stops the match from running across the end of an unpiped link)
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# remove entities
sed 's/&.*;/ /g' |
# and put space instead of punctuation
sed 's/[;:?,]/ /g' |
# Keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[[:blank:]]*[A-ZÆØÅ]' # Your alphabet here
</pre>
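To see what the link-stripping step does on its own, here is a small sketch that runs only those three `sed` calls on a made-up line. The function name `strip_links` and the sample text are purely illustrative. The sketch uses `[^]|]*` in the first expression, on the assumption that a link target never contains `]`, so that a match cannot run across the end of an unpiped link.

```sh
#!/bin/sh
# strip_links: hypothetical helper for illustration, not part of wikicat.sh.
# Keeps "bar" from [[foo|bar]] and "fie" from [[fie]].
strip_links() {
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g'
}

# Made-up sample line; prints: See fie and bar here
printf '%s\n' 'See [[fie]] and [[foo|bar]] here' | strip_links
```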
count-tokenized.sh:
<pre>
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Print one analysed token per line, so that `wc -l` gives the number of
# tokenised words in the corpus. $1 is the compiled analyser (e.g. nn-nb.automorf.bin).
apertium-destxt | lt-proc "$1" | apertium-retxt |
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
</pre>
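The effect of the two `sed` calls is easiest to see without an analyser installed. The sketch below feeds them a hand-written line in the `^surface/analysis$` stream format; the function name `one_token_per_line` and the tags in the sample are made up for illustration and are not real output from nn-nb.automorf.bin.

```sh
#!/bin/sh
# one_token_per_line: hypothetical name for the two sed calls in the script.
# The first sed drops whatever sits between "$" and the next "^";
# the second puts an (escaped, literal) newline between "$" and "^".
one_token_per_line() {
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

# Made-up analyser output; the tags are illustrative only.
printf '%s\n' '^Dette/dette<prn>$ ^er/vere<vblex>$ ^xyzzy/*xyzzy$' | one_token_per_line
```

Each token ends up on its own line, which is why `wc -l` on the script's output counts tokens.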
To find all tokens from a wiki dump:
<pre>
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l
</pre>
To find all tokens with at least one analysis (naïve coverage):
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
</pre>
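Naïve coverage is the second count divided by the first. As a rough sketch, both counts and the percentage can be had from a single pass over the analysed tokens; the helper name `naive_coverage` and the sample tokens are hypothetical, and the only assumption is that unknown tokens contain `/*`, as in the `grep` above.

```sh
#!/bin/sh
# naive_coverage: hypothetical helper. Reads one analysed token per line on
# stdin and treats any line containing "/*" as unknown.
naive_coverage() {
    awk '
        { total++ }
        !/\/\*/ { known++ }
        END {
            if (total == 0) { print "no tokens"; exit }
            printf "%d of %d tokens analysed (%.2f%%)\n", known, total, 100 * known / total
        }'
}

# Made-up sample: three known tokens and one unknown.
# Prints: 3 of 4 tokens analysed (75.00%)
printf '%s\n' '^a/a<n>$' '^b/*b$' '^c/c<n>$' '^d/d<n>$' | naive_coverage
```

Piping the analysed tokens into `naive_coverage` instead of `wc -l` would avoid running the analyser twice.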
To find the top unknown tokens:
<pre>
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[[:blank:]]*//g' |  # tab or space
  grep '\/\*' | sort -f | uniq -c | sort -gr | head
</pre>
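The same pipeline can be tried on a few hand-written tokens to check that the counting behaves as expected. This is only a sketch: `top_unknown` is a hypothetical wrapper, the sample tokens are made up, and it uses `sort -nr` (the POSIX numeric flag) where the command above uses `-gr`, which should make no difference for integer counts.

```sh
#!/bin/sh
# top_unknown: hypothetical wrapper around the pipeline above.
top_unknown() {
    sed 's/[[:blank:]]*//g' |   # strip tabs and spaces
    grep '/\*' |                # keep only unknown tokens
    sort -f | uniq -c |         # count identical tokens
    sort -nr | head             # most frequent first, top ten
}

# Made-up sample: "xyzzy" is unknown twice, "plugh" once, "a" is known.
printf '%s\n' '^xyzzy/*xyzzy$' '^a/a<n>$' '^plugh/*plugh$' '^xyzzy/*xyzzy$' | top_unknown
```

The output lists each unknown token preceded by its count, with `^xyzzy/*xyzzy$` (count 2) first; the exact column padding depends on the local `uniq`.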