Calculating coverage
Revision as of 19:37, 25 March 2009
Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).
(Mac OS X `sed' doesn't allow \n in replacements, so the scripts below use an actual newline escaped with a backslash instead.)
wikicat.sh:
#!/bin/sh
# Clean up a wiki dump for running through apertium-destxt.

# awk prints full lines, so make sure each HTML element is on a line of its own
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |
# keep only the text between <text...> and </text>
awk '/<text.*>/,/<\/text>/ { print $0 }' |
sed 's/\./ /g' |
# wiki markup: retain bar and fie from [[foo|bar]] [[fie]]
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# remove entities
sed 's/&[#A-Za-z0-9]*;/ /g' |
# put a space instead of punctuation
sed 's/[;:?,]/ /g' |
# keep only lines starting with a capital letter, dropping tables with style info etc.
grep '^[ ]*[A-ZÆØÅ]'   # your alphabet here
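To see what the wiki-markup step does, here is a minimal sketch that runs only the link-stripping seds on one sample line. The function name strip_links and the sample text are made up for illustration. The first pattern excludes `]` as well as `|`, so that a plain link earlier on the line is not swallowed together with a later piped link.

```shell
#!/bin/sh
# strip_links (hypothetical helper): keep "bar" from [[foo|bar]]
# and "fie" from [[fie]], dropping the brackets and the link target.
strip_links() {
    sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g'
}

# Made-up sample line.
printf '%s\n' 'See [[fie]] and [[foo|bar]]' | strip_links
# prints: See fie and bar
```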
count-tokenized.sh (the commands below call it count-correct.sh, so save it under that name or adjust the commands):
#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Calculate the number of tokenised words in the corpus:
apertium-destxt | lt-proc "$1" | apertium-retxt |
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
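The two seds at the end of the script can be tried on their own, without an analyser installed. The sketch below wraps them in a function (split_units is a hypothetical name) and feeds it a made-up line in lt-proc output format; the tags in the sample are illustrative, not real nn-nb analyses.

```shell
#!/bin/sh
# split_units (hypothetical helper): the two seds from the script above.
# The first collapses whatever sits between "$" and the next "^" into "$^";
# the second puts a newline between them, so each lexical unit gets a line.
split_units() {
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

# Made-up sample; /* marks a word the analyser does not know.
sample='^eit/ein<det>$ ^hus/hus<n>$ ^xyzzy/*xyzzy$'

printf '%s\n' "$sample" | split_units
```

With one unit per line, `wc -l` counts tokens and `grep '/\*'` selects the unknown ones, which is what the commands on this page rely on.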
To count all tokens in a wiki dump:
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | wc -l
To count the known tokens (lt-proc marks unknown words with /*, so those lines are filtered out):
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
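Coverage is then the second count divided by the first. A minimal sketch of that last step, assuming the two `wc -l` figures have been noted down; coverage_percent is a hypothetical helper name and the numbers in the call are placeholders, not results from any real dump.

```shell
#!/bin/sh
# coverage_percent KNOWN TOTAL (hypothetical helper): known tokens as a
# percentage of all tokens, printed with two decimals.
coverage_percent() {
    awk -v known="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", 100 * known / total
    }'
}

# Placeholder counts; substitute the output of the two commands above.
coverage_percent 3 4
```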
To find the top unknown tokens (the sed strips tabs and spaces):
$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | sed 's/[[:blank:]]*//g' |
grep '\/\*' | sort -f | uniq -c | sort -gr | head
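The ranking part of that pipeline can be checked on a handful of lines before running it over a whole dump. This is only a sketch: top_unknown is a hypothetical name, the sample units are made up, and it uses `sort -nr` (the POSIX spelling) where the command above uses `-gr`; for integer counts the two sort identically.

```shell
#!/bin/sh
# top_unknown (hypothetical helper): rank unknown tokens, most frequent first.
# Expects one lexical unit per line on standard input.
top_unknown() {
    grep '/\*' | sort -f | uniq -c | sort -nr | head
}

# Made-up sample in lt-proc output format; /* marks an unknown word.
printf '%s\n' '^foo/*foo$' '^hus/hus<n>$' '^bar/*bar$' '^foo/*foo$' | top_unknown
```

Here `^foo/*foo$` comes out on top with a count of 2. Note that `uniq -c` compares case-sensitively, so `Foo` and `foo` are counted separately even though `sort -f` places them next to each other.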