Difference between revisions of "Calculating coverage"

Revision as of 09:46, 26 March 2009

Notes on calculating coverage from Wikipedia dumps (based on Asturian#Calculating coverage).

(Mac OS X `sed' doesn't allow \n in replacements, so I use an actual (backslash-escaped) newline instead.)

wikicat.sh:

#!/bin/sh
# clean up wiki for running through apertium-destxt

# awk prints full lines, so make sure each XML tag is on a line of its own
bzcat "$@" | sed 's/>/>\
/g' | sed 's/</\
</g' |
# want only stuff between <text...> and </text>
awk '
/<text.*>/,/<\/text>/ { print $0 }
' |
# put space instead of full stops
sed 's/\./ /g' |
# wiki markup, retain bar and fie from [[foo|bar]] [[fie]]
# ([^]|]* stops at the first ] or |, so a link without a bar is never swallowed)
sed 's/\[\[[^]|]*|//g' | sed 's/\]\]//g' | sed 's/\[\[//g' |
# remove entities one at a time (a greedy .* would eat everything from the first & to the last ;)
sed 's/&[^;&]*;/ /g' |
# and put space instead of the remaining punctuation
sed 's/[;:?,]/ /g' |
# keep only lines starting with a capital letter, removing tables with style info etc.
grep '^[ 	]*[A-ZÆØÅ]' # Your alphabet here

count-tokenized.sh:

#!/bin/sh
# http://wiki.apertium.org/wiki/Asturian#Calculating_coverage

# Print one analysed token per line, so that `wc -l' gives the number of tokenised words in the corpus:
apertium-destxt | lt-proc "$1" | apertium-retxt |
# for some reason putting the newline in directly doesn't work, so two seds
sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
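
The two-step sed can be tried without a compiled dictionary. The following is a minimal sketch: the sample line and its tags are made up for illustration (real lt-proc output depends on the dictionary), and split_tokens is a hypothetical helper name that just wraps the same two sed commands.

```shell
#!/bin/sh
# Made-up sample of lt-proc output: two known tokens and one unknown (marked /*).
sample='^Dette/Dette<prn>$ ^er/vere<vblex>$ ^xyzzy/*xyzzy$'

# Same two seds as in count-tokenized.sh: first drop whatever sits between
# the end of one token ($) and the start of the next (^), then split "$^"
# with an escaped literal newline.
split_tokens() {
    sed 's/\$[^^]*\^/$^/g' | sed 's/\$\^/$\
^/g'
}

printf '%s\n' "$sample" | split_tokens
```

Each token ends up on a line of its own, which is what makes counting with wc -l possible.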

To count all tokens in a wiki dump:

$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | wc -l

To count all tokens with at least one analysis (naïve coverage):

$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
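
Naïve coverage is then the second count divided by the first. A minimal sketch of that arithmetic follows; the function name coverage and the two counts are hypothetical, standing in for the wc -l results.

```shell
#!/bin/sh
# coverage TOTAL KNOWN: print KNOWN/TOTAL as a percentage with two decimals.
coverage() {
    awk -v total="$1" -v known="$2" 'BEGIN {
        if (total == 0) { print "0.00"; exit }   # avoid dividing by zero on an empty corpus
        printf "%.2f\n", 100 * known / total
    }'
}

# Hypothetical counts: 200000 tokens in total, 171000 with at least one analysis.
coverage 200000 171000   # prints 85.50
```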

To find the top unknown tokens:

$ cat nnwiki.cleaned.txt | ./count-tokenized.sh nn-nb.automorf.bin | sed 's/[ 	]*//g' | # tab or space
   grep '\/\*' | sort -f | uniq -c | sort -gr | head
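
To see what the ranking step produces, it can be run on a few hand-written tokens. This is only a sketch: the tokens are made up, top_unknown is a hypothetical wrapper name, and it uses sort -nr where the command above uses sort -gr (for whole-number counts the two should order the lines the same way).

```shell
#!/bin/sh
# Keep unknown tokens (marked /*), count duplicates, most frequent first.
top_unknown() {
    grep '/\*' | sort -f | uniq -c | sort -nr | head
}

# Made-up tokenised input, one token per line: "foo" is unknown twice,
# "bar" once, and "er" has an analysis so it is filtered out.
printf '%s\n' '^foo/*foo$' '^er/vere<vblex>$' '^bar/*bar$' '^foo/*foo$' | top_unknown
```

The first column of the output is the frequency and the second is the unknown token, so the words most worth adding to the dictionary come first.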

@@ Line 40: / Line 40: @@
 To find all tokens from a wiki dump:
+<pre>
-<pre>$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | wc -l</pre>
+$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 > nnwiki.cleaned.txt
+cat nnwiki.cleaned.txt | ./count-correct.sh nn-nb.automorf.bin | wc -l
+</pre>
 To find all tokens with at least one analysis (naïve coverage):
+<pre>
-<pre>$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l</pre>
+$ cat nnwiki.cleaned.txt  | ./count-correct.sh nn-nb.automorf.bin | grep -v '\/\*' | wc -l
+</pre>
 To find the top unknown tokens:
+<pre>
-<pre>$ ./wikicat.sh nnwiki-20090119-pages-articles.xml.bz2 | ./count-correct.sh nn-nb.automorf.bin | sed 's/[ 	]*//g' |\ # tab or space
+$ cat nnwiki.cleaned.txt  | ./count-correct.sh nn-nb.automorf.bin | sed 's/[ 	]*//g' |\ # tab or space
-   grep '\/\*' | sort -f | uniq -c | sort -gr | head </pre>
+   grep '\/\*' | sort -f | uniq -c | sort -gr | head
+</pre>