Difference between revisions of "Calculating coverage"
Latest revision as of 15:18, 10 January 2022
==Simple bidix-trimmed coverage testing==
First install apertium-cleanstream:
<pre>
svn checkout https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/apertium-cleanstream
cd apertium-cleanstream
make
sudo cp apertium-cleanstream /usr/local/bin
</pre>
Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
Then save this as coverage.sh:
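<pre>
#!/bin/bash
# Usage: cat corpus.txt | bash coverage.sh MODE
mode=$1
outfile=/tmp/$mode.clean

# Analyse stdin; the counts below assume one lexical unit per line
apertium -d . "$mode" | apertium-cleanstream -n > "$outfile"

total=$(grep -c '^\^' "$outfile")          # all lexical units
unknown=$(grep -c '/\*' "$outfile")        # unknown to the analyser
bidix_unknown=$(grep -c '/@' "$outfile")   # analysed, but missing from the bidix
# The percentage is computed with the external calc tool
known_percent=$(calc -p "round( 100*($total-$unknown-$bidix_unknown)/$total, 3)")

echo "$known_percent % known tokens ($unknown unknown, $bidix_unknown bidix-unknown of total $total tokens)"
echo "Top unknown words:"
grep '/[*@]' "$outfile" | sort | uniq -c | sort -nr | head
</pre>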
Then run it like this:
<pre>
cat asm.corpus | bash coverage.sh asm-eng-biltrans
</pre>
(The bidix-unknown count should always be 0 if your pair uses automatic analyser trimming.)
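
The percentage in coverage.sh is computed with the external calc tool, which may not be installed everywhere. As a minimal sketch, assuming the cleaned stream has one lexical unit per line (as the grep counts above expect), the same summary can be produced with awk alone. The function name coverage_summary is hypothetical and not part of apertium-tools.

```shell
# Hypothetical helper: reads a cleaned stream on stdin, one lexical unit per line,
# and prints the same summary line as coverage.sh without needing calc.
coverage_summary() {
    awk '
        /^\^/              { total++ }          # every lexical unit starts with ^
        index($0, "/*")    { unknown++ }        # unknown to the analyser
        index($0, "/@")    { bidix_unknown++ }  # missing from the bidix
        END {
            if (total == 0) { print "no tokens found"; exit 1 }
            known = total - unknown - bidix_unknown
            printf "%.3f %% known tokens (%d unknown, %d bidix-unknown of total %d tokens)\n",
                   100 * known / total, unknown, bidix_unknown, total
        }'
}

# Made-up sample stream, for illustration only
printf '%s\n' '^the/the<det>$' '^cat/cat<n>$' '^Apollo/*Apollo$' '^dog/@dog<n>$' |
    coverage_summary
```

In coverage.sh this would replace the four counting lines, reading from $outfile instead of the sample stream.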
==TODO: paradigm-coverage (less naïve)==
On an analysed corpus, we can sum frequencies into bins for each lemma+mainpos. If the analysed corpus contains
<pre>
musa/mus<n><f><sg><def>/muse<vblex><past>
mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>
musene/mus<n><f><pl><def>
</pre>
then the output has
<pre>
3 mus<n><f>
2 muse<vblex>
</pre>
and we can find paradigms that are likely to mess up disambiguation, or where we need to ensure that the bidix contains the highest-frequency paradigm (since the bidix is typically smaller than the monodix).

We could also weight these numbers by the number of unique forms in the pardef; if the verb pardef has 6 unique forms and the noun only 3, then the above output should be even more skewed:
<pre>
0.33 mus<n><f>
0.75 muse<vblex>
</pre>
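
The unweighted binning described above can be sketched with awk. This is only an illustration under stated assumptions: the input has one token per line in the form/analysis/analysis layout of the example, and "mainpos" is approximated by stripping a hand-written list of inflection tags (sg, pl, def, ind, past, imp) that covers only this example. The function name paradigm_bins is hypothetical.

```shell
# Hypothetical helper: count, for each lemma+mainpos, the number of tokens
# that have at least one analysis in that bin.
paradigm_bins() {
    awk -F'/' '
    {
        for (i = 2; i <= NF; i++) {          # field 1 is the surface form
            key = $i
            # Assumed inflection tags; a real tagset needs its own list
            gsub(/<(sg|pl|def|ind|past|imp)>/, "", key)
            if (!(key in seen)) {            # count each bin once per token
                seen[key] = 1
                count[key]++
            }
        }
        split("", seen)                      # portable way to clear the array
    }
    END { for (k in count) print count[k], k }' | sort -nr
}

printf '%s\n' \
    'musa/mus<n><f><sg><def>/muse<vblex><past>' \
    'mus/mus<n><f><sg><ind>/mus<n><f><pl><ind>/muse<vblex><imp>' \
    'musene/mus<n><f><pl><def>' |
    paradigm_bins
```

Counting each bin once per token is what gives 3 rather than 4 for the noun here, since the second token has two noun analyses.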
==Faster coverage testing with frequency lists==
If words appear several times in your corpus, why bother analysing them several times? We can make a frequency list first and then add up the frequencies. This script does some very naive tokenisation and creates a frequency list:

make-freqlist.sh:
<pre>
#!/bin/bash
if [[ -t 0 ]]; then
    echo "Expecting a corpus on stdin"
    exit 2
fi
# Split on whitespace and punctuation, drop empty lines, count, most frequent first
tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
</pre>
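
To see what this tokenisation does in practice, here is a small worked example that runs the same pipeline on one made-up sentence. The wrapper function name make_freqlist is only for illustration.

```shell
# Same pipeline as make-freqlist.sh, wrapped in a function for the example
make_freqlist() {
    tr '[:space:][:punct:]' '\n' | grep . | sort | uniq -c | sort -nr
}

# The apostrophe counts as punctuation, so "cat's" becomes "cat" and "s",
# and "The" and "the" end up as separate entries because case is kept.
printf '%s\n' "The cat's toy, the cat's bed." | make_freqlist
```

For this input both "cat" and "s" are counted twice and every other word once, which shows why the resulting list is only a rough approximation of what the analyser would tokenise.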
And this script runs your analyser, summing up the frequencies:

freqlist-coverage.sh:
<pre>
#!/bin/bash
set -e -u

if [[ $# -eq 0 || -t 0 ]]; then
    echo "Expecting apertium arguments and a 'sort|uniq -c|sort -nr' style frequency list on stdin"
    echo "For example:"
    echo "\$ < spa.freqlist $0 -d . spa-morph"
    exit 2
fi

# Wrap each count in <apertium-notrans> so that it is carried through untouched,
# and end each line with " ."
sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%' |
    apertium -f html-noent "$@" |
    awk -F'</?apertium-notrans>| *\\^\\./\\.<sent><clb>\\$' '
    # $2 is the count, $3 the analysed token
    /[/][*@]/ {
        unknown+=$2
        if(!printed) print "Top unknown tokens:"
        if(++printed<10) print $2,$3
        next
    }
    {
        known+=$2
    }
    END {
        total=known+unknown
        known_pct=100*known/total
        unk_pct=100*unknown/total
        print known_pct" % known of total "total" tokens"
    }'
</pre>

Usage:
<pre>
$ chmod +x make-freqlist.sh freqlist-coverage.sh
$ bzcat ~/corpora/nno.txt.bz2 | ./make-freqlist.sh > nno.freqlist
$ < nno.freqlist ./freqlist-coverage.sh -d ~/apertium-svn/languages/apertium-nno/ nno-morph
</pre>
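
The two non-Apertium steps of freqlist-coverage.sh can be tried without an analyser installed. The sketch below reuses the sed wrapping from the script and then sums counts over mocked analyser output; the mocked lines are an assumption about what the analyser prints and may differ from real output. The function names wrap_counts and sum_freqlist_coverage are hypothetical, and the awk part uses a simplified field separator rather than the one in the script.

```shell
# Step 1: wrap the count of each frequency-list line (same sed as the script)
wrap_counts() {
    sed 's%^ *%<apertium-notrans>%;s% %</apertium-notrans>%;s%$% .%'
}

# Step 2: sum counts per known/unknown token. Splitting only on the notrans
# tags leaves the count in $2 and the analysed text in $3.
sum_freqlist_coverage() {
    awk -F'</?apertium-notrans>' '
        $3 ~ /\/[*@]/ { unknown += $2; next }   # analysis starts with * or @
                      { known += $2 }
        END {
            total = known + unknown
            if (total == 0) { print "no tokens found"; exit 1 }
            printf "%.1f %% known of total %d tokens\n", 100 * known / total, total
        }'
}

printf '%s\n' '      3 the' | wrap_counts

# Mocked analyser output (assumed format, for illustration only)
printf '%s\n' \
    '<apertium-notrans>3</apertium-notrans>^the/the<det><def><sp>$ ^./.<sent><clb>$' \
    '<apertium-notrans>1</apertium-notrans>^Apollo/*Apollo$ ^./.<sent><clb>$' |
    sum_freqlist_coverage
```

Because each line carries its own count, a word that occurs thousands of times in the corpus is analysed only once, which is the point of the frequency-list approach.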
==coverage.py==
https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/coverage.py is a coverage script that wraps curl and bzcat.
Note - After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.

== See also ==
* [[Wikipedia dumps]]
* [[Cleanstream]]

[[Category:Documentation]]
[[Category:Documentation in English]]