Make a frequency list

From Apertium
Latest revision as of 10:10, 9 August 2018

Forms

The simplest way to make a frequency list (a "hitparade") of surface forms only:

cat buncha-words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr > hitparade.txt
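
For comparison, the same count can be sketched in Python. This is only an illustration of what the pipeline does, not a replacement for it: the function name form_frequencies and the sample sentence are made up for the example, and it splits on any whitespace rather than only on spaces. Ties between equally frequent forms may come out in a different order than with sort -nr.

```python
from collections import Counter


def form_frequencies(text):
    """Count whitespace-separated forms, most frequent first."""
    counts = Counter(text.split())
    # Descending count first, then the form itself, so ties are ordered predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample input; in practice the text would come from the corpus file.
sample = "the cat saw the dog and the bird"
for form, count in form_frequencies(sample):
    print(count, form)
```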

Lemmas

If you want a list of lemmas instead, you have to run the corpus through an analyser. For a Spanish corpus in corpus.spa.bz2, analyse it with apertium-spa, extract the lemmas with the apertium streamparser, and then count them with the same sort | uniq -c | sort pipeline:

#!/bin/sh

wget -c https://raw.githubusercontent.com/apertium/streamparser/fff6780c420fdf7437495456d65c9e781a0e437c/streamparser.py

bzcat corpus.spa.bz2                                      \
    | apertium-deshtml                                    \
    | apertium -f none -d /path/to/apertium-spa spa-disam \
    | python3 -c '
import streamparser
import sys
for lu in streamparser.parse_file(sys.stdin):
    print("/".join(sorted(set(sub.baseform for r in lu.readings for sub in r if lu.knownness == streamparser.known))))
'                                                         \
    | LC_ALL=C sort                                       \
    | LC_ALL=C uniq -c                                    \
    | LC_ALL=C sort -nr > hitparade.txt

Setting LC_ALL=C makes sorting behave the same across different Unix locales, avoiding things like 'aa' and 'å' being treated as the same in Norwegian.
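
As a rough illustration of locale-independent ordering: Python's default string sort compares by code point, which for UTF-8 text gives the same order as a byte-wise comparison. The word list below is invented for the example.

```python
# Hypothetical sample words; 'å' (U+00E5) has a higher code point than any ASCII letter.
words = ["å", "aa", "z", "ab"]

# sorted() on str compares code points and does not consult the locale,
# so 'aa' and 'å' stay distinct and 'å' sorts after 'z'.
by_code_point = sorted(words)
print(by_code_point)
```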

Note that using spa-disam in the script above runs the disambiguator, so the list will be of disambiguated lemmas. If you don't trust your disambiguator, you may want to run just spa-morph instead, which gives you "lemma-sets" (all possible lemmas for a form, joined with "/") in your hitparade. You can also tweak the streamparser script to include the main part of speech (POS) if that's useful, etc.
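
One possible shape for the POS tweak is sketched below. It assumes that each subreading exposes a baseform and a list of tags whose first element is the main POS. The SReading namedtuple here is a stand-in defined for the example so that it runs without streamparser, and lemma_pos_key and the sample readings are hypothetical.

```python
from collections import namedtuple

# Stand-in for a streamparser subreading (assumption: fields baseform and tags).
SReading = namedtuple("SReading", ["baseform", "tags"])


def lemma_pos_key(readings):
    """Join the distinct lemma<pos> pairs of all readings with "/"."""
    pairs = set()
    for reading in readings:
        for sub in reading:
            # Treat the first tag as the main POS; fall back to "?" if there are no tags.
            main_pos = sub.tags[0] if sub.tags else "?"
            pairs.add(f"{sub.baseform}<{main_pos}>")
    return "/".join(sorted(pairs))


# Hypothetical ambiguous analysis of one form: a noun reading and a verb reading.
readings = [
    [SReading("vino", ["n", "m", "sg"])],
    [SReading("venir", ["vblex", "ifi", "p3", "sg"])],
]
print(lemma_pos_key(readings))
```

In the script above, the same idea would replace the set of sub.baseform values inside the print call, with the rest of the pipeline unchanged.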