Make a frequency list
The simple way to make a frequency list (a "hitparade") of surface forms only:
cat buncha-words.txt | tr ' ' '\n' | sort | uniq -c | sort -nr > hitparade.txt
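For comparison, the same counting can be sketched in Python. This is a rough equivalent rather than a drop-in replacement: it splits on any whitespace and drops empty tokens, whereas the `tr` pipeline splits on spaces only. The function names `count_forms` and `format_hitparade` are made up for this illustration.

```python
from collections import Counter


def count_forms(text):
    """Count whitespace-separated surface forms in a string."""
    return Counter(text.split())


def format_hitparade(counts):
    """Render counts roughly like `uniq -c | sort -nr`: highest count first."""
    # Sort by descending count, then by form so that ties come out in a fixed order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%7d %s" % (n, form) for form, n in ranked]


if __name__ == "__main__":
    sample = "the cat saw the dog and the bird"
    for line in format_hitparade(count_forms(sample)):
        print(line)
```

To run it on a real file, read the file contents and pass them to `count_forms` instead of the sample string.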
If you want a list of lemmas instead, you'll have to run the text through an analyser. For example, if you have a Spanish corpus in corpus.spa.bz2, you can analyse it with apertium-spa, extract the lemmas with the Apertium streamparser, and then do the sort|uniq|sort dance:
#!/bin/sh
# Fetch a pinned version of streamparser.py into the current directory
wget -c https://raw.githubusercontent.com/apertium/streamparser/fff6780c420fdf7437495456d65c9e781a0e437c/streamparser.py
# Decompress, strip HTML, analyse and disambiguate, print lemmas of known words, then count
bzcat corpus.spa.bz2 \
  | apertium-deshtml \
  | apertium -f none -d /path/to/apertium-spa spa-disam \
  | python3 -c '
import streamparser
import sys
for lu in streamparser.parse_file(sys.stdin):
    print("/".join(set(sub.baseform for r in lu.readings for sub in r if lu.knownness == streamparser.known)))
' \
  | LC_ALL=C sort \
  | LC_ALL=C uniq -c \
  | LC_ALL=C sort -nr > hitparade.txt
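To make the lemma-extraction step less of a black box, here is a minimal sketch of what it does, written without streamparser. It assumes the usual Apertium stream format, where a lexical unit looks like `^surface/lemma<tag><tag>$`, unknown words are marked with `*`, and joined sub-readings are separated by `+`. It ignores escaped characters and multiword details, so use streamparser for real corpora. The sample input, its tags, and the function names are illustrative only. Unlike the script above, the sketch skips unknown words entirely instead of emitting an empty line for them.

```python
import re
from collections import Counter

# Matches one lexical unit such as ^casas/casa<n><f><pl>$ (escapes are ignored).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def lemmas_of(unit_body):
    """Return the sorted lemmas of one lexical unit, or [] if it is unknown."""
    readings = unit_body.split("/")[1:]  # the first field is the surface form
    lemmas = set()
    for reading in readings:
        if reading.startswith("*"):  # "*" marks an unknown word
            continue
        for sub in reading.split("+"):  # joined sub-readings
            lemmas.add(sub.split("<", 1)[0])  # lemma is the text before the first tag
    return sorted(lemmas)


def count_lemmas(stream_text):
    """Count lemmas (ambiguous readings joined with "/") over a whole stream."""
    counts = Counter()
    for match in LEXICAL_UNIT.finditer(stream_text):
        lemmas = lemmas_of(match.group(1))
        if lemmas:
            counts["/".join(lemmas)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "^Las/el<det><def><f><pl>$ ^casas/casa<n><f><pl>$ "
        "^son/ser<vbser><pri><p3><pl>$ ^las/el<det><def><f><pl>$ ^xyzzy/*xyzzy$"
    )
    for lemma, n in count_lemmas(sample).most_common():
        print(n, lemma)
```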
(The LC_ALL=C setting makes sorting behave the same across different Unix locales, avoiding things like 'aa' and 'å' being treated as the same in Norwegian.)
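As a small illustration of what byte-wise ordering means, the sketch below sorts strings by their UTF-8 bytes, which is the order `LC_ALL=C sort` would use on UTF-8 input. No language-specific collation rules are applied, so 'aa' and 'å' stay distinct and 'å' sorts after 'z'. The helper name `c_locale_sort` is made up for this example.

```python
def c_locale_sort(words):
    """Sort strings by their UTF-8 byte sequences, ignoring locale collation."""
    return sorted(words, key=lambda w: w.encode("utf-8"))


if __name__ == "__main__":
    for word in c_locale_sort(["å", "aa", "z", "ab"]):
        print(word)
```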