Difference between revisions of "Make a frequency list"
Latest revision as of 10:10, 9 August 2018
Forms
The simplest way to make a frequency list (a "hitparade") of plain word forms:
tr -s ' ' '\n' < buncha-words.txt | sort | uniq -c | sort -nr > hitparade.txt
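For comparison, the same count can be sketched in Python with collections.Counter. This is only an illustration of what the pipeline computes, not a replacement for it; the function name frequency_list and the sample sentence are made up for this example. Note that str.split() splits on any whitespace (tabs included), which is slightly broader than splitting on spaces and newlines.

```python
from collections import Counter


def frequency_list(text):
    """Return (form, count) pairs, most frequent first."""
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty tokens.
    counts = Counter(text.split())
    # Sort by descending count, then by form, so ties come out in a fixed order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample text, only to show the shape of the result.
sample = "el gato y el perro y el pez"
for form, count in frequency_list(sample):
    print(f"{count:7d} {form}")
```

To run this over a real file, read its contents into a string first and pass that to frequency_list.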
Lemmas
To make a list of lemmas, you have to run the corpus through a morphological analyser. If you have a Spanish corpus in corpus.spa.bz2, you can get the lemmas by analysing it with apertium-spa, extracting the lemmas with the apertium streamparser, and then doing the sort|uniq|sort dance:
#!/bin/sh
wget -c https://raw.githubusercontent.com/apertium/streamparser/fff6780c420fdf7437495456d65c9e781a0e437c/streamparser.py
bzcat corpus.spa.bz2 \
  | apertium-deshtml \
  | apertium -f none -d /path/to/apertium-spa spa-disam \
  | python3 -c '
import streamparser
import sys
for lu in streamparser.parse_file(sys.stdin):
    print("/".join(sorted(set(sub.baseform for r in lu.readings for sub in r if lu.knownness == streamparser.known))))
' \
  | LC_ALL=C sort \
  | LC_ALL=C uniq -c \
  | LC_ALL=C sort -nr > hitparade.txt
(Setting LC_ALL=C makes sorting byte-wise and therefore the same across different Unix locales, avoiding things like treating 'aa' and 'å' the same in Norwegian.)
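As a rough illustration of what byte-wise ordering means, the sketch below sorts a few made-up words by their UTF-8 bytes, which is the kind of order LC_ALL=C sort produces on UTF-8 input. The word list is invented for this example; no locale-aware comparison is shown, since which locales are installed varies between systems.

```python
# Made-up sample words, including the Norwegian letter from the note above.
words = ["å", "z", "aa", "b"]

# Sorting on the UTF-8 encoding compares raw bytes, with no
# language-specific collation rules applied.
byte_order = sorted(words, key=lambda w: w.encode("utf-8"))
print(byte_order)  # ['aa', 'b', 'z', 'å']
```

Here 'å' lands after 'z' because its UTF-8 encoding starts with a byte above the ASCII range, whatever the language of the corpus.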
Note that using spa-disam in the script above runs the disambiguator, so the list will consist of disambiguated lemmas. If you don't trust your disambiguator, you may want to run just spa-morph instead, which gives you "lemma-sets" (all possible lemmas of each form, joined by "/") in your hitparade. You can also tweak the streamparser script to include the main part of speech (POS) if that's useful, and so on.
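One possible shape for such a tweak is sketched below. It is not tested against streamparser itself: Reading is a hypothetical stand-in for the objects the parser yields, the tags attribute and the idea that the first tag is the main POS are both assumptions, and the sample readings are made up. Only baseform is taken from the script above.

```python
from collections import namedtuple

# Hypothetical stand-in for streamparser's reading objects; the real class
# may differ. Only `baseform` appears in the script above, `tags` is assumed.
Reading = namedtuple("Reading", ["baseform", "tags"])


def lemma_pos_key(readings):
    """Build a 'lemma<pos>/lemma<pos>' key from a list of readings.

    `readings` is a list of readings, each a list of sub-readings,
    mirroring how the script above iterates over lu.readings.
    """
    keys = set()
    for reading in readings:
        for sub in reading:
            # Assumption: the first tag is the main part of speech.
            pos = sub.tags[0] if sub.tags else ""
            keys.add(f"{sub.baseform}<{pos}>" if pos else sub.baseform)
    return "/".join(sorted(keys))


# Made-up ambiguous analysis, for illustration only.
example = [
    [Reading("vino", ["n", "m", "sg"])],
    [Reading("venir", ["vblex", "ifi", "p3", "sg"])],
]
print(lemma_pos_key(example))  # venir<vblex>/vino<n>
```

In the real script, something like lemma_pos_key(lu.readings) would take the place of the "/".join(...) expression, provided the attribute names match the streamparser version in use.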