N-grams

From Apertium
Revision as of 13:53, 10 February 2015 by Unhammer

Say you have a corpus and an analyser. How do you make a trigram frequency list in three shell commands?

Grab apertium-cleanstream, then do:

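# Analyse the corpus; the next step expects one analysed token per line.
bzcat corpus.bz2 | apertium-deshtml | lt-proc foo.bin | apertium-cleanstream -n >corpus.ana
# Put each line next to the two lines that follow it (needs bash for <(...)).
paste corpus.ana <(tail -n+2 corpus.ana) <(tail -n+3 corpus.ana) >corpus.trigrams
# Count identical trigrams and sort by frequency, most frequent first.
sort corpus.trigrams | uniq -c | sort -nr > corpus.trigrams.hitparade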
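
If you want to try the counting step without an analyser, here is a minimal sketch on a toy file. It is an alternative to the paste approach, not the method above: it uses an awk sliding window, which works in plain sh and does not emit the incomplete rows that paste leaves at the end of the file. The file name sample.ana and its contents are made up for illustration and stand in for corpus.ana.

```shell
# Hypothetical stand-in for corpus.ana: one analysed token per line.
tmp=$(mktemp -d)
printf '%s\n' a b c a b c > "$tmp/sample.ana"

# Keep the previous two lines in p2 and p1; from the third line on,
# print each complete trigram as three tab-separated fields.
awk 'NR >= 3 { print p2 "\t" p1 "\t" $0 } { p2 = p1; p1 = $0 }' \
    "$tmp/sample.ana" > "$tmp/sample.trigrams"

# Count identical trigrams, most frequent first.
sort "$tmp/sample.trigrams" | uniq -c | sort -nr > "$tmp/sample.trigrams.hitparade"
```

The six toy tokens give four trigrams, and "a b c" tops the hitparade with a count of 2. The same awk line should work on a real corpus.ana in place of the paste command, assuming each line holds exactly one token.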