N-grams

From Apertium

Latest revision as of 13:53, 10 February 2015

Say you have a corpus and an analyser. How do you make a trigram frequency list in three shell commands?

Grab apertium-cleanstream, then do:

bzcat corpus.bz2 | apertium-deshtml | lt-proc foo.bin | apertium-cleanstream -n >corpus.ana
paste corpus.ana <(tail -n+2 corpus.ana) <(tail -n+3 corpus.ana) >corpus.trigrams
sort corpus.trigrams | uniq -c | sort -nr > corpus.trigrams.hitparade
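
The paste approach leaves a few incomplete rows at the end of the trigram file, where the shifted copies run out of lines. If that matters, or if you want something other than trigrams, the following is a rough sketch of the same idea using a sliding window in awk. It assumes the input has one token per line (like corpus.ana); ngram_counts is a made-up helper name, not part of Apertium.

```shell
# ngram_counts is a hypothetical helper, not an Apertium tool.
# Usage: ngram_counts FILE [N]
#   FILE: one token (analysis) per line, e.g. corpus.ana
#   N:    size of the n-gram, defaults to 3
ngram_counts() {
    awk -v n="${2:-3}" '
        {
            # Keep the last n lines in a circular buffer.
            buf[NR % n] = $0
            # Once n lines have been seen, print them tab-separated,
            # oldest first. Incomplete n-grams are never printed.
            if (NR >= n) {
                gram = buf[(NR - n + 1) % n]
                for (i = NR - n + 2; i <= NR; i++)
                    gram = gram "\t" buf[i % n]
                print gram
            }
        }
    ' "$1" | sort | uniq -c | sort -nr
}
```

For example, `ngram_counts corpus.ana 2 > corpus.bigrams.hitparade` would give a bigram list in the same format as the trigram one above.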