N-grams

From Apertium

Say you have a corpus and an analyser. How do you make a trigram frequency list in three shell commands?

Grab apertium-cleanstream, then do:

bzcat corpus.bz2 | apertium-deshtml | lt-proc foo.bin | apertium-cleanstream -n >corpus.ana
paste corpus.ana <(tail -n +2 corpus.ana) <(tail -n +3 corpus.ana) >corpus.trigrams
sort corpus.trigrams | uniq -c | sort -nr > corpus.trigrams.hitparade
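
To see how the paste step lines tokens up, here is a small sketch on a toy file. The file names toy.ana, toy.trigrams and toy.hitparade and the tokens in them are made up for illustration; a real corpus.ana would hold analyser output instead. It assumes bash, since `<(...)` is process substitution, and relies on `tail -n +K` starting its output at line K, so that offsets +2 and +3 put the next two tokens beside each line.

```shell
#!/usr/bin/env bash
# Toy stand-in for corpus.ana: one token per line (hypothetical data).
printf '%s\n' a b c a b c a > toy.ana

# Column 1 is line i, column 2 is line i+1, column 3 is line i+2.
# paste pads the shorter inputs with empty fields, so the last two
# rows are incomplete trigrams.
paste toy.ana <(tail -n +2 toy.ana) <(tail -n +3 toy.ana) > toy.trigrams

# Count identical rows and list the most frequent first.
sort toy.trigrams | uniq -c | sort -nr > toy.hitparade
cat toy.hitparade
```

On this input the trigrams `a b c` and `b c a` each occur twice. The two padded rows at the end of toy.trigrams are not real trigrams, but on a large corpus they add only two spurious entries with a count of one.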