N-grams
Say you have a corpus and an analyser. How do you make a trigram frequency list in three shell commands?
Grab apertium-cleanstream, then do:
bzcat corpus.bz2 | apertium-deshtml | lt-proc foo.bin | apertium-cleanstream -n > corpus.ana
paste corpus.ana <(tail -n +2 corpus.ana) <(tail -n +3 corpus.ana) > corpus.trigrams
sort corpus.trigrams | uniq -c | sort -nr > corpus.trigrams.hitparade
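To see what the paste and tail step produces, here is a minimal sketch on a toy token list. The file names (toy.ana and friends) and the seven tokens are made up for illustration and stand in for corpus.ana; temporary files replace the `<(...)` process substitution so the sketch also runs in a plain POSIX shell. The awk filter is an addition, not part of the recipe above: it drops the last two rows, which are incomplete because the shifted columns run out of tokens.

```shell
# Toy stand-in for corpus.ana: one token per line (hypothetical data).
printf '%s\n' the cat sat on the cat sat > toy.ana

# tail -n +K prints from line K onwards, so +2 skips one token and +3 skips two.
tail -n +2 toy.ana > toy.shift1
tail -n +3 toy.ana > toy.shift2

# Each row is now: token, next token, token after that (tab-separated).
paste toy.ana toy.shift1 toy.shift2 > toy.trigrams

# Optional extra step: keep only rows whose third column is non-empty,
# then count identical rows and list the most frequent first.
awk -F'\t' '$3 != ""' toy.trigrams | sort | uniq -c | sort -nr > toy.trigrams.hitparade

cat toy.trigrams.hitparade
```

On this input the trigram "the cat sat" occurs twice and comes out on top; the other three complete trigrams occur once each. On a real corpus the two incomplete rows at the end hardly affect the counts, which is presumably why the three-command version does not bother to filter them.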