From Apertium
Revision as of 02:07, 10 March 2018 by Shardulc (talk | contribs) (GitHub migration)
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.


You need a corpus:

  • XX-corpus.txt: A clean (orthographically correct) corpus in the language

You need two wordlists:

  • XX-clean.txt: A file with a list of known clean words
  • XX-prettyclean.txt: A file with a list of probably clean words (can be empty)
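
Before training, it can help to check how much of the corpus the two wordlists actually cover. The following is a minimal sketch, not part of the tool: it assumes one word per line in each wordlist and plain whitespace tokenisation of the corpus, neither of which is confirmed by this page, and the function names `load_wordlist` and `coverage` are made up for illustration.

```python
from collections import Counter


def load_wordlist(lines):
    """Return the set of non-empty entries (assumption: one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def coverage(corpus_text, clean, prettyclean=frozenset()):
    """Fraction of corpus tokens found in each list.

    Tokens are split on whitespace only (an assumption; the real tool
    may tokenise differently). A token counts as 'clean' first, then
    'prettyclean', otherwise 'unknown'.
    """
    tokens = corpus_text.split()
    counts = Counter()
    for tok in tokens:
        if tok in clean:
            counts["clean"] += 1
        elif tok in prettyclean:
            counts["prettyclean"] += 1
        else:
            counts["unknown"] += 1
    total = len(tokens) or 1  # avoid dividing by zero on an empty corpus
    return {key: counts[key] / total
            for key in ("clean", "prettyclean", "unknown")}


if __name__ == "__main__":
    # Toy in-memory data; replace with the contents of XX-clean.txt,
    # XX-prettyclean.txt and XX-corpus.txt for a real check.
    clean = load_wordlist(["a", "b", ""])
    prettyclean = load_wordlist(["c"])
    print(coverage("a b c d", clean, prettyclean))
```

A high share of unknown tokens would suggest that the wordlists and the corpus do not match well.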

First, run 'make'. You may want to edit the makefile beforehand to change the paths.

Then put the source files (corpus, wordlists) where the program can find them, and run:

$ cat cv-training.crp | perl -t -l cv
Reading the clean dictionary...
Clean dictionary processed...
Reading the "pretty clean" dictionary...
"Pretty clean" dictionary processed...
Reading the training text...
Training texts processed (35919 words)...
Computing final probabilities...
Dumping plain text hashes to disk...


$ perl -m -l cv
Reading in plain text hashes...
Saving storable hashes to disk...

Now you should be able to use the program.

$ cat testing/cv-source.txt  | perl -r -d . -l cv

You can use apertium-eval-translator to evaluate the quality of the restoration:

$ apertium-eval-translator -t testing/cv-test.txt -r testing/cv-reference.txt 
Test file: 'testing/cv-test.txt'
Reference file 'testing/cv-reference.txt'

Statistics about input files
Number of words in reference: 804
Number of words in test: 804
Number of unknown words (marked with a star) in test: 
Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)
Edit distance: 108
Word error rate (WER): 13.43 %
Number of position-independent word errors: 108
Position-independent word error rate (PER): 13.43 %

Results when unknown-word marks (stars) are not removed
Edit distance: 108
Word Error Rate (WER): 13.43 %
Number of position-independent word errors: 108
Position-independent word error rate (PER): 13.43 %

Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%
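
The WER figure in the output above is the edit distance divided by the number of words in the reference: 108 / 804 ≈ 13.43 %. The sketch below illustrates that arithmetic with a word-level Levenshtein distance. It is an illustration under stated assumptions, not the code of apertium-eval-translator: whitespace tokenisation and the function names `edit_distance` and `wer` are assumptions made here.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # word missing from hyp
                           cur[j - 1] + 1,       # extra word in hyp
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate in percent, relative to the reference length."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    if not ref:
        raise ValueError("reference text is empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy example: one wrong word out of four.
    print(wer("a b c d", "a x c d"))
    # The figures reported above: edit distance 108, 804 reference words.
    print(f"{100.0 * 108 / 804:.2f} %")
```

Here every restored word lines up with a reference word (both files have 804 words), so each error is a one-word substitution.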

See also