Charlifter
See also
* Half-finished C++ port (https://apertium.svn.sourceforge.net/svnroot/apertium/branches/gsoc2011/charlifter)
* SourceForge: Download charlifter (http://sourceforge.net/projects/lingala/files/charlifter/)
Latest revision as of 02:07, 10 March 2018
Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.
Training
You need a corpus:
* XX-corpus.txt: A clean (orthographically correct) corpus in the language
You need two wordlists:
* XX-clean.txt: A file with a list of known clean words
* XX-prettyclean.txt: A file with a list of probably clean words (can be empty)
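The page does not say how the word lists are produced. As one possible starting point, the sketch below drafts a "probably clean" list by keeping only words that occur at least a few times in the corpus. The one-word-per-line output, the frequency threshold, the tokenisation, and the helper names draft_wordlist and write_wordlist are all assumptions made for illustration; none of them is part of charlifter.

```python
import re
from collections import Counter


def draft_wordlist(corpus_text, min_count=3):
    """Return the words seen at least min_count times, sorted alphabetically.

    Hypothetical helper: the threshold and the tokenisation (lowercased runs
    of letters; combining marks are not handled) are illustrative choices.
    """
    words = re.findall(r"[^\W\d_]+", corpus_text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n >= min_count)


def write_wordlist(corpus_path, out_path, min_count=3):
    """Read a corpus file and write one candidate word per line (assumed layout)."""
    with open(corpus_path, encoding="utf-8") as corpus:
        wordlist = draft_wordlist(corpus.read(), min_count)
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(wordlist) + "\n")
```

Assuming XX stands for the language code, a call such as write_wordlist("cv-corpus.txt", "cv-prettyclean.txt") would produce a draft list that still needs manual review before it is used for training.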
- Process
First run 'make'; you may need to edit the makefile
to change your paths.
Then, put the source files (corpus, wordlists) where sf.pl
can find them, and run:
$ cat cv-training.crp | perl sf.pl -t -l cv
Reading the clean dictionary...
Clean dictionary processed...
Reading the "pretty clean" dictionary...
"Pretty clean" dictionary processed...
Reading the training text...
1000...
2000...
3000...
Training texts processed (35919 words)...
Computing final probabilities...
Dumping plain text hashes to disk...
Then:
$ perl sf.pl -m -l cv
Reading in plain text hashes...
Saving storable hashes to disk...
Now you should be able to use the program.
- Usage
$ cat testing/cv-source.txt | perl sf.pl -r -d . -l cv
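The same invocation can be scripted when the output has to be saved for later comparison. This is a minimal sketch that only wraps the command shown above: the helper names restore_command and restore_file are hypothetical, and treating the argument of -d as the directory holding the trained data is an assumption based on the "-d ." in the example.

```python
import subprocess


def restore_command(lang, model_dir="."):
    """Build the sf.pl invocation shown above for a given language code."""
    return ["perl", "sf.pl", "-r", "-d", model_dir, "-l", lang]


def restore_file(source_path, output_path, lang, model_dir="."):
    """Feed source_path to sf.pl on stdin and save its stdout to output_path."""
    with open(source_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(restore_command(lang, model_dir),
                       stdin=src, stdout=dst, check=True)
```

For example, restore_file("testing/cv-source.txt", "testing/cv-test.txt", "cv") would write the restored text to a file; whether testing/cv-test.txt is meant to hold exactly this output is an assumption, not something the page states.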
- Evaluation
You can use apertium-eval-translator to evaluate the quality of the restoration:
$ apertium-eval-translator -t testing/cv-test.txt -r testing/cv-reference.txt
apertium-eval-translator -t testing/cv-test.txt -r testing/cv-reference.txt
Test file: 'testing/cv-test.txt'
Reference file 'testing/cv-reference.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 804
Number of words in test: 804
Number of unknown words (marked with a star) in test:
Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 108
Word error rate (WER): 13.43 %
Number of position-independent word errors: 108
Position-independent word error rate (PER): 13.43 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 108
Word Error Rate (WER): 13.43 %
Number of position-independent word errors: 108
Position-independent word error rate (PER): 13.43 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0%
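The reported rates can be checked by hand. The sketch below assumes that the error rate is the number of errors divided by the number of words in the reference, which is consistent with the figures above (108 errors over 804 words). The functions edit_distance and error_rate are illustrative and are not taken from apertium-eval-translator.

```python
def edit_distance(ref_words, test_words):
    """Word-level Levenshtein distance; insert, delete and substitute cost 1."""
    prev = list(range(len(test_words) + 1))
    for i, ref in enumerate(ref_words, start=1):
        curr = [i]
        for j, test in enumerate(test_words, start=1):
            curr.append(min(prev[j] + 1,                    # word missing from test
                            curr[j - 1] + 1,                # extra word in test
                            prev[j - 1] + (ref != test)))   # match or substitution
        prev = curr
    return prev[-1]


def error_rate(errors, reference_words):
    """Errors as a percentage of the reference length (assumed definition)."""
    return 100.0 * errors / reference_words


print(f"{error_rate(108, 804):.2f} %")  # prints "13.43 %"
```

Here WER and PER coincide because both counts are 108.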