Difference between revisions of "Aragonese and Catalan/Evaluation"
Latest revision as of 08:55, 16 January 2016
Version 0.1 (Beta)
Naïve coverage
arg-cat
$ cat corpus_narrative.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 378440
Number of known words in the corpus: 337924
Coverage: 89.3 %

$ cat sentencelistanwiki.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 2673751
Number of known words in the corpus: 2344686
Coverage: 87.7 %
cat-arg
$ cat ../apertium-es-ca/ca-tagger-data/ca.tagged.txt | sh corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 24590
Number of known words in the corpus: 22919
Coverage: 93.2 %

trunk/apertium-eo-ca/tekstaro/ca.crp.txt

$ cat ca.crp.txt | sed 's/^ *[0123456789]*\.//g' | sh ./corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 567608
Number of known words in the corpus: 497165
Coverage: 87.6 %
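The coverage figures above are the share of tokens that the analyser recognises: known words divided by tokenised words. The contents of the corpus-stat scripts are not shown on this page, so the following is only a minimal Python sketch of that calculation. It assumes the analysed text is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token carries a `*` after the slash. The names `count_tokens` and `coverage_percent` are hypothetical.

```python
import re

# Assumption: one lexical unit per ^...$ pair, as in the Apertium stream format.
# Escaped delimiters inside a unit are not handled in this sketch.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def count_tokens(analysed):
    """Return (total, known) token counts for a string of analysed text."""
    units = LEXICAL_UNIT.findall(analysed)
    # Assumption: an unknown word is marked as ^word/*word$.
    unknown = sum(1 for unit in units if "/*" in unit)
    return len(units), len(units) - unknown


def coverage_percent(known, total):
    """Naive coverage as a percentage, rounded to one decimal place."""
    return round(100.0 * known / total, 1)


if __name__ == "__main__":
    sample = "^a/a<pr>$ ^casa/casa<n><f><sg>$ ^xyzzy/*xyzzy$"
    total, known = count_tokens(sample)
    print(total, known, coverage_percent(known, total))
    # Recomputing a figure reported on this page from its raw counts:
    print(coverage_percent(337924, 378440))
```

Applying `coverage_percent` to the four pairs of counts reported above gives the same percentages as the script output, which suggests the reported coverage is this simple ratio with no weighting by word frequency or part of speech.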
Translation Quality
cat-arg
$ ../apertium-eval-translator/apertium-eval-translator.pl -test MT.txt -ref postedit.txt
Test file: 'MT.txt'
Reference file 'postedit.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1311
Number of words in test: 1315
Number of unknown words (marked with a star) in test: 156
Percentage of unknown words: 11.86 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 203
Word error rate (WER): 15.48 %
Number of position-independent correct words: 1132
Position-independent word error rate (PER): 13.96 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 254
Word Error Rate (WER): 19.37 %
Number of position-independent correct words: 1081
Position-independent word error rate (PER): 17.85 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 51
Percentage of unknown words that were free rides: 32.69 %
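To make the WER and PER figures above easier to interpret, here is a minimal Python sketch of the two metrics on whitespace-separated tokens. It is not the code of `apertium-eval-translator.pl`. WER is taken as the word-level Levenshtein distance divided by the reference length. PER follows the usual definition, (max(reference length, test length) minus position-independent correct words) divided by the reference length. The reported numbers are consistent with both formulas, but that is inferred from the figures, not from the tool's source. All function names are hypothetical.

```python
from collections import Counter


def edit_distance(ref, test):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(test) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, test_word in enumerate(test, start=1):
            cost = 0 if ref_word == test_word else 1
            curr.append(min(prev[j] + 1,          # word missing from test
                            curr[j - 1] + 1,      # extra word in test
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def position_independent_correct(ref, test):
    """Number of words shared by both token lists, ignoring order."""
    return sum((Counter(ref) & Counter(test)).values())


def wer(ref, test):
    """Word error rate as a percentage of the reference length."""
    return 100.0 * edit_distance(ref, test) / len(ref)


def per(ref, test):
    """Position-independent word error rate (assumed definition, see lead-in)."""
    correct = position_independent_correct(ref, test)
    return 100.0 * (max(len(ref), len(test)) - correct) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sat on mat the".split()
    print(edit_distance(reference, hypothesis))
    print(round(wer(reference, hypothesis), 2))
    print(round(per(reference, hypothesis), 2))
```

In the toy example the hypothesis only swaps two words, so WER penalises it while PER does not. The same gap explains why PER is lower than WER in the output above. In that output, edit distance and correct-word count each differ by 51 between the two blocks, which is exactly the number of free rides: unknown words that match the reference once the star is removed.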