Aragonese and Catalan/Evaluation

From Apertium

Latest revision as of 08:55, 16 January 2016

Version 0.1 (Beta)

Naïve coverage

arg-cat

$ cat corpus_narrative.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 378440
Number of known words in the corpus: 337924
Coverage:     89.3 %

$ cat sentencelistanwiki.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 2673751
Number of known words in the corpus: 2344686
Coverage:     87.7 %

cat-arg

$ cat ../apertium-es-ca/ca-tagger-data/ca.tagged.txt | sh corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 24590
Number of known words in the corpus: 22919
Coverage:     93.2 %

Corpus: trunk/apertium-eo-ca/tekstaro/ca.crp.txt
$ cat ca.crp.txt | sed 's/^ *[0123456789]*\.//g' | sh ./corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 567608
Number of known words in the corpus: 497165
Coverage:     87.6 %
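
The coverage figures above are the ratio of known words to tokenised words, expressed as a percentage. As a sanity check, a minimal Python sketch can recompute them from the reported counts. The function name `coverage` and the `runs` table are illustrative assumptions; this is not the content of the `corpus-stat-*.sh` scripts, which are not shown on this page.

```python
def coverage(known: int, total: int) -> float:
    """Return naive coverage as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * known / total, 1)


# (known words, tokenised words) pairs copied from the runs reported above.
# The labels are only for display; they are not produced by the scripts.
runs = {
    "arg-cat corpus_narrative.txt": (337924, 378440),
    "arg-cat sentencelistanwiki.txt": (2344686, 2673751),
    "cat-arg ca.tagged.txt": (22919, 24590),
    "cat-arg ca.crp.txt": (497165, 567608),
}

for name, (known, total) in runs.items():
    print(f"{name}: {coverage(known, total)} %")
```

Each recomputed value agrees with the percentage printed in the corresponding run.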

Translation Quality

cat-arg

$ ../apertium-eval-translator/apertium-eval-translator.pl -test MT.txt -ref postedit.txt
Test file: 'MT.txt'
Reference file 'postedit.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1311
Number of words in test: 1315
Number of unknown words (marked with a star) in test: 156
Percentage of unknown words: 11.86 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 203
Word error rate (WER): 15.48 %
Number of position-independent correct words: 1132
Position-independent word error rate (PER): 13.96 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 254
Word Error Rate (WER): 19.37 %
Number of position-independent correct words: 1081
Position-independent word error rate (PER): 17.85 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 51
Percentage of unknown words that were free rides: 32.69 %
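
The percentages in this report can be rederived from the raw counts. The following Python sketch shows a word-level edit distance and the ratios involved. It is not the code of `apertium-eval-translator.pl`: the helper names are hypothetical, and the PER formula, `(max(n_test, n_ref) - n_correct) / n_ref`, is an assumption inferred from the figures above rather than taken from the tool's source.

```python
def word_edit_distance(test_words, ref_words):
    """Levenshtein distance over word tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(ref_words) + 1))
    for i, t in enumerate(test_words, start=1):
        curr = [i]
        for j, r in enumerate(ref_words, start=1):
            cost = 0 if t == r else 1
            curr.append(min(prev[j] + 1,          # drop a word from the test side
                            curr[j - 1] + 1,      # add the reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the report."""
    return round(100.0 * numerator / denominator, 2)


def wer(edit_distance, n_ref):
    """Word error rate: edit distance relative to the reference length."""
    return pct(edit_distance, n_ref)


def per(n_correct, n_test, n_ref):
    """Position-independent error rate (formula inferred from the report, see above)."""
    return pct(max(n_test, n_ref) - n_correct, n_ref)


# Counts copied from the report above.
N_REF, N_TEST, N_UNKNOWN, N_FREE_RIDES = 1311, 1315, 156, 51

print("Unknown words:", pct(N_UNKNOWN, N_TEST), "%")
print("WER, stars removed:", wer(203, N_REF), "%")
print("PER, stars removed:", per(1132, N_TEST, N_REF), "%")
print("WER, stars kept:", wer(254, N_REF), "%")
print("PER, stars kept:", per(1081, N_TEST, N_REF), "%")
print("Free rides:", pct(N_FREE_RIDES, N_UNKNOWN), "%")
```

Under these assumptions the recomputed values agree with every percentage in the report. Note that the free-ride percentage is relative to the number of unknown words, not to the length of the test file.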