Aragonese and Catalan/Evaluation

From Apertium
Revision as of 08:54, 16 January 2016 by Juanpabl

Version 0.1 (Beta)

Naïve coverage

arg-cat

$ cat corpus_narrative.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 378440
Number of known words in the corpus: 337924
Coverage:     89.3 %

$ cat sentencelistanwiki.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 2673751
Number of known words in the corpus: 2344686
Coverage:     87.7 %

cat-arg

$ cat ../apertium-es-ca/ca-tagger-data/ca.tagged.txt | sh corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 24590
Number of known words in the corpus: 22919
Coverage:     93.2 %

Corpus: trunk/apertium-eo-ca/tekstaro/ca.crp.txt (sed strips the leading numeric prefixes such as "12." from each line)
$ cat ca.crp.txt | sed 's/^ *[0123456789]*\.//g' | sh ./corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 567608
Number of known words in the corpus: 497165
Coverage:     87.6 %
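
The coverage percentages above are the ratio of known words to tokenised words. As a quick sanity check, the following minimal sketch recomputes them from the reported counts. The function name `coverage` and the `RUNS` table are illustrative only; they are not part of the corpus-stat scripts, whose internals are not shown on this page.

```python
def coverage(known: int, total: int) -> float:
    """Naive coverage: known words as a percentage of tokenised words (one decimal)."""
    return round(100.0 * known / total, 1)


# (known words, tokenised words) copied from the runs reported above.
RUNS = {
    "arg-cat, corpus_narrative.txt": (337924, 378440),
    "arg-cat, sentencelistanwiki.txt": (2344686, 2673751),
    "cat-arg, ca.tagged.txt": (22919, 24590),
    "cat-arg, ca.crp.txt": (497165, 567608),
}

for name, (known, total) in RUNS.items():
    print(f"{name}: {coverage(known, total):.1f} %")
```

All four recomputed values agree with the figures printed by the scripts (89.3, 87.7, 93.2 and 87.6).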

Translation quality

cat-arg

$ ../apertium-eval-translator/apertium-eval-translator.pl -test MT.txt -ref postedit.txt
Test file: 'MT.txt'
Reference file 'postedit.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1311
Number of words in test: 1315
Number of unknown words (marked with a star) in test: 156
Percentage of unknown words: 11.86 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 203
Word error rate (WER): 15.48 %
Number of position-independent correct words: 1132
Position-independent word error rate (PER): 13.96 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 254
Word Error Rate (WER): 19.37 %
Number of position-independent correct words: 1081
Position-independent word error rate (PER): 17.85 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 51
Percentage of unknown words that were free rides: 32.69 %
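
To make the relationship between the reported counts and percentages explicit, here is a minimal sketch that recomputes them. It assumes that WER is the word-level edit distance divided by the number of reference words, and that PER is (test words minus position-independent correct words) divided by reference words; these formulas reproduce the figures above but are not taken from the apertium-eval-translator source. The helpers `percent` and `word_edit_distance` are hypothetical names used for illustration.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


def word_edit_distance(test: list, ref: list) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(ref) + 1))
    for i, test_word in enumerate(test, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if test_word == ref_word else 1
            current.append(min(previous[j] + 1,          # drop test_word
                               current[j - 1] + 1,       # add ref_word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


# Counts copied from the report above.
REF_WORDS, TEST_WORDS, UNKNOWN, FREE_RIDES = 1311, 1315, 156, 51

unknown_rate = percent(UNKNOWN, TEST_WORDS)
wer_no_stars = percent(203, REF_WORDS)
per_no_stars = percent(TEST_WORDS - 1132, REF_WORDS)  # assumed PER formula
wer_stars = percent(254, REF_WORDS)
per_stars = percent(TEST_WORDS - 1081, REF_WORDS)     # assumed PER formula
free_ride_rate = percent(FREE_RIDES, UNKNOWN)

print(unknown_rate, wer_no_stars, per_no_stars, wer_stars, per_stars, free_ride_rate)
```

Under these assumptions the recomputed values match the report. Note also that the two edit distances differ by 254 - 203 = 51, the number of free rides. That is what one would expect if a free ride is an unknown word that matches the reference once its star is removed, although the tool's exact definition is not given on this page.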