Aragonese and Catalan/Evaluation

Version 0.1 (Beta)

Naïve coverage

arg-cat

$ cat corpus_narrative.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 378440
Number of known words in the corpus: 337924
Coverage:     89.3 %

$ cat sentencelistanwiki.txt | sh corpus-stat-arg-cat.sh
Number of tokenised words in the corpus: 2673751
Number of known words in the corpus: 2344686
Coverage:     87.7 %

cat-arg

$ cat ../apertium-es-ca/ca-tagger-data/ca.tagged.txt | sh corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 24590
Number of known words in the corpus: 22919
Coverage:     93.2 %

Corpus: trunk/apertium-eo-ca/tekstaro/ca.crp.txt (leading line numbers such as "12." are stripped with sed before counting)
$ cat ca.crp.txt | sed 's/^ *[0123456789]*\.//g' | sh ./corpus-stat-cat-arg.sh
Number of tokenised words in the corpus: 567608
Number of known words in the corpus: 497165
Coverage:     87.6 %
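
The percentages above are consistent with naïve coverage being the share of tokenised words that the analyser recognises (known words divided by tokenised words). A minimal Python sketch that recomputes the four figures from the counts printed above; the function name `naive_coverage` and the dictionary labels are illustrative and are not part of the corpus-stat scripts:

```python
def naive_coverage(known: int, total: int) -> float:
    """Return naive coverage as a percentage, rounded to one decimal place."""
    return round(100.0 * known / total, 1)

# (known words, tokenised words) exactly as printed by the runs above
runs = {
    "arg-cat corpus_narrative.txt": (337924, 378440),
    "arg-cat sentencelistanwiki.txt": (2344686, 2673751),
    "cat-arg ca.tagged.txt": (22919, 24590),
    "cat-arg ca.crp.txt": (497165, 567608),
}

coverage = {name: naive_coverage(k, t) for name, (k, t) in runs.items()}
for name, pct in coverage.items():
    print(f"{name}: {pct} %")
```

Each recomputed value agrees with the "Coverage:" line of the corresponding run.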

Translation Quality

cat-arg

$ ../apertium-eval-translator/apertium-eval-translator.pl -test MT.txt -ref postedit.txt
Test file: 'MT.txt'
Reference file 'postedit.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 1311
Number of words in test: 1315
Number of unknown words (marked with a star) in test: 156
Percentage of unknown words: 11.86 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 203
Word error rate (WER): 15.48 %
Number of position-independent correct words: 1132
Position-independent word error rate (PER): 13.96 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 254
Word Error Rate (WER): 19.37 %
Number of position-independent correct words: 1081
Position-independent word error rate (PER): 17.85 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 51
Percentage of unknown words that were free rides: 32.69 %
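
The rates reported above can be reproduced from the printed counts under the following reading, which is inferred from the numbers and not taken from the script's source: WER is the word-level edit distance divided by the number of reference words, and PER is (max(reference words, test words) - position-independent correct words) divided by the number of reference words. The Python sketch below implements both measures on token lists and then checks the arithmetic; all function names and the toy sentence pair are illustrative.

```python
from collections import Counter

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a test word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def pi_correct(ref, hyp):
    """Words shared by reference and test, ignoring position (multiset overlap)."""
    return sum((Counter(ref) & Counter(hyp)).values())

def wer(ref, hyp):
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def per(ref, hyp):
    # Assumed definition; it matches the figures reported above.
    return 100.0 * (max(len(ref), len(hyp)) - pi_correct(ref, hyp)) / len(ref)

def pct(part, whole):
    """Percentage rounded to two decimals, as in the tool output."""
    return round(100.0 * part / whole, 2)

# Toy pair (invented): same words, two of them swapped
ref = "el gat menja peix".split()
hyp = "el gat peix menja".split()
print(edit_distance(ref, hyp), wer(ref, hyp), per(ref, hyp))

# Arithmetic check against the counts printed above
REF_WORDS, TEST_WORDS = 1311, 1315
print(pct(203, REF_WORDS), pct(254, REF_WORDS))  # WER without / with stars
print(pct(max(REF_WORDS, TEST_WORDS) - 1132, REF_WORDS))  # PER without stars
print(pct(max(REF_WORDS, TEST_WORDS) - 1081, REF_WORDS))  # PER with stars
print(pct(156, TEST_WORDS), pct(51, 156))  # unknown words, free rides
```

On the toy pair the swap costs two edits for WER but nothing for PER, which is the difference between the two measures. In the reported figures, the gap between the two edit distances (254 - 203) and between the two correct-word counts (1132 - 1081) is 51 in both cases, matching the reported number of free rides.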