Difference between revisions of "Evaluation"
Revision as of 11:49, 26 October 2011
Evaluation can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the intended use of the language pair:
- how many words need to be changed before a text is publication-ready (Word-Error Rate, see Wikipedia on WER), here lower scores are better
- how many n-grams are common to the MT output and one or more reference translations (see Wikipedia on BLEU or NIST), here higher scores are better
- how well a user understands the message of the original text (this typically requires an experiment with real human subjects).
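As a rough illustration of the n-gram criterion in the list above, the following Python sketch counts which share of the n-grams in an MT output also occur in a single reference translation. This is only the clipped n-gram precision that BLEU builds on, not a full BLEU or NIST implementation (no brevity penalty, no averaging over several n-gram orders); the function names are made up for this example, and whitespace tokenisation is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference. Both arguments are plain strings, split on whitespace
    (an assumption of this sketch).
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total
```

For example, `clipped_ngram_precision("the cat sat on the mat", "the cat is on the mat", 1)` credits five of the six words in the MT output, and the same call with `n=2` credits three of its five bigrams.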
Using apertium-eval-translator for WER and PER
apertium-eval-translator is a script written in Perl. It calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by an Apertium-based MT system and its human-corrected translation at document level. Although it has been designed to evaluate Apertium-based systems, it can be easily adapted to evaluate other MT systems.
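To make the two measures concrete, here is a minimal Python sketch of WER and PER on lists of words. It is not the Perl script's code: the function names are invented for this example, both rates are normalised by the length of the reference (an assumption, the script may normalise differently), and PER follows one common bag-of-words formulation.

```python
from collections import Counter


def edit_distance(test, ref):
    """Word-level Levenshtein distance between two lists of words."""
    previous = list(range(len(ref) + 1))
    for i, t in enumerate(test, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if t == r else 1
            current.append(min(previous[j] + 1,          # drop a test word
                               current[j - 1] + 1,       # add a reference word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def wer(test, ref):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(test, ref) / len(ref)


def per(test, ref):
    """Position-independent error rate in percent (word order is ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(test) & Counter(ref)).values())
    errors = max(len(test), len(ref)) - matches
    return 100.0 * errors / len(ref)
```

With this formulation, a translation that contains exactly the reference words in a different order gets a PER of 0 but a WER above 0, which is why the two numbers are reported side by side.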
To use it, first translate a text with Apertium and save the output as MT.txt. Then manually post-edit a copy of that output so that it is understandable and grammatical, while avoiding major rewrites, and save the result as postedit.txt. Finally, run apertium-eval-translator -test MT.txt -ref postedit.txt; the script prints statistics, including WER and PER, that indicate how much post-editing the translation needed.
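If the evaluation step is run repeatedly, it can be wrapped in a small Python helper. The sketch below only assembles and launches the command line described above; it assumes that apertium-eval-translator is on the PATH, and the helper names `eval_command` and `run_evaluation` are hypothetical. The script's output is returned as text without being parsed, since its exact format is not assumed here.

```python
import subprocess


def eval_command(test_path, ref_path, beam=None):
    """Build the argument list for apertium-eval-translator.

    Only the -test, -ref and -beam options are used.
    """
    command = ["apertium-eval-translator", "-test", test_path, "-ref", ref_path]
    if beam is not None:
        command += ["-beam", str(beam)]
    return command


def run_evaluation(test_path, ref_path, beam=None):
    """Run the script and return whatever it prints on standard output.

    Assumes the script is installed and on the PATH; raises
    subprocess.CalledProcessError if it exits with an error.
    """
    result = subprocess.run(eval_command(test_path, ref_path, beam),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `print(run_evaluation("MT.txt", "postedit.txt"))` would then show the same numbers as running the command by hand.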
Detailed usage
apertium-eval-translator -test testfile -ref reffile [-beam <n>]
Options:
-test|-t Specify the file with the translation to evaluate
-ref|-r Specify the file with the reference translation
-beam|-b Perform a beam search by looking only at the <n> previous
and <n> posterior neighboring words (optional parameter
to make the evaluation much faster)
-help|-h Show this help message
-version|-v Show version information and exit
Note: The <n> value provided with -beam is language-pair dependent. The
closer the languages involved are, the smaller <n> can be without
affecting the evaluation results. This parameter only affects the WER
evaluation.
Note: Reference translation MUST have no unknown-word marks, even if
they are free rides.
This software calculates (at document level) the word error rate (WER)
and the position-independent word error rate (PER) between a translation
performed by the Apertium MT system and a reference translation obtained
by post-editing the system output.
It is assumed that unknown words are marked with a star (*), as Apertium
does; nevertheless, it can be easily adapted to evaluate other MT
systems that do not mark unknown words with a star.
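The -beam option described in the help text can be pictured as an edit distance that only compares each word of the translation with reference words at most <n> positions away. The Python sketch below is one reading of that description, not the script's actual code; the function name and the use of infinity for cells outside the band are choices made for this example.

```python
def banded_edit_distance(test, ref, beam):
    """Word-level edit distance restricted to a band of width `beam`.

    Word i of `test` is only aligned with words i - beam .. i + beam of
    `ref`. If the two lengths differ by more than `beam`, no alignment
    fits inside the band and the result is float("inf").
    """
    inf = float("inf")
    previous = [j if j <= beam else inf for j in range(len(ref) + 1)]
    for i in range(1, len(test) + 1):
        current = [inf] * (len(ref) + 1)
        low = max(0, i - beam)
        high = min(len(ref), i + beam)
        for j in range(low, high + 1):
            if j == 0:
                current[j] = i  # all test words so far are unmatched
                continue
            cost = 0 if test[i - 1] == ref[j - 1] else 1
            current[j] = min(previous[j] + 1,          # skip a test word
                             current[j - 1] + 1,       # skip a reference word
                             previous[j - 1] + cost)   # match or substitute
        previous = current
    return previous[-1]
```

A narrow band fills in far fewer cells, which is where the speed-up comes from, but it can overestimate the distance when words move further than the band allows: for "a b c" against "b c a" the sketch returns 3 with a band of 0 and the true distance of 2 with a band of 1. This matches the note that closely related languages, with little reordering, tolerate a smaller <n>.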
See English and Esperanto/Evaluation for an example. In Northern Sámi and Norwegian there is a Makefile to translate a set of source-language files and then run the evaluation on them.
Evaluating with Wikipedia
- Main article: Evaluating with Wikipedia