Difference between revisions of "Evaluation"

From Apertium
Revision as of 09:00, 23 April 2015

Evaluation can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the intended use of the language pair:

  • how many words need to be changed before a text is publication-ready (Word-Error Rate, see Wikipedia on WER), here lower scores are better
  • how many N-grams are common to the MT output and one or more reference translations (see Wikipedia on Bleu or NIST), here higher scores are better
  • how well a user understands the message of the original text (this typically requires an experiment with real human subjects, see Assimilation Evaluation Toolkit).
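As a minimal illustration of the first metric, WER is the word-level edit distance between the MT output and a reference translation, divided by the reference length. The following is a hypothetical sketch, not part of Apertium, and uses a naive whitespace tokenization:

```python
def wer(hyp, ref):
    """Word Error Rate: word-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    # Levenshtein distance over words, computed row by row.
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (hw != rw)))    # substitution
        prev = cur
    return prev[-1] / len(r)

# 1 inserted word out of 6 reference words:
print(wer("the cat sat on mat", "the cat sat on the mat"))
```

Lower is better: a perfect match scores 0, and every word that must be inserted, deleted, or substituted to reach the reference raises the score.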


Most released language pairs have had some evaluation, see Quality for a per-pair summary.


Using apertium-eval-translator for WER and PER

apertium-eval-translator is a script written in Perl. It calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by an Apertium-based MT system and its human-corrected translation at document level. Although it has been designed to evaluate Apertium-based systems, it can be easily adapted to evaluate other MT systems.

To use it, first translate a text with Apertium and save the output as MT.txt. Then manually post-edit that output until it is understandable and grammatical (avoiding major rewrites), and save the result as postedit.txt. Finally, run apertium-eval-translator -test MT.txt -ref postedit.txt; the script prints statistics indicating how much post-editing the translation required.

Detailed usage

    apertium-eval-translator -test testfile -ref reffile [-beam <n>]

    Options:

      -test|-t     Specify the file with the translation to evaluate 
      -ref|-r      Specify the file with the reference translation 
      -beam|-b     Perform a beam search by looking only to the <n> previous 
                   and <n> posterior neighboring words (optional parameter 
                   to make the evaluation much faster)
      -help|-h     Show this help message
      -version|-v  Show version information and exit

    Note: The <n> value provided with -beam is language-pair dependent. The
    closer the languages involved are, the smaller <n> can be without
    affecting the evaluation results. This parameter only affects the WER
    evaluation.

    Note: Reference translation MUST have no unknown-word marks, even if
    they are free rides.

    This software calculates (at document level) the word error rate (WER)
    and the position-independent word error rate (PER) between a translation
    performed by the Apertium MT system and a reference translation obtained
    by post-editing the system output.

    It is assumed that unknown words are marked with a star (*), as Apertium
    does; nevertheless, it can be easily adapted to evaluate other MT
    systems that do not mark unknown words with a star.

See English and Esperanto/Evaluation for an example. In Northern Sámi and Norwegian there is a Makefile to translate a set of source-language files and then run the evaluation on them.

dwdiff

If you just need a quick-and-dirty PER (position-independent WER) test, you can use dwdiff -s reference.txt MT_output.txt and look for % changed.
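The PER that dwdiff approximates compares the two texts as bags of words, ignoring word order. One common formulation can be sketched in Python as follows (a hypothetical illustration with naive whitespace tokenization, not the exact formula the Perl script uses):

```python
from collections import Counter

def per(hyp, ref):
    """Position-independent error rate: bag-of-words mismatch / reference length."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matches = sum((h & r).values())          # multiset intersection
    n_hyp, n_ref = sum(h.values()), sum(r.values())
    # Words missing from, or extra in, the hypothesis bag count as errors.
    return (max(n_hyp, n_ref) - matches) / n_ref

# Scrambled word order is not penalized, unlike WER:
print(per("on mat the cat sat", "the cat sat on the mat"))
```

Because order is ignored, PER is always less than or equal to WER for the same pair of texts.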

Pair bootstrap resampling
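Pair bootstrap resampling tests whether the score difference between two MT systems on the same test set is statistically significant: sentences are repeatedly resampled with replacement, and one counts how often each system wins on the resampled sets. A hypothetical Python sketch over per-sentence error counts (the function name and inputs are illustrative, not Apertium code):

```python
import random

def paired_bootstrap(errors_a, errors_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A has fewer
    total errors than system B (lower error counts are better)."""
    rng = random.Random(seed)
    n = len(errors_a)
    wins_a = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        if sum(errors_a[i] for i in idx) < sum(errors_b[i] for i in idx):
            wins_a += 1
    return wins_a / samples
```

If the returned fraction is at least 0.95, system A can be considered significantly better than system B at the 95% confidence level.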

Evaluating with Wikipedia

Main article: Evaluating with Wikipedia

See also

External links