Evaluation
Evaluation can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the intended use of the language pair:
- how many words need to be changed before a text is publication-ready (Word Error Rate; see Wikipedia on WER); here lower scores are better (see the sketch after this list)
- how many n-grams the MT output has in common with one or more reference translations (see Wikipedia on BLEU or NIST); here higher scores are better
- how well a user understands the message of the original text (this typically requires an experiment with real human subjects).
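For the WER case in particular, here is a minimal sketch (plain Python, not the Apertium tooling, and assuming simple whitespace tokenisation) of what the metric measures: the word-level edit distance between the MT output and a reference, divided by the reference length.

 # Minimal sketch of document-level WER, assuming whitespace tokenisation.
 # This illustrates the metric only; it is not the apertium-eval-translator code.
 
 def edit_distance(hyp, ref):
     """Word-level Levenshtein distance between two token lists."""
     prev = list(range(len(ref) + 1))
     for i, h in enumerate(hyp, start=1):
         cur = [i]
         for j, r in enumerate(ref, start=1):
             cur.append(min(prev[j] + 1,               # delete a hypothesis word
                            cur[j - 1] + 1,            # insert a reference word
                            prev[j - 1] + (h != r)))   # substitute (0 if equal)
         prev = cur
     return prev[-1]
 
 def wer(hyp_text, ref_text):
     hyp, ref = hyp_text.split(), ref_text.split()
     return edit_distance(hyp, ref) / len(ref)
 
 # One substitution out of six reference words -> WER = 1/6
 print(wer("the cat sits on the mat", "the cat sat on the mat"))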
Most released language pairs have had some evaluation; see Quality for a per-pair summary.
Using apertium-eval-translator for WER and PER
apertium-eval-translator is a script written in Perl. It calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by an Apertium-based MT system and its human-corrected translation at document level. Although it has been designed to evaluate Apertium-based systems, it can be easily adapted to evaluate other MT systems.
To use it, first translate a text with Apertium and save the output as MT.txt. Then manually post-edit that output until it is understandable and grammatical (avoiding major rewrites) and save the result as postedit.txt. Finally, run apertium-eval-translator -test MT.txt -ref postedit.txt; the numbers it prints indicate how much post-editing the translation needed.
Detailed usage
apertium-eval-translator -test testfile -ref reffile [-beam <n>]
Options:
-test|-t Specify the file with the translation to evaluate
-ref|-r Specify the file with the reference translation
-beam|-b Perform a beam search by looking only at the <n> previous
         and <n> posterior neighboring words (optional parameter
         to make the evaluation much faster)
-help|-h Show this help message
-version|-v Show version information and exit
Note: The <n> value provided with -beam is language-pair dependent. The
closer the languages involved are, the smaller <n> can be without
affecting the evaluation results. This parameter only affects the WER
evaluation.
Note: The reference translation MUST have no unknown-word marks, even if
they are free rides.
This software calculates (at document level) the word error rate (WER)
and the position-independent word error rate (PER) between a translation
performed by the Apertium MT system and a reference translation obtained
by post-editing the system output.
It is assumed that unknown words are marked with a star (*), as Apertium
does; nevertheless, it can be easily adapted to evaluate other MT
systems that do not mark unknown words with a star.
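As the note above says, the reference must be free of unknown-word stars before it is compared. If your post-edited file still contains them, something like the following (a hypothetical snippet with placeholder file names, not part of the tool) removes them first:

 # Hypothetical helper: strip Apertium's unknown-word stars from the
 # post-edited reference before evaluation. File names are placeholders.
 with open("postedit.txt", encoding="utf-8") as f:
     text = f.read().replace("*", "")
 with open("postedit.nostars.txt", "w", encoding="utf-8") as f:
     f.write(text)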
See English and Esperanto/Evaluation for an example. In Northern Sámi and Norwegian there is a Makefile to translate a set of source-language files and then run the evaluation on them.
dwdiff
If you just need a quick-and-dirty PER (position-independent WER) test, you can use dwdiff -s reference.txt MT_output.txt and look for % changed.
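For reference, PER ignores word order and only compares the multisets of words. Here is a rough sketch of one common formulation (plain Python, assuming whitespace tokenisation; the exact accounting in apertium-eval-translator or dwdiff may differ in details):

 # Rough sketch of position-independent word error rate (PER):
 # word order is ignored, only the multiset of words counts.
 from collections import Counter
 
 def per(hyp_text, ref_text):
     hyp, ref = hyp_text.split(), ref_text.split()
     matches = sum((Counter(hyp) & Counter(ref)).values())
     return (max(len(hyp), len(ref)) - matches) / len(ref)
 
 # The same words in a different order give PER = 0.0, while WER would be > 0.
 print(per("on the mat sat the cat", "the cat sat on the mat"))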
Pair bootstrap resampling
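The idea behind paired bootstrap resampling is to check whether a difference between two systems (for example two versions of a language pair) on the same test set is statistically meaningful: resample the test sentences with replacement many times and count how often each system wins on the chosen metric. Below is a minimal sketch, assuming you already have per-sentence error counts and reference lengths for both systems.

 # Minimal sketch of paired bootstrap resampling over per-sentence WER,
 # assuming per-sentence error counts (e.g. edit distances) and reference
 # lengths for two systems evaluated on the same test set.
 import random
 
 def paired_bootstrap(errors_a, errors_b, ref_lengths, samples=1000):
     """Fraction of resamples in which system A gets the lower WER."""
     n = len(ref_lengths)
     wins_a = 0
     for _ in range(samples):
         idx = [random.randrange(n) for _ in range(n)]  # sample sentences with replacement
         total_len = sum(ref_lengths[i] for i in idx)
         wer_a = sum(errors_a[i] for i in idx) / total_len
         wer_b = sum(errors_b[i] for i in idx) / total_len
         wins_a += wer_a < wer_b
     return wins_a / samples
 
 # A result close to 1.0 suggests system A really is better;
 # a result close to 0.5 means the difference is not significant.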
Evaluating with Wikipedia
- Main article: Evaluating with Wikipedia
See also
- Assimilation Evaluation Toolkit / Ideas for Google Summer of Code/Apertium assimilation evaluation toolkit
- Regression testing
- Quality control
- Calculating coverage