Difference between revisions of "Evaluation"
Popcorndude (talk | contribs) |
Revision as of 20:32, 23 July 2021
Evaluation can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the intended use of the language pair:
- how many words need to be changed before a text is publication-ready (Word Error Rate, see Wikipedia on WER); here lower scores are better
- how many word n-grams are common to the MT output and one or more reference translations (see Wikipedia on BLEU or NIST); here higher scores are better
- how many character n-grams are common to the MT output and a post-edit (the Fuzzy Match score, an unordered comparison with the Sørensen–Dice coefficient).[1]
- or the character n-gram F-score (code at https://github.com/Waino/chrF)
- how well a user understands the message of the original text (this typically requires an experiment with real human subjects, see Assimilation Evaluation Toolkit).
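To make the character n-gram comparison above concrete, here is a minimal Python sketch of a Sørensen–Dice score over character n-grams. It is an illustration only: the function names, the default n-gram length of 3 and the use of multisets rather than sets are assumptions, not the definition used by any particular Fuzzy Match tool.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a string (order is discarded)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_score(mt_output, post_edit, n=3):
    """Sørensen–Dice coefficient over character n-gram multisets, in [0, 1]."""
    a, b = char_ngrams(mt_output, n), char_ngrams(post_edit, n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if mt_output == post_edit else 0.0
    overlap = sum((a & b).values())  # n-grams shared by both strings
    return 2 * overlap / total
```

A score of 1.0 means the two strings contain exactly the same n-grams; 0.0 means they share none.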
Most released language pairs have had some evaluation; see Quality for a per-pair summary.
Using apertium-eval-translator for WER and PER
apertium-eval-translator.pl is a script written in Perl that calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation produced by an Apertium-based MT system and its human-corrected translation, at document level. Although it was designed to evaluate Apertium-based systems, it can easily be adapted to evaluate other MT systems.
To use it, first translate a text with apertium and save the output as MT.txt. Then manually post-edit that output so it is understandable and grammatical (while trying to avoid major rewrites) and save the result as postedit.txt. Finally, run apertium-eval-translator -test MT.txt -ref postedit.txt and you'll see a set of numbers indicating how good the translation was for post-editing.
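For readers who want to see what the two figures measure, the following Python sketch computes a word-level WER and PER for a pair of texts. It is not the code of apertium-eval-translator: the function names are invented for illustration, both scores are normalised by the reference length (an assumption), and PER uses one common bag-of-words formulation. Unknown-word stars are not handled.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete a hypothesis word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def wer(mt_text, ref_text):
    """Word error rate in percent, relative to the reference length."""
    hyp, ref = mt_text.split(), ref_text.split()
    if not ref:
        raise ValueError("reference text is empty")
    return 100.0 * edit_distance(hyp, ref) / len(ref)


def per(mt_text, ref_text):
    """Position-independent error rate in percent (word order is ignored)."""
    hyp, ref = mt_text.split(), ref_text.split()
    if not ref:
        raise ValueError("reference text is empty")
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return 100.0 * (max(len(hyp), len(ref)) - matches) / len(ref)
```

Because PER ignores word order, a translation with the right words in the wrong order scores 0% PER but a non-zero WER.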
If your text is fairly long (>10k words), the full WER calculation is quite slow. You can speed it up with the -b/-beam option, which makes WER take only N words of context into account. Be sure to make N large enough, though, otherwise you may get an artificially low or high WER. As an example, a 17k-word text that took nearly an hour for the full WER took a few seconds with -b 150 and gave the same result (19.69%), but with -b 5 it gave 73.54% and with -b 15 it gave 7.85%. So if you don't have time to do a full WER without -beam, there is a wrapper that increases the beam context N until the result seems to stabilise: beam-eval-until-stable -t testfile -r reffile.
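The idea of widening the beam until the score stops changing can be sketched in Python as follows. This is a hedged illustration, not the logic of beam-eval-until-stable or of the -beam option: it uses a banded edit distance (only word positions at most `beam` apart are compared), doubles the beam on each round, and stops when two consecutive scores differ by at most a tolerance. All names, the doubling schedule and the tolerance are assumptions, and the real script's beam heuristic may behave differently.

```python
def banded_edit_distance(hyp, ref, beam):
    """Word-level edit distance that only fills cells with |i - j| <= beam."""
    inf = float("inf")
    prev = [j if j <= beam else inf for j in range(len(ref) + 1)]
    for i in range(1, len(hyp) + 1):
        cur = [inf] * (len(ref) + 1)
        if i <= beam:
            cur[0] = i
        lo, hi = max(1, i - beam), min(len(ref), i + beam)
        for j in range(lo, hi + 1):
            sub = prev[j - 1] + (hyp[i - 1] != ref[j - 1])
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)
        prev = cur
    return prev[-1]  # inf if the length difference exceeds the beam


def wer_until_stable(hyp, ref, start=5, tolerance=0.01):
    """Double the beam until two consecutive WER values agree."""
    if not ref:
        raise ValueError("reference is empty")
    limit = max(len(hyp), len(ref))  # at this width the band covers everything
    beam, previous = min(start, limit), None
    while True:
        score = 100.0 * banded_edit_distance(hyp, ref, beam) / len(ref)
        if previous is not None and abs(score - previous) <= tolerance:
            return beam, score
        if beam >= limit:
            return beam, score
        previous, beam = score, min(beam * 2, limit)
```

Two equal consecutive scores do not guarantee that the full WER has been reached, so a stopping rule like this is a heuristic and is worth checking against a full run at least once per language pair.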
Detailed usage
apertium-eval-translator -test testfile -ref reffile [-beam <n>]

Options:
  -test|-t     Specify the file with the translation to evaluate
  -ref|-r      Specify the file with the reference translation
  -beam|-b     Perform a beam search by looking only at the <n> previous and <n> posterior neighboring words (optional parameter to make the evaluation much faster)
  -help|-h     Show this help message
  -version|-v  Show version information and exit

Note: The <n> value provided with -beam is language-pair dependent. The closer the languages involved are, the smaller <n> can be without affecting the evaluation results. This parameter only affects the WER evaluation.

Note: Reference translation MUST have no unknown-word marks, even if they are free rides.

This software calculates (at document level) the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by the Apertium MT system and a reference translation obtained by post-editing the system output. It is assumed that unknown words are marked with a star (*), as Apertium does; nevertheless, it can be easily adapted to evaluate other MT systems that do not mark unknown words with a star.
See English and Esperanto/Evaluation for an example. In Northern Sámi and Norwegian there is a Makefile to translate a set of source-language files and then run the evaluation on them.
dwdiff
If you just need a quick-and-dirty PER (position-independent WER) test, you can use dwdiff -s reference.txt MT_output.txt and look for % changed.
Pair bootstrap resampling
Detailed usage
bootstrap_resampling.pl -source srcfile -test testfile -ref reffile -times <n> -eval /full/path/to/eval/script

Options:
  -source|-s  Specify the file with the source file
  -test|-t    Specify the file with the translations to evaluate
  -ref|-r     Specify the file with the reference translations
  -times|-n   Specify how many times the resampling should be done
  -eval|-e    Specify the full path to the MT evaluation script
  -help|-h    Show this help message

Note: Reference translation MUST have no unknown-word marks, even if they are free rides.
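The page does not describe what bootstrap_resampling.pl does internally, so the following Python sketch only illustrates the general bootstrap resampling technique: sentences are drawn with replacement, the evaluation score is recomputed on each resampled set, and the spread of those scores gives a rough confidence interval. The function names, the `score_fn` callback and the percentile-based interval are all assumptions made for illustration.

```python
import random


def bootstrap_scores(test_sents, ref_sents, score_fn, times=1000, seed=0):
    """Recompute score_fn on `times` resampled sets; return sorted scores."""
    if len(test_sents) != len(ref_sents):
        raise ValueError("test and reference must have the same number of sentences")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n = len(test_sents)
    scores = []
    for _ in range(times):
        # Draw n sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(score_fn([test_sents[i] for i in idx],
                               [ref_sents[i] for i in idx]))
    return sorted(scores)


def confidence_interval(sorted_scores, level=0.95):
    """Percentile interval taken from an already sorted list of scores."""
    tail = (1.0 - level) / 2.0
    last = len(sorted_scores) - 1
    lo = sorted_scores[int(tail * len(sorted_scores))]
    hi = sorted_scores[min(last, int((1.0 - tail) * len(sorted_scores)))]
    return lo, hi
```

Here `score_fn` stands for any document-level metric that takes a list of translated sentences and a list of reference sentences, playing the role that the -eval script plays on the command line.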
Evaluating with Wikipedia
- Main article: Evaluating with Wikipedia
See also
- Assimilation Evaluation Toolkit / Ideas for Google Summer of Code/Apertium assimilation evaluation toolkit
- Apertium-regtest
- Quality control
- Calculating coverage
External links