Evaluation - Revision history

Unhammer at 10:23, 3 September 2024

2024-09-03T10:23:49Z

Unhammer: meteor

2024-09-03T10:17:42Z

meteor

Unhammer at 09:13, 21 November 2021

2021-11-21T09:13:21Z

Unhammer at 09:13, 21 November 2021

2021-11-21T09:13:04Z

Unhammer at 09:10, 21 November 2021

2021-11-21T09:10:10Z

Popcorndude: /* See also */

2021-07-23T20:32:33Z

Unhammer: /* Using apertium-eval-translator for WER and PER */

2018-11-16T10:44:49Z

Using apertium-eval-translator for WER and PER

Purplemoon at 16:32, 9 November 2018

2018-11-09T16:32:38Z

Xavivars: /* Using apertium-eval-translator for WER and PER */

2018-03-14T17:13:08Z

Using apertium-eval-translator for WER and PER

Unhammer at 11:45, 9 June 2017

2017-06-09T11:45:41Z

@@ Line 8: / Line 8: @@
 * how many word [[N-gram]]'s are common to the MT output and one or more reference translations (see [http://en.wikipedia.org/wiki/BLEU Wikipedia on Bleu], [https://en.wikipedia.org/wiki/METEOR Meteor] or [http://en.wikipedia.org/wiki/NIST_%28metric%29 NIST]), here '''higher scores are better'''
 * how many character [[N-gram]]'s are common to MT output and a post-edit (the Fuzzy Match score, an unordered comparison with the Sørensen–Dice coefficient).[http://amtaweb.org/wp-content/uploads/2015/10/MTSummitXV_ResearchTrack.pdf#page=138]
-** or the [http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412 Character N-gram F-score] (code at https://github.com/Waino/chrF)
+** or the [https://aclanthology.org/W15-3049/ Character N-gram F-score] (code at https://github.com/Waino/chrF)
 * how well a user ''understands'' the message of the original text (this typically requires an experiment with real human subjects, see [[Assimilation Evaluation Toolkit]] which lets you make gap-filling tests).
 * user-interviews to find the subjective experience of using the translator for their task (whether post-editing or gisting)
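
The character n-gram comparison in the list above can be illustrated with a small sketch. This is only an approximation of the idea (an unordered overlap of character n-grams scored with the Sørensen–Dice coefficient); the function names <code>char_ngrams</code> and <code>dice_score</code> and the default n-gram size of 3 are assumptions made for illustration, not taken from the cited paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a multiset (Counter) of the character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_score(mt_output, post_edit, n=3):
    """Sorensen-Dice coefficient over character n-gram multisets.

    Hypothetical helper: 2 * |A & B| / (|A| + |B|), ignoring n-gram order.
    """
    a = char_ngrams(mt_output, n)
    b = char_ngrams(post_edit, n)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if mt_output == post_edit else 0.0
    overlap = sum((a & b).values())  # multiset intersection
    return 2 * overlap / total


if __name__ == "__main__":
    print(dice_score("the cat sat on mat", "the cat sat on the mat"))
```

Higher values mean the MT output and the post-edit share more character n-grams; identical strings score 1.0.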

@@ Line 6: / Line 6: @@
 * how many words need to be changed before a text is publication-ready (Word-Error Rate, see [http://en.wikipedia.org/wiki/Word_error_rate Wikipedia on WER]), here '''lower scores are better'''
-* how many word [[N-gram]]'s are common to the MT output and one or more reference translations (see [http://en.wikipedia.org/wiki/BLEU Wikipedia on Bleu] or [http://en.wikipedia.org/wiki/NIST_%28metric%29 NIST]), here '''higher scores are better'''
+* how many word [[N-gram]]'s are common to the MT output and one or more reference translations (see [http://en.wikipedia.org/wiki/BLEU Wikipedia on Bleu], [https://en.wikipedia.org/wiki/METEOR Meteor] or [http://en.wikipedia.org/wiki/NIST_%28metric%29 NIST]), here '''higher scores are better'''
 * how many character [[N-gram]]'s are common to MT output and a post-edit (the Fuzzy Match score, an unordered comparison with the Sørensen–Dice coefficient).[http://amtaweb.org/wp-content/uploads/2015/10/MTSummitXV_ResearchTrack.pdf#page=138]
 ** or the [http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412 Character N-gram F-score] (code at https://github.com/Waino/chrF)

@@ Line 3: / Line 3: @@
 Most evaluations focus on numerical metrics like WER, which make the most sense when done on fresh post-edits. WER gives less information when run on pretranslated text (due to multiple possible translations, and MT shaping the translation output). WER is quite far from the task of ''gisting'', where cloze-like tests may be a more useful metric, and interviews may give more useful information. Beware of overfitting and [https://www.nngroup.com/articles/campbells-law/ Campbell's law].
-Common evaluation measures are:
+Common evaluation measures and methods are:
 * how many words need to be changed before a text is publication-ready (Word-Error Rate, see [http://en.wikipedia.org/wiki/Word_error_rate Wikipedia on WER]), here '''lower scores are better'''

@@ Line 2: / Line 2: @@
 Most evaluations focus on numerical metrics like WER, which make the most sense when done on fresh post-edits. WER gives less information when run on pretranslated text (due to multiple possible translations, and MT shaping the translation output). WER is quite far from the task of ''gisting'', where cloze-like tests may be a more useful metric, and interviews may give more useful information. Beware of overfitting and [https://www.nngroup.com/articles/campbells-law/ Campbell's law].
+Common evaluation measures are:
 * how many words need to be changed before a text is publication-ready (Word-Error Rate, see [http://en.wikipedia.org/wiki/Word_error_rate Wikipedia on WER]), here '''lower scores are better'''
@@ Line 7: / Line 9: @@
 * how many character [[N-gram]]'s are common to MT output and a post-edit (the Fuzzy Match score, an unordered comparison with the Sørensen–Dice coefficient).[http://amtaweb.org/wp-content/uploads/2015/10/MTSummitXV_ResearchTrack.pdf#page=138]
 ** or the [http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412 Character N-gram F-score] (code at https://github.com/Waino/chrF)
-* how well a user ''understands'' the message of the original text (this typically requires an experiment with real human subjects, see [[Assimilation Evaluation Toolkit]]).
+* how well a user ''understands'' the message of the original text (this typically requires an experiment with real human subjects, see [[Assimilation Evaluation Toolkit]] which lets you make gap-filling tests).
+* user-interviews to find the subjective experience of using the translator for their task (whether post-editing or gisting)

@@ Line 1: / Line 1: @@
-'''Evaluation''' can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the [[Assimilation and dissemination|intended use of the language pair]]:
+'''Evaluation''' can give you some idea as to how well a language pair works in practice. There are many ways to evaluate, and the test chosen should depend on the [[Assimilation and dissemination|intended use of the language pair]].
+Most evaluations focus on numerical metrics like WER, which make the most sense when done on fresh post-edits. WER gives less information when run on pretranslated text (due to multiple possible translations, and MT shaping the translation output). WER is quite far from the task of ''gisting'', where cloze-like tests may be a more useful metric, and interviews may give more useful information. Beware of overfitting and [https://www.nngroup.com/articles/campbells-law/ Campbell's law].
 * how many words need to be changed before a text is publication-ready (Word-Error Rate, see [http://en.wikipedia.org/wiki/Word_error_rate Wikipedia on WER]), here '''lower scores are better'''
@@ Line 6: / Line 8: @@
 ** or the [http://www.aclweb.org/anthology/W/W15/W15-30.pdf#page=412 Character N-gram F-score] (code at https://github.com/Waino/chrF)
 * how well a user ''understands'' the message of the original text (this typically requires an experiment with real human subjects, see [[Assimilation Evaluation Toolkit]]).

@@ Line 85: / Line 85: @@
 ==See also==
 * [[Assimilation Evaluation Toolkit]] / [[Ideas for Google Summer of Code/Apertium assimilation evaluation toolkit]]
-* [[Regression testing]]
+* [[Apertium-regtest]]
 * [[Quality control]]
 * [[Calculating coverage]]

@@ Line 15: / Line 15: @@
 ==Using apertium-eval-translator for WER and PER==
-[http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eval-translator/ apertium-eval-translator] is a script written in Perl. It calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by an Apertium-based MT system and its human-corrected translation at document level. Although it has been designed to evaluate Apertium-based systems, it can be easily adapted to evaluate other MT systems.
+[https://github.com/apertium/apertium-eval-translator apertium-eval-translator.pl] is a script written in Perl that calculates the word error rate (WER) and the position-independent word error rate (PER) between a translation performed by an Apertium-based MT system and its human-corrected translation at document level. Although it has been designed to evaluate Apertium-based systems, it can be easily adapted to evaluate other MT systems.
 To use it, first translate a text with apertium, save that into <code>MT.txt</code>, then manually post-edit that so it looks understandable and grammatical (but trying to avoid major rewrites), save that into <code>postedit.txt</code>. Then run <code>apertium-eval-translator -test MT.txt -ref postedit.txt</code> and you'll see a bunch of numbers indicating how good the translation was, for post-editing.
+If your text is fairly long (>10k words), the full WER calculation is quite slow. You can speed it up with the <code>-b</code>/<code>-beam</code> option, which makes WER take only N words of context into account. Be sure to make N large enough, though, otherwise you may get an artificially low or high WER. As an example, a 17k-word text that took nearly an hour for the full WER took a few seconds with <code>-b 150</code> and gave the same result (19.69%), but <code>-b 5</code> gave 73.54% and <code>-b 15</code> gave 7.85%. If you don't have time to run a full WER without <code>-beam</code>, there is a wrapper that increases the beam context N until the result seems to stabilise: <code>beam-eval-until-stable -t testfile -r reffile</code>.
 ===Detailed usage===
@@ Line 28: / Line 31: @@
       -ref|-r      Specify the file with the reference translation
       -beam|-b     Perform a beam search by looking only to the <n> previous
-                   and <n> posterior neigboring words (optional parameter
+                   and <n> posterior neighboring words (optional parameter
                    to make the evaluation much faster)
       -help|-h     Show this help message
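
As a rough illustration of the two measures named in this section, the following sketch computes WER as a word-level edit distance divided by the reference length, and PER as a bag-of-words mismatch that ignores word order. It is not the Perl script's code: the function names are hypothetical, the PER normalisation is one common definition and may differ from what apertium-eval-translator reports, and the beam option is not modelled.

```python
from collections import Counter


def edit_distance(test, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, t in enumerate(test, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete a test word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (t != r)))  # substitute or match
        prev = cur
    return prev[-1]


def wer(test, ref):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(test, ref) / len(ref)


def per(test, ref):
    """Position-independent error rate (assumed definition, order ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(test) & Counter(ref)).values())
    return (max(len(test), len(ref)) - matches) / len(ref)


if __name__ == "__main__":
    mt = "the cat sat on mat".split()
    postedit = "the cat sat on the mat".split()
    print(f"WER: {wer(mt, postedit):.2%}  PER: {per(mt, postedit):.2%}")
```

Because PER ignores word order, a translation with the right words in the wrong order gets a PER of 0 but a non-zero WER.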

@@ Line 57: / Line 57: @@
 ==Pair bootstrap resampling==
+===Detailed usage===
+<pre>
+bootstrap_resampling.pl -source srcfile -test testfile -ref
+reffile -times <n> -eval /full/path/to/eval/script
+Options:
+  -source|-s   Specify the file with the source file
+  -test|-t     Specify the file with the translations to evaluate
+  -ref|-r      Specify the file with the reference translations
+  -times|-n    Specify how many times the resampling should be done
+  -eval|-e     Specify the full path to the MT evaluation script
+  -help|-h     Show this help message
+Note: Reference translation MUST have no unknown-word marks, even if
+      they are free rides.
+</pre>
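
The general idea of bootstrap resampling can be sketched as follows: aligned test/reference sentence pairs are drawn with replacement, the evaluation metric is recomputed on each resampled set, and the spread of the scores gives a rough confidence interval. This is a minimal sketch under stated assumptions, not the logic of <code>bootstrap_resampling.pl</code>: all function names are hypothetical, and the toy metric <code>sentence_accuracy</code> merely stands in for the external script passed with <code>-eval</code>.

```python
import random


def sentence_accuracy(test_sents, ref_sents):
    """Toy metric (assumption): share of sentences identical to the reference."""
    same = sum(t == r for t, r in zip(test_sents, ref_sents))
    return same / len(ref_sents)


def bootstrap_scores(test_sents, ref_sents, metric, times=1000, seed=0):
    """Resample aligned sentence pairs with replacement and score each sample."""
    if len(test_sents) != len(ref_sents) or not ref_sents:
        raise ValueError("need equally long, non-empty sentence lists")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n = len(ref_sents)
    scores = []
    for _ in range(times):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([test_sents[i] for i in idx],
                             [ref_sents[i] for i in idx]))
    return sorted(scores)


def percentile_interval(sorted_scores, level=0.95):
    """Return the (lower, upper) percentile bounds of the sorted scores."""
    k = len(sorted_scores)
    lo = int((1 - level) / 2 * k)
    hi = min(k - 1, int((1 + level) / 2 * k))
    return sorted_scores[lo], sorted_scores[hi]


if __name__ == "__main__":
    test = ["a b", "c d", "e f", "g h"]
    ref = ["a b", "c x", "e f", "g h"]
    scores = bootstrap_scores(test, ref, sentence_accuracy, times=200)
    print(percentile_interval(scores))
```

The <code>times</code> parameter plays the same role as <code>-times</code> in the usage text above; more resamples give a smoother interval at the cost of more metric evaluations.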