Translation quality statistics
This page aims to give an overview of the ''quality'' of various translators available in the Apertium platform.


The metrics used here don't tell the whole story – see [[Evaluation]] for some discussion of strengths and limitations of WER/BLEU.


==Quality Evaluations==
These measures are used in the table below:
* [[wikipedia:Word error rate|Word Error Rate]] (WER) and Position-independent Word Error Rate (PWER) are measures of post-edition effort. The number gives the expected number of words needed to be corrected in 100 words of running text. So, a WER of 4.7% indicates that in a given 100 words of text, 4.7 of them will need to be corrected by the post-editor – '''for WER, lower is better'''.
* [[wikipedia:BLEU|Bilingual Evaluation Understudy]] (BLEU) varies from 0 (bad) to 1 (perfect), so '''for BLEU, higher is better'''.

Precise numbers may vary due to differences in how sentences are selected for evaluation. In some pairs, unknown words may be taken into account, in others not. Evaluations where unknown words are allowed will likely give more accurate numbers for post-edition effort, provided the corpus on which the evaluation was made resembles the corpus on which further translations will be made. Evaluations that exclude unknown words give a better indication of the "best-case" performance of the transfer rules.
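To make these definitions concrete, here is a minimal sketch of how WER and PWER can be computed. This is not Apertium's evaluation tooling (see [[Evaluation]] for the actual procedures); the function names and the bag-of-words PWER formulation are illustrative assumptions.

```python
from collections import Counter

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i hypothesis words
    # into the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(h)][len(r)] / len(r)

def pwer(reference: str, hypothesis: str) -> float:
    """Position-independent WER: word order is ignored, so only the
    multisets (bags) of words are compared (one common formulation)."""
    r, h = Counter(reference.split()), Counter(hypothesis.split())
    errors = max(sum((r - h).values()), sum((h - r).values()))
    return errors / sum(r.values())

# A reordering counts against WER but not against PWER:
ref = "the cat sat on the mat"
print(wer(ref, "the cat sat on mat"))       # one missing word: 1/6
print(pwer(ref, "on the mat sat the cat"))  # same bag of words: 0.0
```

Note that PWER is always at most WER, since ignoring word order can only reduce the number of mismatches.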

{|class="wikitable sortable"
! Translator !! Date !! Version !! Direction !! Unknown<br/>words !! data-sort-type="number"|WER !! data-sort-type="number"|PWER !! data-sort-type="number"|BLEU !! Reference / Notes
|-
|rowspan=2| <code>[[apertium-eo-fr]]</code> ||rowspan=2|11th&nbsp;February&nbsp;2011 ||rowspan=2| || fr → eo ||rowspan=2 {{yes}} || 22.4 % || 20.6 % || - ||rowspan=2| [[French_and_Esperanto/Quality_tests]]
|-
| eo → fr || - || - || -
|-
|rowspan=2| <code>[[apertium-mk-en]]</code> ||rowspan=2|19th&nbsp;September&nbsp;2010||rowspan=2| 0.1.0 || mk → en ||rowspan=2 {{no}} || 43.96% || 31.22% || - ||rowspan=2| Percentage is average of 1,000 words from SETimes and 1,000 from Wikipedia
|-
| en → mk || - || - || -
|-
|rowspan=2| <code>[[apertium-mk-bg]]</code> ||rowspan=2|31st&nbsp;August&nbsp;2010||rowspan=2| 0.1.0 || mk → bg ||rowspan=2 {{yes}} || 26.67 % || 25.39 % || - ||rowspan=2| -
|-
| bg → mk || - || - || -
|-
|rowspan=2| <code>[[apertium-nno-nob]]</code> ||rowspan=2|12th&nbsp;October&nbsp;2009||rowspan=2| 0.6.1 || nno → nob ||rowspan=2 {{yes}} || - || - || - ||rowspan=2| Unhammer and Trosterud, 2009<br/> (two reference translations). '''2021 Update''': As of [https://github.com/apertium/apertium-nno-nob/commit/02ea2f196848e323b0898e9225334182968e7894 Nov 2021] the [https://github.com/apertium/apertium-nno-nob/blob/master/freerbmt09_nnnb/eval/500.apertium.testset same test set] gives 0.862±0.011. '''2024 Update''': As of June 2024, apertium-nno-nob from git used to translate fresh news articles had WER below 0.6 % (so on average, every 166th word needed altering).
|-
| nob → nno || 32.5%, 17.7% || - || 0.74
|-
|rowspan=2| <code>[[apertium-br-fr]]</code> ||rowspan=2| March&nbsp;2010 ||rowspan=2| 0.2.0 || br → fr ||rowspan=2 {{no}} || 38 % || 22 % || - ||rowspan=2| Tyers, 2010
|-
| fr → br || - || - || -
|-
|rowspan=2| <code>[[apertium-sv-da]]</code> ||rowspan=2|12th&nbsp;October&nbsp;2009 ||rowspan=2| 0.5.0 || sv → da ||rowspan=2 {{yes}} || 30.3 % || 27.7 % || - ||rowspan=2| [http://wiki.apertium.org/w/index.php?title=Swedish_and_Danish/Evaluation&oldid=14881 Swedish_and_Danish/Evaluation]
|-
| da → sv || - || - || -
|-
|rowspan=2| <code>[[apertium-eu-es]]</code> ||rowspan=2|2nd&nbsp;September&nbsp;2009 ||rowspan=2| || eu → es ||rowspan=2 {{unknown}} || 72.4 % || 39.8 % || - ||rowspan=2| Ginestí-Rosell et al., 2009
|-
| es → eu || - || - || -
|-
|rowspan=2| <code>[[apertium-cy-en]]</code> ||rowspan=2|2nd&nbsp;January&nbsp;2009 ||rowspan=2| || cy → en ||rowspan=2 {{unknown}} || 55.7 % || 30.5 % || - ||rowspan=2| Tyers and Donnelly, 2009
|-
| en → cy || - || - || -
|-
|rowspan=2| <code>[[apertium-eo-en]]</code> ||rowspan=2|8th&nbsp;May&nbsp;2009 ||rowspan=2| 0.9.0 || en → eo ||rowspan=2 {{unknown}} || 21.0 % || 19.0 % || - ||rowspan=2| [http://wiki.apertium.org/w/index.php?title=English_and_Esperanto/Evaluation&oldid=12418 English_and_Esperanto/Evaluation]
|-
| eo → en || - || - || -
|-
|rowspan=2| <code>[[apertium-es-pt]]</code> ||rowspan=2|15th&nbsp;May&nbsp;2006 ||rowspan=2| || es → pt ||rowspan=2 {{unknown}} || 4.7 % || - || - ||rowspan=2| Armentano et al., 2006
|-
| pt → es || 11.3 % || - || -
|-
|rowspan=2| <code>[[apertium-oc-ca]]</code> ||rowspan=2|10th&nbsp;May&nbsp;2006 ||rowspan=2| || oc → ca ||rowspan=2 {{unknown}} || 9.6 % || - || - ||rowspan=2| Armentano and Forcada, 2006
|-
| ca → oc || - || - || -
|-
|rowspan=2| <code>[[apertium-pt-ca]]</code> ||rowspan=2| 28th&nbsp;July&nbsp;2008 ||rowspan=2| || pt → ca ||rowspan=2 {{unknown}} || 16.6% || - || - ||rowspan=2| Armentano and Forcada, 2008
|-
| ca → pt || 14.1% || - || -
|-
|rowspan=2| <code>[[apertium-en-es]]</code> ||rowspan=2| May&nbsp;2009 ||rowspan=2| || en → es ||rowspan=2 {{unknown}} || - || - || 0.1851 ||rowspan=2| Sánchez-Martínez, 2009
|-
| es → en || - || - || 0.1881
|}


== Coverage and Dictionary size ==
The number of entries in a dictionary, as well as the number of corpus forms that get some analysis, may give an indication of the maturity of a language pair.

For most dictionaries, the wiki has at least dictionary-size numbers, and some pairs also have coverage statistics; see [[:Category:Datastats]]. The stats for a given Apertium package are on a page named after that package (the name of its repository on GitHub) followed by "/stats", e.g. [[apertium-es-ca/stats]]. Some language pairs are split into several packages: for nno-nob there are pages [[apertium-nno]], [[apertium-nob]] and [[apertium-nno-nob]], and for dictionary counts you should consult the last one.

(Stats pages currently do not show number of CG or transfer rules.)
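As an illustration of what "coverage" means here, the following sketch computes naive coverage over a list of analysed tokens. It assumes Apertium's convention of marking unknown surface forms with a leading '*', and the token list shown is a simplified stand-in for real analyser output.

```python
def naive_coverage(analyses):
    """Share of corpus tokens that received some analysis.
    Assumes unknown words are marked with a leading '*'
    (Apertium convention); input is a simplified token list."""
    known = sum(1 for a in analyses if not a.startswith("*"))
    return known / len(analyses)

# Two analysed tokens and one unknown word: coverage 2/3
print(naive_coverage(["the<det>", "*blarg", "cat<n>"]))
```

Coverage computed this way says nothing about whether the analyses are correct, only that the word was found in the dictionary, which is why it is at best a rough maturity indicator.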


== Usage Stats ==
The [http://apertium.projectjj.com/piwik/index.php?module=API&method=Events.getName&format=RSS&idSite=1&period=year&date=last10&expanded=1&translateColumnNames=1&language=ca&token_auth=df589f4df94838a3793384ec2f7e11d9&filter_limit=100 apertium.org usage stats] give some indication of which pairs have the most users, which in turn might say something about quality. However, there are various reasons why a pair might see a lot or a little use:
* some pairs are only offered for free by Apertium (e.g. nob-nno)
* for some pairs, there are very few speakers of one of the languages (though the pair itself may have high quality)
* and, of course, some pairs simply have very good quality (e.g. spa-cat)


==References==

[[Category:Evaluation]]
[[Category:Documentation in English]]

Latest revision as of 10:23, 3 June 2024
