Apertium language modules and translation pairs are subject to the following types of evaluation:
- Morphology coverage / regression testing
- Size of system
  - Number of stems in lexc, monodix, bidix (see the counting sketch after this list)
  - Number of disambiguation rules
  - Number of lexical selection rules
  - Number of transfer rules
- Naïve coverage
  - Monolingual naïve coverage
  - Trimmed naïve coverage (i.e., using a trimmed dictionary)
- Accuracy of analyser
- Accuracy of translation
- Overall accuracy (over parallel corpora): WER/PER/BLEU
- Regression tests (pairs of phrases or sentences)
- Cleanliness of translation output
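For the size-of-system counts, entry counting over the XML dictionaries is straightforward. A minimal sketch, counting `<e>` entries in a monodix or bidix (entry counts only approximate stem counts, and lexc lexicons would need separate line-based handling):

```python
#!/usr/bin/env python3
"""Count <e> entries in an Apertium .dix file (monodix or bidix)."""
import sys
import xml.etree.ElementTree as ET

def count_entries(dix_path):
    # Entries live in <section> elements under the root <dictionary>.
    tree = ET.parse(dix_path)
    return sum(len(section.findall("e"))
               for section in tree.getroot().iter("section"))

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}\t{count_entries(path)} entries")
```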
Two complaints about the existing test scripts: they don't support directionality restrictions on tests, and they don't return error codes.
In theory, aq-covtest addresses this, but in practice most people write their own scripts.
A generalised script that supports both hfst and lttoolbox binaries and arbitrary corpora would be useful. It should also (optionally) output hitparades, i.e., frequency lists of unknown forms in the corpus.
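A rough sketch of what such a script might look like, assuming `lt-proc` for .bin analysers and `hfst-proc` for .hfst analysers (both emit the Apertium stream format, with unknown forms marked by `*`); the optional coverage threshold gives it a meaningful exit code:

```python
#!/usr/bin/env python3
"""Naive coverage over an arbitrary corpus, with an unknown-form hitparade.

A sketch, not an official Apertium tool: assumes lt-proc for .bin
analysers and hfst-proc for .hfst analysers, both of which emit the
Apertium stream format ^surface/analysis$ (unknowns as ^word/*word$).
"""
import re
import subprocess
import sys
from collections import Counter

LU = re.compile(r"\^([^/^$]+)/([^$]+)\$")  # one lexical unit in the stream

def analyse(analyser, corpus_path):
    tool = "hfst-proc" if analyser.endswith(".hfst") else "lt-proc"
    with open(corpus_path, encoding="utf-8") as corpus:
        return subprocess.run([tool, analyser], stdin=corpus,
                              capture_output=True, text=True,
                              check=True).stdout

def coverage(stream, top=20):
    total, unknown = 0, Counter()
    for match in LU.finditer(stream):
        total += 1
        surface, analyses = match.groups()
        if analyses.startswith("*"):       # analyser found no reading
            unknown[surface.lower()] += 1
    known = total - sum(unknown.values())
    print(f"coverage: {known}/{total} = {known / total:.2%}")
    for form, freq in unknown.most_common(top):   # the hitparade
        print(f"{freq}\t{form}")
    return known / total

if __name__ == "__main__":
    cov = coverage(analyse(sys.argv[1], sys.argv[2]))
    if len(sys.argv) > 3 and cov < float(sys.argv[3]):
        sys.exit(1)   # error code for CI when coverage drops below a threshold
```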
- WER/PER: apertium-eval-translator.pl and apertium-eval-translator-line.pl work well but are somewhat dated, and could probably benefit from being rewritten in Python (a minimal sketch of the metrics follows this list)
- BLEU: no existing tool (though an off-the-shelf library would do; see the example below)
- Regression testing: we seem to have only some one-off scripts for this (a sketch of a proper runner follows below)
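For WER/PER, a Python rewrite is not much code. A minimal sketch of both metrics (no punctuation normalisation or per-sentence report, so not a drop-in replacement for the Perl scripts):

```python
#!/usr/bin/env python3
"""Word Error Rate (WER) and Position-independent Error Rate (PER)."""
from collections import Counter

def wer(ref, hyp):
    """Levenshtein distance over tokens, divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))    # edit distances for the empty prefix
    for i, rtok in enumerate(r, 1):
        cur = [i] + [0] * len(h)
        for j, htok in enumerate(h, 1):
            cur[j] = min(prev[j] + 1,                      # deletion
                         cur[j - 1] + 1,                   # insertion
                         prev[j - 1] + (rtok != htok))     # substitution
        prev = cur
    return prev[len(h)] / len(r)

def per(ref, hyp):
    """One common definition of PER: bag-of-words overlap, order ignored."""
    r, h = ref.split(), hyp.split()
    correct = sum((Counter(r) & Counter(h)).values())
    return 1 - (correct - max(0, len(h) - len(r))) / len(r)

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat in the mat"
    print(f"WER = {wer(ref, hyp):.2%}, PER = {per(ref, hyp):.2%}")
```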
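For BLEU, rather than writing a scorer from scratch, a third-party library such as sacrebleu (an assumption here; it is not part of Apertium) fills the gap:

```python
# pip install sacrebleu  (third-party package, not shipped with Apertium)
import sacrebleu

hypotheses = ["the cat sat on the mat"]    # system output, one line per sentence
references = [["the cat is on the mat"]]   # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")          # score is on a 0-100 scale
```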
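For regression testing, a small runner along these lines would also fix the two complaints above: it supports directionality restrictions and exits nonzero on failure. The tab-separated test-file format is invented for this example, not an Apertium convention:

```python
#!/usr/bin/env python3
"""Tiny regression-test runner for a translation pair.

Each test is one tab-separated line: direction<TAB>input<TAB>expected,
e.g. "eng-spa\thello\thola" (a made-up format for this sketch).
"""
import subprocess
import sys

def translate(direction, text):
    """Run `apertium <direction>` on one line of text."""
    out = subprocess.run(["apertium", direction], input=text,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def run_tests(path, only_direction=None):
    failures = 0
    with open(path, encoding="utf-8") as tests:
        for line in tests:
            direction, source, expected = line.rstrip("\n").split("\t")
            if only_direction and direction != only_direction:
                continue   # directionality restriction
            got = translate(direction, source)
            if got != expected:
                failures += 1
                print(f"FAIL [{direction}] {source!r}: {got!r} != {expected!r}")
    return failures

if __name__ == "__main__":
    test_file = sys.argv[1]
    direction = sys.argv[2] if len(sys.argv) > 2 else None
    sys.exit(1 if run_tests(test_file, direction) else 0)
```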
There are several ways to test translation cleanliness, each suited to a different purpose:
- morphology expansion testvoc ("standard testvoc")
- prefixed morphology expansion testvoc ("testvoc lite")
- corpus testvoc
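A corpus testvoc, for instance, reduces to scanning translated output for the standard debug markers: `*` (unknown word), `@` (entry missing from the bidix), `#` (form the generator cannot produce). A minimal sketch, assuming the corpus has already been translated (e.g. `apertium eng-spa < corpus.txt > output.txt`):

```python
#!/usr/bin/env python3
"""Corpus testvoc sketch: scan translated output for dirty symbols."""
import re
import sys
from collections import Counter

MARKERS = {"*": "unknown word", "@": "missing in bidix", "#": "cannot generate"}

def scan(path):
    hits = Counter()
    with open(path, encoding="utf-8") as output:
        for line in output:
            for marker, form in re.findall(r"([*@#])([^\s,.;:!?]+)", line):
                hits[(marker, form)] += 1
    for (marker, form), freq in hits.most_common():
        print(f"{freq}\t{marker}{form}\t({MARKERS[marker]})")
    return sum(hits.values())

if __name__ == "__main__":
    sys.exit(1 if scan(sys.argv[1]) else 0)   # nonzero exit when output is dirty
```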