Apertium language modules and translation pairs are subject to the following types of evaluation:
- Morphology coverage / regression testing
- Size of system
  - Number of stems in lexc, monodix, bidix
  - Number of disambiguation rules
  - Number of lexical selection rules
  - Number of transfer rules
- Naïve coverage
  - Monolingual naïve coverage
  - Trimmed naïve coverage (i.e., using a trimmed dictionary)
- Accuracy of analyses
- Accuracy of translation
- Cleanliness of translation output
Two complaints about the existing testing tools: they do not support directionality restrictions on tests, and they do not return error codes.
In theory, aq-covtest calculates naïve coverage, but in practice most people write their own scripts.
A generalised script that supports both hfst and lttoolbox binaries and arbitrary corpora would be useful. It should also (optionally) output hitparades (i.e., frequency lists of unknown forms in the corpus).
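As a rough illustration of what such a script might compute, here is a minimal sketch in Python. It assumes the corpus has already been run through an analyser (lt-proc or hfst-proc) producing Apertium stream format, in which an unknown form appears as `^form/*form$`. The function name `naive_coverage` and the sample text are hypothetical, and the regular expression is a simplification that does not handle escaped characters.

```python
import re
from collections import Counter

# Matches one lexical unit such as ^surface/analysis1/analysis2$.
# Simplification (assumption): escaped characters (\^, \/, \$) are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return (coverage, hitparade) for analyser output in Apertium stream format.

    coverage is the fraction of lexical units with at least one analysis;
    hitparade is a list of (unknown form, frequency), most frequent first.
    """
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis beginning with "*" marks the form as unknown.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return coverage, unknown.most_common()


# Hypothetical analyser output, for illustration only.
sample = "^the/the<det><def><sp>$ ^blorf/*blorf$ ^cats/cat<n><pl>$ ^blorf/*blorf$"
coverage, hitparade = naive_coverage(sample)
print(f"Naive coverage: {coverage:.2%}")
for form, freq in hitparade:
    print(freq, form)
```

Because the function only reads analysed text, the same code would work regardless of whether the analysis came from an hfst or an lttoolbox binary; invoking the analyser and reading the corpus are left out of this sketch.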
There are several ways to test translation cleanliness that are good for different purposes:
- morphology expansion testvoc ("standard testvoc")
- prefixed morphology expansion testvoc ("testvoc lite")
- corpus testvoc