- 1 Introduction
- 2 Prerequisites
- 3 A typical use case/walkthrough — apertium-mt-he
- 3.1 Getting the goodies
- 3.2 Herein lies the testing
- 3.3 What now?
- 3.4 Conclusion
Introduction
The Apertium Quality Control Framework, or `apertium-quality`, is a framework and toolkit for unit testing Apertium dictionaries and, to a lesser extent, HFST finite state transducers, and for recording detailed statistics for later analysis. The toolkit portion of the project provides tools for regression testing, coverage testing, vocabulary testing and statistics storage, while the framework is robust enough to allow extensive modification through standard interfaces and detailed XML schemas.
Prerequisites
You must have, at the very least, the following applications and libraries installed:
- Python >= 3.1
- Apertium >= 3.2
Assuming Debian or Ubuntu, you can most likely get these by running:
apt-get install automake autoconf git apertium python3 python3-lxml
A typical use case/walkthrough — apertium-mt-he
Getting the goodies
First of all, we need to actually install `apertium-quality`, so in the terminal, enter:
git clone https://github.com/apertium/apertium-quality
It may ask you to accept a security certificate. Feel free to accept permanently. Once it has downloaded, `cd` into the directory and run the usual commands:
cd apertium-quality
./autogen.sh && make && sudo make install && sudo make install-nltk
If you see any "errors" relating to libyaml, ignore them. They aren't real errors: the build simply isn't finding the C libyaml libraries and falls back to the pure Python ones.
Add `--prefix` wherever you please, and if Python 3 isn't detected for some reason, prefix the `./autogen.sh` command with `PYTHON=/path/to/python3`.
If you installed to a prefix, you must export the line printed by the Makefile after the build before running any of the tools; otherwise Python won't be able to find the module.
PLEASE NOTE: if you cannot install lxml, you may come across a few bugs, as lxml is preferred over the built-in ElementTree library for compatibility and speed reasons. If you do come across any bugs however, please report them!
`apertium-mt-he` is a dictionary pair that was being developed during the GSoC period while I was coding `apertium-quality`, so why not try it out and see how it's going?
Let's download it:
git clone https://github.com/apertium/apertium-mlt-heb
cd apertium-mlt-heb
So assuming that the aq tools were installed correctly and are visible on your `$PATH`, we can begin testing. Otherwise, set up your $PATH correctly.
First of all, we should pick a few revisions to test so we can get some pretty stats out of them. At the time of writing, the log shows this for me:
$ git log
commit efc73a5695e3c549ccfa4a66bb6a63d2fd830e8c
Author: Francis M. Tyers <firstname.lastname@example.org>
Date:   Fri Nov 18 12:32:33 2016 +0000

    modes

commit 686db20d4fcb5156e73f4f2c94acdf8ee2a6371a
Author: Francis M. Tyers <email@example.com>
Date:   Fri Nov 18 12:31:35 2016 +0000

    move

commit 8b3078c3d83ed2f51f7e782304851ddb56e03a46
Author: Francis M. Tyers <firstname.lastname@example.org>
Date:   Fri Nov 18 12:30:38 2016 +0000

    move to mlt-heb

commit 386091477c4f1640ad11b70db3ff998234e002e5
Author: Kevin Brubeck Unhammer <email@example.com>
Date:   Mon Jan 4 10:34:00 2016 +0000

    tolbox→toolbox
Let's start by checking out an earlier revision and work our way forward:
git checkout a3f1b90732c3d155b2db0ae1211288d51d998c77
Herein lies the testing
So you've successfully gotten up to this point? Wonderful! Let's start doing some testing. But first, a few things:
- Never do testing on a dictionary without committing changes first. To guarantee the integrity of the data, the tools most likely won't let you.
- Unless you want to save statistics, don't use `-X`. Test first, save statistics later.
First, let's compile the dictionary.
./autogen.sh && make
If you have installed Apertium to a prefix, make sure you prepend ./autogen in the above code with
This kind of test allows you to see the naive ambiguity of your dictionaries by counting the forms of each word, then averaging the result. We do this by running `aq-ambtest` on each dix. So let's do it!
aq-ambtest apertium-mt-he.he.dix -X
aq-ambtest apertium-mt-he.mt.dix -X
aq-ambtest apertium-mt-he.mt-he.dix -X
See that dangling `-X` on the end? That's the flag to save statistics. By default it saves to a file called `quality-stats.xml`. If you add a filename after the `-X`, it'll save to that file instead. By putting `-X` at the end, we can just use the default, and we will! All tests use `-X` for saving stats, so it's easy to remember.
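As a rough illustration of the averaging described above, here is a minimal Python sketch. It assumes that "naive ambiguity" means the mean number of distinct analyses per surface form; the function name `naive_ambiguity` and the sample entries are invented for illustration and are not taken from `aq-ambtest` itself.

```python
from collections import defaultdict


def naive_ambiguity(pairs):
    """Mean number of distinct analyses per surface form.

    `pairs` is an iterable of (surface_form, analysis) tuples.
    This is an assumed definition, not aq-ambtest's actual code.
    """
    analyses = defaultdict(set)
    for surface, analysis in pairs:
        analyses[surface].add(analysis)
    if not analyses:
        return 0.0
    return sum(len(a) for a in analyses.values()) / len(analyses)


# Invented sample data: one form with two analyses, one with a single analysis.
sample = [
    ("dar", "dar<n><f><sg>"),
    ("dar", "dar<vblex><past>"),
    ("kelb", "kelb<n><m><sg>"),
]
print(naive_ambiguity(sample))
```

A value close to 1 would mean most forms have a single analysis; higher values mean more ambiguity on average.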
Dictionaries are complex beings. Sometimes you want to write some tests to prove you haven't broken something, or that you've met a milestone. Luckily, this dictionary has quite a few regression tests already written for our testing pleasure, so let's have a look at them!
aq-regtest -d . mt-he http://wiki.apertium.org/wiki/Special:Export/Maltese_and_Hebrew/Regression_tests -X
aq-regtest -d . mt-he http://wiki.apertium.org/wiki/Special:Export/Maltese_and_Hebrew/Pending_tests -X
You should get a lot of output saying WORKS everywhere for the first one, with 24/24 passes. 100%, yay! Pending, however, is a bit more disappointing. 4/33? How disheartening. No matter, maybe it gets better in the future!
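The pass counts above boil down to comparing each expected translation with what the pipeline actually produces. A minimal sketch of that idea follows; `run_regression`, the lookup table standing in for the translator, and the failure label are all hypothetical, and `aq-regtest` may well work differently internally.

```python
def run_regression(tests, translate):
    """Return (passed, total) for a list of (source, expected) pairs."""
    passed = 0
    for source, expected in tests:
        actual = translate(source)
        if actual == expected:
            passed += 1
            print("WORKS", source)
        else:
            # The label used for failures here is made up for this sketch.
            print("FAIL ", source, "expected:", expected, "got:", actual)
    return passed, len(tests)


# Stand-in for the real translation pipeline: a fixed lookup table.
fake_output = {"input one": "output one", "input two": "something else"}

tests = [("input one", "output one"), ("input two", "output two")]
passed, total = run_regression(tests, fake_output.get)
print("{}/{} passes".format(passed, total))
```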
Wikipedia corpus extractor
So in order to do some of the tests like generation testing or coverage testing, we need corpora, right? Have no fear, for `aq-wikicrp` is here! Let us get a Maltese Wikipedia dump and make a lovely little corpus with it. But first, we must get a sentence tokeniser compatible with our version of Python.
- Python 3.1: `wget http://apertium.bbqsrc.net/static/maltese-3.1.pickle`
- Python 3.2: `wget http://apertium.bbqsrc.net/static/maltese-3.2.pickle`
For information on how to generate your own tokenisers, check out dev/punktgen.py in the apertium-quality git repo.
And now, assuming Python 3.2, we run the following:
wget http://dumps.wikimedia.org/mtwiki/latest/mtwiki-latest-pages-articles.xml.bz2 && bunzip2 mtwiki-latest-pages-articles.xml.bz2
aq-wikicrp mtwiki-latest-pages-articles.xml mt.wikipedia.crp.txt -t ./maltese-3.2.pickle
aq-wikicrp -x mtwiki-latest-pages-articles.xml mt.wikipedia.crp.xml -t ./maltese-3.2.pickle
Have a look at both of the output files. One of them is purely plain text, with a sentence per line. The other file is an XML corpus, separated into "entries". The XML format allows for some more advanced parsing and searching, but is so far hardly used with this toolkit, other than coverage testing, where both formats are supported. Feel free to use whichever you're more comfortable with.
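Since the plain-text corpus holds one sentence per line, a quick sanity check is easy to write. The sketch below counts sentences and whitespace-separated tokens; `corpus_summary` is a made-up helper and the sample sentences are invented, and whitespace splitting is only a crude stand-in for whatever tokenisation the toolkit itself uses.

```python
import io


def corpus_summary(lines):
    """Count non-empty lines (sentences) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


# Invented two-sentence corpus held in memory for the demonstration.
sample = io.StringIO("L-ewwel sentenza hawn.\nIt-tieni sentenza.\n\n")
print(corpus_summary(sample))
```

To try it on the real corpus, pass an open file instead, for example `open("mt.wikipedia.crp.txt", encoding="utf-8")`.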
Coverage testing does what you'd expect: it tests the dictionary for coverage. Using our newly created corpora, we can test the coverage! Feel free to use either format, but be consistent and stick to one of them.
aq-covtest mt.wikipedia.crp.txt mt-he.automorf.bin -X
# OR
aq-covtest mt.wikipedia.crp.xml mt-he.automorf.bin -X
This command should run fairly quickly and give you the number of tokenised words, the coverage percentage and the top unknown words. It also displays how quickly it processed the whole corpus, and it's all stored nice and cosy in the stats file.
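To make those figures concrete, here is a minimal sketch of how they could be computed from analyser output in the Apertium stream format, where each word appears as `^surface/analysis$` and an unknown word's analysis starts with `*`. The function name `coverage` and the sample string are invented, escaped characters are not handled, and `aq-covtest` may compute its numbers differently.

```python
import re
from collections import Counter

# Matches one lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed, top=10):
    """Return (token count, coverage percentage, most common unknown words)."""
    known = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if analyses.startswith("*"):
            unknown[surface] += 1  # the analyser did not recognise this form
        else:
            known += 1
    total = known + sum(unknown.values())
    percent = 100.0 * known / total if total else 0.0
    return total, percent, unknown.most_common(top)


# Invented analyser output: three known words and one unknown word.
sample = ("^il/il<det><def>$ ^kelb/kelb<n><m><sg>$ "
          "^xyzzy/*xyzzy$ ^dar/dar<n><f><sg>/dar<vblex>$")
print(coverage(sample))
```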
There are a few other tests that can be done on the dictionaries that don't really need their own command, so these functions live in `aq-dixtest`. Currently they include counting entries and counting transfer rules, but might be extended to count CG rules as well. It can be run like so:
aq-dixtest mt-he -X
This will show the count of rules per file, and the entry count per file. It also shows the totals and unique entries.
- See also: Testvoc
This test lets you see how your dictionary's transfer rules are working and gives you some useful output. By default it saves the output in `voctest.txt`, so it's worth having a look at this file to find transfer bugs and whatnot.
aq-voctest mt-he -X
You should get some pretty line counts, # counts and @ counts. Sweet!
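In Apertium testvoc output, `@` conventionally marks a word missing from the bilingual dictionary and `#` marks a form that could not be generated. The sketch below shows one way such counts could be taken from a saved output file; it assumes the counts are per line, `summarise_testvoc` is a hypothetical helper, and the sample lines are invented rather than real `voctest.txt` content.

```python
def summarise_testvoc(lines):
    """Count total lines and lines containing '#' or '@' markers."""
    lines = [line for line in lines if line.strip()]
    return {
        "lines": len(lines),
        "#": sum(1 for line in lines if "#" in line),
        "@": sum(1 for line in lines if "@" in line),
    }


# Invented sample: one clean line, one '@' line and one '#' line.
sample = [
    "^kelb<n><m><sg>$ --> ^word<n><m><sg>$ --> word",
    "^qattus<n><m><sg>$ --> ^@qattus<n><m><sg>$ --> @qattus",
    "^dar<n><f><sg>$ --> ^other<n><f><sg>$ --> #other",
]
print(summarise_testvoc(sample))
```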
Generation testing isn't working at the moment sadly, so stay tuned.
Morph testing isn't supported by the language pair we're using, but it is as simple to run as regression testing. One simply passes it a configuration file, like so:
aq-morftest tests.yaml -X
This command will do very similar things to the regression test, but for HFST dictionaries. It allows you to test morphology in both directions to find bugs and regressions. Pretty schmick.
What now?
Now we've run pretty much every test. What now? Didn't you say something about pretty graphs? Yes, I did! But first, we need more statistics, and who really wants to type all of these commands all of the time unless testing for output?
`aq-autotest` takes an .aqx file as input. It is an XML configuration file, defined at http://github.com/bbqsrc/apertium-schemas in the aqx.rnc file. For this walkthrough, here's a ready-made configuration file just for you!
<config xmlns="http://apertium.org/xml/quality/config/0.1">
  <commands>
    <command>./autogen.sh</command>
    <command>make clean</command>
    <command>make</command>
  </commands>
  <coverage>
    <corpus language="mt-he" path="mt.wikipedia.crp.txt"/>
  </coverage>
  <regression>
    <test language="mt-he" path="http://wiki.apertium.org/wiki/Special:Export/Maltese_and_Hebrew/Regression_tests"/>
    <test language="mt-he" path="http://wiki.apertium.org/wiki/Special:Export/Maltese_and_Hebrew/Pending_tests"/>
  </regression>
</config>
If you have installed Apertium to a prefix, make sure you prepend ./autogen in the above file with
For this example, simply save the file in the current directory as `quality.aqx`. Before running it, let's delete `quality-stats.xml` so we don't duplicate data unnecessarily, and feed the aqx file to `aq-autotest` like so:
aq-autotest quality.aqx -X
This has much less output than the other frontends, due to its primary purpose being to generate statistics as easily as possible.
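If you want to check what a given `.aqx` file will ask `aq-autotest` to do, the configuration can be read with the standard library's ElementTree. This is only a sketch: `summarise_config` is a hypothetical helper, the embedded XML is a shortened copy of the configuration above, and the real tool may parse the file differently.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the configuration file in this walkthrough.
NS = {"aq": "http://apertium.org/xml/quality/config/0.1"}

AQX = """<config xmlns="http://apertium.org/xml/quality/config/0.1">
  <commands>
    <command>./autogen.sh</command>
    <command>make clean</command>
    <command>make</command>
  </commands>
  <coverage>
    <corpus language="mt-he" path="mt.wikipedia.crp.txt"/>
  </coverage>
</config>"""


def summarise_config(xml_text):
    """Return the build commands and the (language, path) of each corpus."""
    root = ET.fromstring(xml_text)
    commands = [c.text for c in root.findall("aq:commands/aq:command", NS)]
    corpora = [(c.get("language"), c.get("path"))
               for c in root.findall("aq:coverage/aq:corpus", NS)]
    return commands, corpora


commands, corpora = summarise_config(AQX)
print(commands)
print(corpora)
```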
To keep this short, here is a simple loop that runs the same tests over the remaining revisions. It uses `git checkout`, as in the earlier step. The identifiers listed are plain revision numbers, which git is unlikely to recognise as commits, so substitute the commit hashes shown by `git log` in your own checkout:
for i in 33402 33403 33404 33405 33406 33421 33423 33424 33425 33426 33427 33471 33523 33524 33525 33534 33535 33681 33682 33683; do git checkout $i; aq-autotest quality.aqx -X; done
You might notice that some of the testing fails and bails out. This is normal. It means that revision of the dictionary was buggy and won't compile correctly, so is simply skipped. As it should be.
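The same "run each revision, skip the broken ones" idea can be sketched in Python. This assumes revisions are switched with `git checkout`, as in the earlier checkout step, and that `aq-autotest` returns a non-zero exit status when a revision fails, which is an assumption rather than documented behaviour. `autotest_revisions` and `fake_run` are invented names, and the demonstration uses a fake runner so nothing is actually executed.

```python
import subprocess


def autotest_revisions(revisions, run=subprocess.call):
    """Check out each revision and run aq-autotest; return skipped revisions."""
    skipped = []
    for rev in revisions:
        # If the checkout itself fails, skip this revision and move on.
        if run(["git", "checkout", rev]) != 0:
            skipped.append(rev)
            continue
        # Assumed: a non-zero status means the build or tests bailed out.
        if run(["aq-autotest", "quality.aqx", "-X"]) != 0:
            skipped.append(rev)
    return skipped


def fake_run(argv):
    """Dry-run stand-in: pretend that checking out 'bad' fails."""
    print("would run:", " ".join(argv))
    return 1 if argv[:2] == ["git", "checkout"] and argv[2] == "bad" else 0


print(autotest_revisions(["rev-a", "bad", "rev-b"], run=fake_run))
```

Leaving out the `run` argument makes the function call the real commands through `subprocess.call`.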
So, once that finishes running, hurray! We have enough stats to do something fun. What's that, you ask?
Welcome to pretty graphs — aq-htmlgen
We can generate the preliminary version of some pretty graphs with a very simple command:
aq-htmlgen quality-stats.xml out
This will output a bunch of JS, CSS, and HTML files to the directory specified, which here is `out`. Open `index.html` in `out` and enjoy the graphs! They are a bit limited right now, but can be easily improved by dropping in a replacement for `raphael_linegraph.js`, which is on the list of things to do!
Conclusion
I hope the walkthrough wasn't too tedious and highlights how using this framework can improve your testing life. Developer documentation will come very soon so if you wish to extend or improve the framework, you will be able to do it with ease.
Documentation about the configuration formats and XML schemas can be found on the wiki at http://wiki.apertium.org/wiki/apertium-quality.