English and Catalan/GSOC 2017
This page summarises the work done on the English-Catalan pair during Google Summer of Code 2017.
Main work
The main goal of this project was to bring the apertium-eng-cat pair to a state where it can fully replace the old apertium-en-ca pair, which has become outdated and difficult to maintain.
Work has affected several areas of the English-Catalan pair. You can see a full list of commits and modified files here.
Dictionaries
On the one hand, the bilingual dictionary has grown from 35,000 entries to more than 66,000. This is well above the initial goal of 59,000 entries, thanks to supporting work on automation, which proved very helpful. In addition to new entries (added from frequency lists and crossdics), entries broken by tag changes have been fixed, and the dictionary has been reorganised and cleaned. The monolingual dictionaries have been reorganised too, so developers of other pairs that use the English and Catalan dictionaries will notice improvements.
On the other hand, unexpected testvoc errors (over 7,000 for EN>CA and around 10,000 for CA>EN) have prevented the pair from being released in trunk immediately after GSoC. However, the pair is mature enough to completely replace the old apertium-en-ca pair once it is "testvoc clean".
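Entry counts like the ones quoted above can be checked with a few lines of code. The following is a minimal sketch, assuming the standard Apertium dictionary layout in which each entry is an `<e>` element inside a `<section>`. The sample dictionary and the function name `count_entries` are invented for illustration and are not taken from the actual pair.

```python
import xml.etree.ElementTree as ET

# Tiny made-up bilingual dictionary in the usual .dix layout (illustrative only).
SAMPLE_BIDIX = """<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"/>
    <sdef n="f"/>
    <sdef n="m"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
    <e><p><l>dog<s n="n"/></l><r>gos<s n="n"/><s n="m"/></r></p></e>
  </section>
</dictionary>"""


def count_entries(dix_xml: str) -> int:
    """Count <e> elements that are direct children of any <section>."""
    root = ET.fromstring(dix_xml)
    return len(root.findall("./section/e"))


if __name__ == "__main__":
    print(count_entries(SAMPLE_BIDIX))
```

Counting only direct children of `<section>` leaves out any `<e>` elements nested in paradigm definitions, so the figure reflects dictionary entries rather than paradigm lines. For a real file, the text could be read from disk and passed to the same function.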
Transfer rules
Transfer rules were (and still are) perhaps the most fragile component of this pair. The number of rules is quite high given the differences between the two languages, several rules operate across different transfer stages, and the code is quite convoluted. Work on transfer rules has focused on updating the code so that every rule is applied correctly and on adding some new rules, mostly related to verbs.
Separable verbs
Thanks to the Lsx module, created in parallel to this project, Apertium now has a powerful module for correctly analysing separable words. Given the importance of phrasal verbs in English, support for this module has been added to the English-Catalan pair, which should improve translation quality in the future.