English and Catalan/GSOC 2017
This page serves as a summary of all the work done in the English-Catalan pair during Google Summer of Code 2017. For a more detailed workplan of the project, please check this page.
The main goal of this project was to bring the apertium-eng-cat pair to a state where it can fully replace the old apertium-en-ca pair, which has become outdated and difficult to maintain.
Work has affected several areas of the English-Catalan pair. You can see a full list of commits and modified files here.
On the one hand, the bilingual dictionary has grown from 35,000 entries to more than 66,000. This is significantly higher than the initial goal of 59,000 entries thanks to supporting work done on automation, which has been very helpful. In addition to new entries (added from frequency lists and crossdics), entries broken due to tag changes have been fixed, and the dictionary has been reorganised and cleaned. The monolingual dictionaries have been reorganised too, so developers of other pairs using the English and Catalan dictionaries will notice improvements.
On the other hand, unexpected testvoc errors (over 7,000 for EN>CA and around 10,000 for CA>EN) prevent the pair from being released in trunk immediately after GSoC. However, the pair is mature enough to completely replace the old apertium-en-ca pair once it becomes "testvoc clean". There is also pending work regarding proper noun tags in Catalan (they lack gender and number), but this is mostly complete and there are only ~900 lemmas left to change before the fixes can be merged.
Transfer rules were (and still are) perhaps the most fragile component of this pair. The number of rules is quite high given the differences between the two languages, several rules span more than one transfer stage, and the code is quite convoluted. Work on transfer rules has focused on updating the code so that every rule is applied correctly and on adding some new rules, mostly related to verbs. The final WER/PER (51.08%/33.68%) is higher than expected, but it should improve once the pair is testvoc clean and proper nouns are corrected.
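As a rough illustration of what the two figures measure, here is a minimal Python sketch of word error rate (WER) and position-independent error rate (PER). It is not the evaluation script used to obtain the numbers above: the function names are hypothetical, and the PER formula is one common formulation that I am assuming here, which may differ in detail from the one used by the Apertium evaluation tools.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Edits needed to turn hyp into ref, relative to reference length."""
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Like WER, but word order is ignored (bag-of-words comparison).

    Assumed formulation: (max(len(ref), len(hyp)) - matches) / len(ref).
    """
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "the cat sat on the mat".split()
hypothesis = "the cat on the mat sat".split()
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"PER: {per(reference, hypothesis):.2%}")
```

Because PER ignores word order, it is never higher than WER for the same sentence pair under this formulation, which is consistent with the gap between the two figures reported above.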
Thanks to the creation of the Lsx module, in parallel to this project, Apertium now has a powerful module to correctly analyse separable words. Given the importance of phrasal verbs in English, support for this module has been added to the English-Catalan pair, and it should improve translation quality in the future.
Tagger and CG
The English tagger had not been retrained for some time, and its accuracy had declined due to tag changes, so updating it was a priority. The English module now has a new tagger based on a perceptron (instead of Hidden Markov Models) that reaches 90% tagging accuracy. In addition, new CG rules have been added to further improve disambiguation.
Because the language pair, the monolingual dictionaries and even the tagger lacked documentation, I have made an effort to document as much as possible, both to help current and future Apertium developers and to describe the current status of the different modules more precisely. The most important piece of work has been a catalogue of all the transfer rules and macros in the pair, but there is also new documentation about the dictionaries, as well as guidelines to make teamwork easier.
To make pair development easier and faster, several bash scripts have been created and used. The most notable one is transfer_documentation.sh, which generates wikitables with detailed information about transfer rules from data embedded in the rule files (T1X, T2X and T3X). There are other scripts that can help with dictionary entries too, especially with proper nouns, as they usually do not need obscure paradigms.
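To give an idea of the approach, here is a minimal Python sketch of turning data embedded in a transfer file into a wikitable. It is not transfer_documentation.sh itself (which is a bash script and may read different fields). I am assuming that the description of each rule lives in the comment attribute of its rule element, the function name rules_to_wikitable is hypothetical, and the sample is a trimmed, made-up fragment rather than a rule from the pair.

```python
import xml.etree.ElementTree as ET


def rules_to_wikitable(xml_text):
    """Return a MediaWiki table with each rule's comment and pattern items."""
    root = ET.fromstring(xml_text)
    lines = ['{| class="wikitable"', "! Rule !! Pattern"]
    for rule in root.iter("rule"):
        # Assumption: the human-readable description is in the comment attribute.
        comment = rule.get("comment", "")
        pattern = " ".join(
            item.get("n", "") for item in rule.iter("pattern-item")
        )
        lines.append("|-")
        lines.append(f"| {comment} || {pattern}")
    lines.append("|}")
    return "\n".join(lines)


# Trimmed, made-up sample; a real rule also has an action element.
SAMPLE = """<transfer>
  <section-rules>
    <rule comment="RULE: adj noun">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
    </rule>
  </section-rules>
</transfer>"""

print(rules_to_wikitable(SAMPLE))
```

Keeping the description inside the rule file means the wiki catalogue can be regenerated whenever the rules change, instead of being maintained by hand.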
Given that these scripts may be very useful to other developers, they are now available in this repository.
Working with Apertium as part of Google Summer of Code has been an amazing experience. As a translator and not a professional programmer, I had a lot to learn at the beginning, but once I became familiar with the file formats everything felt easier. Using the English-Catalan pair every day has shown me its pros and cons, as well as its importance. Even though it may never become a perfect translator, I am committed to continuing to work on it and improving it even more.
The most challenging aspect of this project was probably the lack of proper documentation and development guidelines, so I have tried my best to write documentation about the pair and other areas I have worked in (such as the perceptron tagger). Given that most new pairs depend on shared language modules, tidiness and organisation are crucial. Hopefully, this will not only improve the overall quality and robustness of pairs, but also make Apertium a better home for its developers.