English and Catalan/GSOC 2017


This page serves as a summary of all the work done in the English-Catalan pair during Google Summer of Code 2017. For a more detailed workplan, please check this page.

Main work

The main goal of this project was to bring the apertium-eng-cat pair to a state where it can fully replace the old apertium-en-ca pair, which has become outdated and difficult to maintain.

Work has affected several areas of the English-Catalan pair. You can see a full list of commits and modified files here.

Dictionaries

On the one hand, the bilingual dictionary has grown from 35,000 entries to more than 66,000. This is well above the initial goal of 59,000 entries, thanks to the supporting work done on automatisation. In addition to new entries (added from frequency lists and crossdics), entries broken by tag changes have been fixed, and the dictionary has been reorganised and cleaned. The monolingual dictionaries have been reorganised too, so developers of other pairs that use the English and Catalan dictionaries will notice improvements.

On the other hand, unexpected testvoc errors (over 7,000 for EN>CA and around 10,000 for CA>EN) prevent the pair from being released in trunk immediately after GSoC. However, the pair is mature enough to completely replace the old apertium-en-ca pair once it becomes "testvoc clean". There is also pending work regarding proper noun tags in Catalan (they lack gender and number), but this is mostly complete and there are only ~900 lemmas left to change before the fixes can be merged.

Transfer rules

Transfer rules were (and still are) perhaps the most fragile component of this pair. The number of rules is quite high given the differences between the two languages, several rules operate across different transfer stages, and the code is quite convoluted. Work on transfer rules has focused on updating the code so that every rule is applied correctly, and on adding some new rules, mostly related to verbs.

Separable verbs

Thanks to the creation of the Lsx module, developed in parallel to this project, Apertium now has a powerful module to analyse separable words correctly. Given the importance of phrasal verbs in English, support for this module has been added to the English-Catalan pair, and it should improve translation quality in the future.

Tagger and CG

The English tagger had not been retrained for some time, and its accuracy had declined due to tag changes, so updating it was a priority. The English module now has a new tagger based on a perceptron (instead of Hidden Markov Models) that has reached 90% tagging accuracy. In addition, new CG rules have been added to further improve disambiguation.
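As a rough illustration of what a tagging accuracy figure measures, the sketch below compares a tagger's output against a hand-corrected gold standard, token by token. The function name `tagging_accuracy` and the sample tag sequences are assumptions made for this example; they are not taken from the evaluation actually used for the English tagger.

```python
def tagging_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    Both arguments are lists of tags, aligned token by token.
    """
    if len(predicted) != len(gold):
        raise ValueError("sequences must be aligned token by token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical five-token sentence: the third token is mistagged as a noun.
gold_tags = ["det", "n", "vblex", "pr", "n"]
predicted_tags = ["det", "n", "n", "pr", "n"]

if __name__ == "__main__":
    print(tagging_accuracy(predicted_tags, gold_tags))
```

Under this definition, a 90% accuracy means that roughly one token in ten still receives a wrong tag, which is why additional CG rules remain useful.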

Supporting work

Documentation

Documentation on the language pair, the monolingual dictionaries and even the tagger was lacking, so I have made an effort to document as much as possible, both to help current and future Apertium developers and to better define the current status of the different modules. The most important piece of work has been a catalogue of all the transfer rules and macros in the pair, but there is also new documentation about the dictionaries and guidelines to make teamwork easier.

Automatisation

To make pair development easier and faster, several bash scripts have been created and used. The most notable one is transfer_documentation.sh, which lets the user create wikitables with detailed information about transfer rules from data embedded in the rule files (T1X, T2X and T3X). There are other scripts that help with dictionary entries too, especially proper nouns, as they usually do not need obscure paradigms.

Given that these scripts may be very useful to other developers, they are now available in this repository.
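To give an idea of how a wikitable can be generated from data embedded in a rule file, here is a minimal Python sketch. It is not transfer_documentation.sh itself (which is a bash script) and does not claim to reproduce its output: it simply assumes that each `<rule>` element carries a `comment` attribute and a list of `<pattern-item>` elements, and the function name `rules_to_wikitable`, the sample rules and the column layout are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical transfer file: real T1X files contain many more sections.
SAMPLE_T1X = """<transfer>
  <section-rules>
    <rule comment="REGLA: DET NOM">
      <pattern>
        <pattern-item n="det"/>
        <pattern-item n="nom"/>
      </pattern>
    </rule>
    <rule comment="REGLA: ADJ NOM">
      <pattern>
        <pattern-item n="adj"/>
        <pattern-item n="nom"/>
      </pattern>
    </rule>
  </section-rules>
</transfer>"""


def rules_to_wikitable(xml_text):
    """Build a MediaWiki table with one row per <rule> element."""
    root = ET.fromstring(xml_text)
    lines = ['{| class="wikitable"', "! No. !! Comment !! Pattern"]
    for number, rule in enumerate(root.iter("rule"), start=1):
        # The comment attribute is assumed to hold the rule description.
        comment = rule.get("comment", "")
        # Join the category names of the pattern items, in document order.
        pattern = " ".join(item.get("n", "") for item in rule.iter("pattern-item"))
        lines.append("|-")
        lines.append(f"| {number} || {comment} || {pattern}")
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rules_to_wikitable(SAMPLE_T1X))
```

Keeping the description inside the rule file means the documentation is regenerated from the same source as the rules, so the two are less likely to drift apart.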