Difference between revisions of "English and Catalan/GSOC 2017"
Revision as of 16:31, 26 August 2017
This page serves as a summary of all the work done in the English-Catalan pair during Google Summer of Code 2017.
Main work
The main goal of this project was to bring the apertium-eng-cat pair to a state where it can fully replace the old apertium-en-ca pair, which has become outdated and difficult to maintain.
Work has affected several areas of the English-Catalan pair. You can see a full list of commits and modified files here.
Dictionaries
The bilingual dictionary has grown from 35,000 entries to more than 66,000. This is well above the initial goal of 59,000 entries, thanks to the supporting work done on automation. In addition to new entries (added from frequency lists and crossdics), entries broken by tag changes have been fixed, and the dictionary has been reorganised and cleaned. The monolingual dictionaries have been reorganised too, so developers of other pairs that use the English and Catalan dictionaries will notice improvements.
However, unexpected testvoc errors (over 7,000 for EN>CA and around 10,000 for CA>EN) prevent the pair from being released in trunk immediately after GSoC. Even so, the pair is mature enough to completely replace the old apertium-en-ca pair once it becomes "testvoc clean".
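As a rough illustration of what counting testvoc errors involves, the sketch below tallies output lines that contain a word flagged with `@` (no match in the bilingual dictionary) or `#` (a form the generator could not produce). The function name, the one-translation-per-line input and the sample data are assumptions made for this example; the pair's actual testvoc scripts may work differently.

```python
from collections import Counter

# Markers Apertium puts in front of problematic words:
# '@' = lemma not found in the bilingual dictionary,
# '#' = form that the target-language generator could not produce.
MARKERS = ("@", "#")


def count_testvoc_errors(outputs):
    """Count the lines that contain at least one token flagged with each marker."""
    counts = Counter()
    for line in outputs:
        tokens = line.split()
        for marker in MARKERS:
            if any(tok.startswith(marker) for tok in tokens):
                counts[marker] += 1
    return counts


# Hypothetical translated forms, one per line (not real testvoc output).
sample = ["casa", "#casa", "@house", "el #gos"]
print(count_testvoc_errors(sample))
```

A pair is "testvoc clean" when both counts are zero for every form the analyser can produce.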
Transfer rules
Transfer rules were (and still are) perhaps the most fragile component of this pair. The number of rules is quite high given the differences between the two languages, several rules span more than one transfer stage, and the code is quite convoluted. Work on transfer rules has focused on updating the code so that every rule is applied correctly and on adding some new rules, mostly related to verbs.
Separable verbs
Thanks to the creation of the Lsx module, in parallel to this project, Apertium now has a powerful module to correctly analyse separable words. Given the importance of phrasal verbs in English, support for this module has been added to the English-Catalan pair, and it is expected to improve translation quality in the future.
Tagger and CG
The English tagger had not been retrained for some time, and its accuracy had declined due to tag changes, so updating it was a priority. The English module now has a new perceptron-based tagger (replacing the one based on Hidden Markov Models) that has reached 90% tagging accuracy. In addition, new CG rules have been added to further improve disambiguation.
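To give an idea of how a perceptron tagger differs from an HMM one, here is a minimal sketch of the general technique. It is not the code of Apertium's tagger: the class name, the three features and the toy training data are all assumptions chosen for illustration. The model scores each candidate tag from weighted features and, whenever it guesses wrong during training, rewards the correct tag and penalises the wrong one.

```python
class PerceptronTagger:
    """Toy greedy perceptron tagger (illustrative only)."""

    def __init__(self, tags):
        self.tags = sorted(tags)   # fixed order so ties break deterministically
        self.weights = {}          # (feature, tag) -> weight

    def features(self, words, i, prev_tag):
        # Hypothetical feature set: word form, two-letter suffix, previous tag.
        word = words[i].lower()
        return ["word=" + word, "suffix=" + word[-2:], "prev_tag=" + prev_tag]

    def score(self, feats, tag):
        return sum(self.weights.get((f, tag), 0.0) for f in feats)

    def predict(self, feats):
        return max(self.tags, key=lambda t: self.score(feats, t))

    def train(self, sentences, epochs=5):
        for _ in range(epochs):
            for words, gold in sentences:
                prev = "<s>"
                for i, gold_tag in enumerate(gold):
                    feats = self.features(words, i, prev)
                    guess = self.predict(feats)
                    if guess != gold_tag:
                        # Perceptron update: reward the gold tag, penalise the guess.
                        for f in feats:
                            self.weights[(f, gold_tag)] = self.weights.get((f, gold_tag), 0.0) + 1.0
                            self.weights[(f, guess)] = self.weights.get((f, guess), 0.0) - 1.0
                    prev = gold_tag  # train with the gold history

    def tag(self, words):
        prev, out = "<s>", []
        for i in range(len(words)):
            prev = self.predict(self.features(words, i, prev))
            out.append(prev)
        return out


# Toy hand-tagged data (hypothetical).
data = [
    (["the", "dog", "barks"], ["det", "n", "vblex"]),
    (["a", "cat", "sleeps"], ["det", "n", "vblex"]),
]
tagger = PerceptronTagger({"det", "n", "vblex"})
tagger.train(data)
print(tagger.tag(["a", "dog", "sleeps"]))
```

Unlike an HMM, which estimates transition and emission probabilities, this kind of model can combine arbitrary overlapping features, which is one common reason for preferring it.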