Romanian and Catalan/GSOC 2018
This page serves as a summary of all the work done in the Romanian-Catalan pair during Google Summer of Code 2018. It also includes information on the upgrade of four language pairs which was carried out during the same period. For a more detailed workplan of the project, please check this page.
The main goal of this project was to upgrade the Romanian-Catalan language pair to the monolingual package system and develop it to bring it to release quality. In addition, four other language pairs have been upgraded to the monolingual package system to ease future development. As such, the commits are distributed into many repositories, which are listed below:
- Romanian: apertium-ron
- Catalan: apertium-cat (only commits after March 2018 are to be considered as part of the project)
- Romanian-Catalan: apertium-ron-cat
- Indonesian-Malay: apertium-ind-zlm apertium-ind apertium-zlm
- Welsh-English: apertium-cym-eng apertium-cym
- Catalan-Italian: apertium-cat-ita
- Afrikaans-Dutch: apertium-afr-nld apertium-afr apertium-nld
The bilingual dictionary, initially comprising 12,819 entries, has been expanded to 23,015 entries, obtained from crossdics with Romanian-Spanish and from frequency lists. While this is substantially lower than the initial goal of 31,000 entries, many of the existing entries did not work because of missing gender tags, incorrect direction tags in the monolingual dictionaries, or missing corresponding entries in the monolingual dictionaries. Fixing these existing entries was therefore given priority, and this proved fruitful: Wikipedia coverage surpassed expectations (86.8% for Romanian and 88.7% for Catalan).
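As an illustration of what a coverage figure like the ones above measures, here is a minimal sketch of naive coverage computed over analyser output in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is echoed back with a leading asterisk. The function name `naive_coverage`, the sample text, and its tags are made up for the example; this is not the script used to obtain the numbers reported here, and it ignores escaped characters in the stream.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received an analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # Unknown words are output as ^word/*word$ by the analyser.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Illustrative input only: the tags are placeholders, not real dictionary output.
sample = "^casa/casa<n>$ ^blorp/*blorp$ ^este/fi<vb>$"
print(f"{naive_coverage(sample):.1%}")
```

In practice the input would be a large corpus (for example a Wikipedia dump) passed through the pair's morphological analyser, so that the ratio reflects how much running text the dictionaries recognise.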
The Romanian monolingual dictionary, given its limited development in Apertium, contained many inaccuracies and errors which have been fixed as part of the project. The Catalan side, on the other hand, has not needed major changes due to extensive development and consensus between developers.
A few dictionary-related testvoc errors remain in both directions and prevent the immediate release of the pair at the end of GSoC. However, they are easy to fix, and the pair should be clean with a few more days of work.
Before this project, the pair contained many transfer rules for the Romanian → Catalan direction, most of them ported from the Romanian-Spanish pair. These were single-step transfer rules, which have been kept in the pair as chunking rules after being partially rewritten. In addition, basic interchunk rules have been added, enabling gender and number agreement between noun/pronoun and verb chunks. This has reduced the PER in this direction from 36% to 29%.
The other direction (Catalan → Romanian) did not have any previous transfer rules, so they have been written from scratch. Most of the work has been done in the chunking stage, but there are basic interchunk rules too. Also, due to the way gender is handled in Romanian, a fourth transfer stage has been added to generate adjectives and determiners correctly. In this direction, the PER has been reduced from 61% to 46%.
Given the effort needed to develop transfer rules and the tight schedule, it has not been possible to reduce PER as much as desired. However, the results are good enough to allow for a release of the pair in both directions, and the current transfer rules will ease future development.
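For readers unfamiliar with the metric, PER (position-independent error rate) compares a machine translation with a reference translation as bags of words, ignoring word order. The following is a minimal sketch of one common formulation; the function name `per` and the whitespace tokenisation are assumptions made for the example, and the tool used to obtain the figures above may differ in its details.

```python
from collections import Counter


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate of a hypothesis against a reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection: words shared by both sides, regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    # Words that are missing, extra, or substituted all count as errors.
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)


# Reordering alone is not penalised; a substituted word is.
print(per("a b c d", "d c b a"))
print(per("a b c d", "a b x d"))
```

Because word order is ignored, PER mainly reflects lexical and morphological choices, such as a wrong gender or number ending, rather than reordering problems.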
Other language pair upgrades
Work needed for release
- Clean up testvoc errors
- Finish the rewrite of old T1X rules for Romanian → Catalan
- Expand transfer rules to handle more complex structures, especially verb-pronoun structures