Romanian and Catalan/GSOC 2018

This page serves as a summary of all the work done in the Romanian-Catalan pair during Google Summer of Code 2018. It also includes information on the upgrade of four language pairs which was carried out during the same period. For a more detailed workplan of the project, please check this page.

Main work

The main goal of this project was to upgrade the Romanian-Catalan language pair to the monolingual package system and develop it to bring it to release quality. In addition, four other language pairs have been upgraded to the monolingual package system to ease future development. As such, the commits are distributed into many repositories, which are listed below:

Romanian-Catalan pair

Other pairs

Dictionaries

The bilingual dictionary, which initially comprised 12,819 entries, has been expanded to 23,015 entries obtained from crossdics with Romanian-Spanish and from frequency lists. This is substantially lower than the initial goal of 31,000 entries, but a large share of the existing entries did not work because of missing gender tags, incorrect direction tags in the monolingual dictionaries, or missing corresponding entries in the monolingual dictionaries. Fixing these existing entries was therefore given priority, and this has proven fruitful: Wikipedia coverage has surpassed expectations (86.8% for Romanian and 88.7% for Catalan).
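The coverage figures above can be illustrated with a minimal sketch. It assumes coverage means naive coverage, that is, the share of tokens that receive at least one analysis, and that the analyser output uses the Apertium stream format, where an unknown word is marked with an asterisk. The function name, the sample sentence and its tags are hypothetical, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters such as \/ or \$ are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is emitted as ^word/*word$, so its analysis starts with "*".
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output; the tags are made up for this example.
sample = (
    "^casa/casă<n><f><sg>$ ^xyzzy/*xyzzy$ "
    "^este/fi<vblex><pri><p3><sg>$ ^mare/mare<adj><mf><sg>$"
)
print(f"{coverage(sample):.1%}")  # prints 75.0%
```

In practice the input would be the analyser output for a whole corpus, such as a Wikipedia dump, rather than a single sentence.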

The Romanian monolingual dictionary, given its limited development in Apertium, contained many inaccuracies and errors which have been fixed as part of the project. The Catalan side, on the other hand, has not needed major changes due to extensive development and consensus between developers.

A few dictionary-related testvoc errors remain in both directions and prevent the immediate release of the pair at the end of GSoC, but they are easy to fix and the pair should be clean with a few more days of work.

Transfer rules

Before this project, the pair contained many transfer rules for the Romanian → Catalan direction, most of them ported from the Romanian-Spanish pair. These single-step transfer rules have been partially rewritten and kept in the pair as chunking rules. In addition, basic interchunk rules have been added to enforce gender and number agreement between noun/pronoun and verb chunks. This has reduced the position-independent error rate (PER) in this direction from 36% to 29%.

The other direction (Catalan → Romanian) had no transfer rules, so they have been written from scratch. Most of the work has been done in the chunking stage, but there are basic interchunk rules too. In addition, because of the way gender is handled in Romanian, a fourth transfer step has been added to generate adjectives and determiners correctly. In this direction, the PER has been reduced from 61% to 46%.

Given the effort needed to develop transfer rules and the tight schedule, it has not been possible to reduce PER as much as desired. However, the results are good enough to allow for a release of the pair in both directions, and the current transfer rules will ease future development.
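For readers unfamiliar with the metric, the following is a minimal sketch of how a position-independent error rate can be computed for one sentence. It uses the common bag-of-words definition, (max(reference length, hypothesis length) - matching words) / reference length, and is not necessarily the exact tool or tokenisation used for the figures above. The function name and the example sentences are hypothetical.

```python
from collections import Counter


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate of a hypothesis against a reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection: words count as matches regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "the cat sat on the mat"
hypothesis = "on the mat the dog sat"
print(f"{per(reference, hypothesis):.1%}")  # prints 16.7%
```

Because word order is ignored, PER mostly reflects lexical and agreement errors, which is why it is a reasonable yardstick for dictionary work and agreement rules.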

Other language pair upgrades

In addition to the main work on the Romanian-Catalan pair, four other pairs (Indonesian-Malay, Welsh-English, Catalan-Italian and Afrikaans-Dutch) have been upgraded to the monolingual package system. All four pairs had been released in the past, but they relied on embedded monolingual data that made further development tedious and difficult. The upgrade has consisted of separating the monolingual data (the upgrade itself) and fixing the testvoc errors that emerged after the change. Welsh-English had been published with errors, and it has been impossible to fix all of them due to a lack of knowledge of the Welsh language, but the number of errors has been reduced nonetheless.

Supporting work

Automatic evaluation

The development of transfer rules has required the adoption of an effective way to constantly evaluate translations and see the differences in results between versions. This has been possible thanks to the work of Xavi Ivars (mentor) and Jaume Ortolà (Apertium contributor), who have included the pair in an automatic diff system powered by Softcatalà.

Inspired by this idea, a local system of equal functionality was written for more intensive, personal use. The evaluations have greatly helped to detect errors and flaws in the pair and tweak the transfer rules for better results.
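The local system itself is not reproduced on this page. As a rough sketch of the idea only, the snippet below compares the translations produced by two versions of the pair for the same source sentences and reports the ones that changed. The function name and the sample sentences are hypothetical; in a real setup the two lists would hold the output of the previous and the current build.

```python
def changed_sentences(
    sources: list[str], old: list[str], new: list[str]
) -> list[tuple[str, str, str]]:
    """Return (source, old translation, new translation) for every change."""
    if not (len(sources) == len(old) == len(new)):
        raise ValueError("all three lists must have the same length")
    return [(s, o, n) for s, o, n in zip(sources, old, new) if o != n]


# Hypothetical test sentences and outputs from two versions of the pair.
sources = ["Casa este mare.", "Am o carte."]
previous = ["El casa és gran.", "Tinc un llibre."]
current = ["La casa és gran.", "Tinc un llibre."]

for source, before, after in changed_sentences(sources, previous, current):
    print(f"SRC: {source}\n  - {before}\n  + {after}")
```

Reviewing only the changed sentences after each rule edit makes regressions visible immediately, without rereading the whole test corpus.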

Documentation

As the pair had very little documentation (limited to a few inline comments in transfer rule files and dictionaries), documentation has also been taken into consideration during development. The following wiki pages have been created or modified to reflect information on the pair, and will be updated in the future with more useful data:

Future work

Work needed for release

  • Clean up testvoc errors

Other plans

  • Finish the rewrite of old T1X rules for Romanian → Catalan
  • Expand transfer rules to handle more complex structures, especially verb-pronoun structures

Final words

Working with Apertium during Google Summer of Code 2018 for a second time has been a great experience that has allowed me to get to know Apertium even better and to feel more confident about contributing in the future. The experience gained the first time has allowed me to take on more complex challenges and find better solutions to problems. Hopefully, the good results of the Romanian-Catalan pair will make Apertium a good and reliable alternative to commercial machine translation systems between these two languages. I also hope that the upgrade of the other four pairs will encourage other developers to work on them in the future.

Last but not least, I would like to thank not only my mentors (Xavi Ivars and Hèctor Alòs) but also other Apertium contributors and friends such as Fran Tyers and Jaume Ortolà, who have all provided me with invaluable feedback and help.