User:Marcriera/Proposal2018

Contact Information

Name: Marc Riera Irigoyen

Location: Barcelona, Spain

E-mail: marc.riera.irigoyen@gmail.com

IRC: mriera_trad

SourceForge: marcriera

Timezone: UTC+02:00

Why is it you are interested in machine translation?

As a Translation and Interpreting student and future professional translator, I find machine translation interesting because of its dramatic improvement over the last few years and its increasing prevalence in society. I am therefore very interested in how translators can make the most of it and use it responsibly.

Why is it that you are interested in Apertium?

Apertium is not only the organization behind a great open source project; it is also a very welcoming family of collaborators and language enthusiasts. After successfully participating in GSoC 2017 with Apertium and completing my project, I felt motivated to keep contributing regularly. Now, taking part in GSoC 2018 with Apertium again is the best way to boost its development, increase its importance and reach new users.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in upgrading several language pairs to ease future development and bring one of them (Romanian-Catalan) to release status. Development of the Romanian-Catalan pair will take place during the first two thirds of the programme, and the upgrade of the other pairs will take place during the last third.

Apertium currently uses independent language modules for pairs, so monolingual data is shared between pairs. However, this was different in the beginning, when pairs were self-contained and included their own monolingual data. Many pairs have been upgraded to the new system, which is more efficient and allows users to easily share work, but a few still have not. Consequently, potential language developers avoid them because of the extra difficulty, and the pairs quickly become out of date.

One of these language pairs, Romanian-Catalan, was upgraded recently to use the new system. Despite still being unreleased, it contains a basic but decent bilingual dictionary and transfer rules for the Romanian > Catalan direction, re-used from a very similar pair, Romanian > Spanish. The Apertium wiki also contains several documentation pages related to this other pair, which provide very useful information. While many entries in the bilingual dictionary are broken as an effect of the upgrade, the two languages are close to each other, and with some intense development the pair would be error-free and ready for release. A working direct Romanian-Catalan pair would also be unique to Apertium, as proprietary machine translation platforms (such as Google and Yandex) use English as a pivot language, so a direct pair could improve the results considerably.

The other pairs will first be upgraded to use monolingual modules and then cleaned until they are testvoc-clear.
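
As a rough illustration of what a testvoc check looks for, here is a minimal Python sketch. It is not the actual Apertium testvoc script: it assumes the translator output is available as plain text, one translated form per line, and it only looks for the two debug markers Apertium puts in front of problematic words (`@` when a lemma is missing from the bilingual dictionary, `#` when generation of the target form fails). The function names and the sample data are made up for the example.

```python
import re

# Apertium debug markers at the start of a token:
#   '@' -> lemma not found in the bilingual dictionary
#   '#' -> target-language form could not be generated
# Assumption: tokens are separated by whitespace in the output being checked.
ERROR_MARKERS = re.compile(r"(?<!\S)[@#]\S+")


def find_testvoc_errors(lines):
    """Return (line_number, flagged_token) pairs for every marked token."""
    errors = []
    for number, line in enumerate(lines, start=1):
        for match in ERROR_MARKERS.finditer(line):
            errors.append((number, match.group(0)))
    return errors


def is_testvoc_clean(lines):
    """A pair passes this simplified check when no marked tokens remain."""
    return not find_testvoc_errors(lines)


# Hypothetical translator output, invented for illustration only.
sample_output = [
    "casa",
    "@câine",
    "#gat<n><m><pl>",
]

if __name__ == "__main__":
    for number, token in find_testvoc_errors(sample_output):
        print(f"line {number}: {token}")
```

In this simplified view, cleaning a pair means fixing dictionary entries and rules until a check of this kind reports nothing for any form in the source dictionary.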

Title

Reasons why Google and Apertium should sponsor it

How and who it will benefit in society

List your skills and give evidence of your qualifications

List any non-Summer-of-Code plans you have for the Summer

My plan

Major goals

Workplan

Coding challenge