User:Shraier/GSoC2012-Application1

Apertium-sl-it: machine translation between Slovene and Italian

Name: Aleš Horvat

Other information that may be useful to contact you: IRC: Shraier


Why is it you are interested in machine translation?

No matter how many years have passed since the first machine translation systems were developed, and no matter how many hours large companies like Google and Microsoft have invested in them, a perfect machine translation system is still far away. I find it fascinating how languages express the same things in different ways, and how machine translation makes you notice these small differences by making cross-language variation explicit. I have been involved in translation for quite some time and I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies, so I believe this would be a great opportunity to use my development skills and my ability to complete a longer-term project. I have had a strong interest in linguistics since childhood; it has grown further since I started using Apertium, and I think this is one of the most exciting areas of technology. I love the idea of helping the open-source community by contributing my knowledge and time so that the results are available to everyone.


Why is it that you are interested in the Apertium project?

The idea of a free and open-source platform for developing rule-based machine translation systems, like Apertium, is great. From what I have seen while working with Apertium, I like its architecture and its low entry requirements for creating new translation pairs. In the past months I have also been working on my bachelor thesis and an article, both related to last year's GSoC project, Apertium-sl-es: machine translation between Slovene and Spanish.

I have been working with Apertium for more than a year now. At the very beginning, my professor at the university got me interested in its architecture and features, so I had to take a deeper look into it. One of the most appealing reasons for using a rule-based machine translation (RBMT) system like Apertium is that experts in the field can refine the automatically produced data. All data is stored in XML files, which are human-readable and editable. I believe it provides everything a developer needs to build a good translation system, and that is why I would love to be part of its community, even after this contest.

Which of the published tasks are you interested in? What do you plan to do?


Title

Apertium-sl-it: machine translation between Slovene and Italian


Reasons why Google and Apertium should sponsor it

Currently Apertium does not have a release-quality translation system for the Slovenian-Italian language pair. Moreover, there is no translation tool for this pair that uses a direct translation method. Google and Microsoft have developed translation tools that translate this language pair via another language (English) [2]. Consequently, the quality is not as good as it could be with a direct translation approach. Another reason to sponsor this project is cross-development with the existing Slovene-Spanish language pair. Since Italian and Spanish are both Romance languages, much of the work done for the Slovene-Italian language pair could easily be reused in the Slovene-Spanish language pair.

The development of the Slovene-Italian language pair would be easier and more effective than last year's Slovene-Spanish language pair, since I live in a bilingual region where Italian is used every day. My knowledge of Italian is as good as my knowledge of Slovenian, which is my native language. I have also been studying Italian (advanced course) for 8 years.

In my opinion, the development of the Slovene-Italian language pair would make a great contribution to Apertium, since it would also bring cross-development for the Slovene-Spanish language pair.


How and who it will benefit in society

The Republic of Slovenia has a population of approximately 2.05 million. It is located in Central Europe, touching the Alps and bordering the Mediterranean. Slovenia borders Italy, Croatia, Hungary and Austria, and also has a small stretch of coastline along the Adriatic Sea.

According to Wikipedia, Slovenia ranks among the top European countries in knowledge of foreign languages. The most commonly taught foreign languages are English, German, Italian, French and Spanish. It should be pointed out that the western part of Slovenian Istria is a bilingual region where both Slovene and Italian must be used in education, law and administration. Since Italian is a compulsory subject in educational institutions in our region, in the past few years many students coming from other regions have had problems learning it. Consequently, I believe it would be great for educational institutions to have access to an open-source system like Apertium with Slovene-Italian and Slovene-Spanish translators in both directions.

Tourism, on the other hand, has grown tremendously in the past few years, and we have more tourists every year. Many people from foreign countries, especially from Italy, spend time in Slovenia, which is why I think a translator between Slovene and Italian in both directions would be very useful for everyone: tourists, Slovenians, and the Italian minority of around 3000 people living in this region along the border between Slovenia and Italy.


Detailed work plan (including, if possible, a brief schedule with milestones and deliverables)

The Apertium SVN has extremely good and polished monolingual morphology files for both languages. I will take the monolingual morphology for Slovenian from the Apertium-sl-es pair and for Italian from the Apertium-es-it pair. The bilingual translation dictionary will be generated by intersecting the Apertium-sl-es bilingual dictionary with the Apertium-es-it bilingual dictionary. A manual revision will be needed to fix mistakes and differences between the source and target tag sets (if needed, I will write rules for this part). If time permits, additional work will go into expanding the Italian monolingual morphology with new lemmas, but other tasks will have priority. I believe 10387 lemmas are enough for the first version of the language pair.
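
A minimal sketch of the intersection step, assuming each bilingual dictionary has already been read into plain (lemma, tag, lemma) triples. The function name `cross_dictionaries` and the sample entries are only illustrative; this is not the actual Apertium tooling, and real dictionary entries carry richer tag information.

```python
from collections import defaultdict


def cross_dictionaries(sl_es, es_it):
    """Pair Slovenian and Italian lemmas that share a Spanish pivot entry.

    sl_es: iterable of (Slovenian lemma, POS tag, Spanish lemma)
    es_it: iterable of (Spanish lemma, POS tag, Italian lemma)
    Returns a sorted list of (Slovenian lemma, POS tag, Italian lemma).
    """
    # Index the es-it entries by (Spanish lemma, POS tag) for fast lookup.
    pivot = defaultdict(set)
    for es_lemma, pos, it_lemma in es_it:
        pivot[(es_lemma, pos)].add(it_lemma)

    # Keep a Slovenian-Italian candidate only when both dictionaries agree
    # on the Spanish lemma and on the part of speech.
    candidates = set()
    for sl_lemma, pos, es_lemma in sl_es:
        for it_lemma in pivot.get((es_lemma, pos), ()):
            candidates.add((sl_lemma, pos, it_lemma))
    return sorted(candidates)


# Hypothetical sample entries, for illustration only.
sl_es = [("hiša", "n", "casa"), ("pes", "n", "perro"), ("jesti", "vblex", "comer")]
es_it = [("casa", "n", "casa"), ("perro", "n", "cane"), ("beber", "vblex", "bere")]

for entry in cross_dictionaries(sl_es, es_it):
    print(entry)
```

Entries whose Spanish lemma is missing from the other dictionary (here "jesti") are dropped, and ambiguous Spanish lemmas can yield wrong pairs, which is why the manual revision is still needed.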

I repeated last year's small-scale test of translating randomly selected lemmas from the Slovenian part of the bilingual dictionary and came to the same conclusions:

I will use Google Translate for nouns and adjectives because it translates them very well, but for verbs and pronouns I will have to take a different approach. I currently have a list of translations for 400 verbs that will certainly be added to the bilingual dictionary. I do not rule out manual work if it is needed.

The main problems that I foresee are the POS tagger for Slovenian and the distance between the languages. I will devote as much time as possible to discovering the differences and writing appropriate transfer rules. The distance between the languages does not allow me to create both translation directions, so I would like to focus on only one. To avoid the first problem, the disambiguation of Slovenian, I will choose the Italian to Slovenian direction only.

Transfer rules will be manually constructed according to [6][7][8] and adapted for the Slovene-Spanish language pair.

Week plan

Community bonding period: In this period I will set up a clean apertium-sl-it system, add the already mentioned list of verbs to the monolingual morphology of both languages and correct the differences between the source and target tag sets of the morphological dictionaries. I will do some research to find aligned Slovene-Italian text, which will help me later on. I will add a small set of entries to the bilingual dictionary to test the whole Apertium pipeline and also prepare a set of pending translation candidates. In this period I would also like to become more involved in the Apertium community and to refresh connections with other community members.

Week 1: Correction of the differences in source and target tag sets of the morphological dictionaries (possibly write rules for this part).

Deliverable #1: Changed morphological dictionaries that reflect the differences in tag sets.

Week 2, 3: Generation of the bilingual translation dictionary (sl-es, es-it). Five days to compile the bilingual translation dictionary using Google Translate, three days to compile the bilingual translation dictionary using the method presented in [3] and the rest of the second week to do a manual revision of the bilingual translation dictionary entries.

Week 4: Manual refinement of the bilingual dictionary. Statistical methods can produce erroneous data and I am not aware of any non-manual method that could help me in this task.

Deliverable #2: Bilingual dictionary.

Week 5: Preparation of the automatic evaluation framework based on METEOR [1]. The framework will be used continually throughout the development of the translation system. Preparation of a set of test sentences as a "gold standard" translation set. The gold standard sentences will be used to discover problems in the initial system. The evaluation system will be used at later stages to monitor changes in translation quality (hopefully for the better).

Deliverable #3: Evaluation system
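
As a rough idea of what the evaluation framework could compute, here is a simplified sketch of the recall-weighted harmonic mean (Fmean) from the original METEOR metric [1], using exact unigram matches only. Stemming, synonym matching and the fragmentation penalty of the full metric are left out, and the function names `unigram_fmean` and `evaluate` are illustrative assumptions, not part of any existing tool.

```python
from collections import Counter


def unigram_fmean(hypothesis, reference):
    """Simplified METEOR-style score: Fmean = 10PR / (R + 9P), exact matches only."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Count matched unigrams, clipping repeated words to the reference count.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Recall is weighted nine times more heavily than precision.
    return 10 * precision * recall / (recall + 9 * precision)


def evaluate(pairs):
    """Average the score over (system output, gold standard) sentence pairs."""
    scores = [unigram_fmean(hyp, ref) for hyp, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `evaluate` on the same gold standard set after each change to the dictionaries or transfer rules would show whether the translation quality moves in the right direction.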

Week 6: Compile transfer rules according to the contrastive grammar described in [6][7][8].

Week 7: Compile transfer rules according to the contrastive grammar described in [6][7][8].

Week 8, 9, 10: Iteratively construct the missing rules. Last year we realized that we did not have enough time to construct all the needed rules for the distant language pair Slovenian-Spanish. This time slot should suffice.

Deliverable #4: Transfer rules.

Week 11: Iteratively find/solve problems (debug).

Week 12: Final evaluation of the system.

Project completed


List your skills and give evidence of your qualifications.

I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Slovenia. I am in my final year, currently writing my bachelor thesis and an article about last year's GSoC project, Apertium-sl-es: machine translation between Slovene and Spanish.

I consider myself a very competent student of the undergraduate computer science programme, with a strong emphasis on data structures, algorithms, parallel programming and so on. In the past years I have worked with many programming languages, such as Java, C++, C#, ML, PHP and JavaScript.

I have also participated in many contests and projects:

- In 2008 I participated in the Google GHOP program, now called Google Code-in, and completed two tasks for two different projects.

- In 2010 I participated in the ACM ICPC Queue programming contest and also in the faculty programming contest, where I achieved third place.

- In 2010 I also completed a project for an organization called "Društvo DOVES", a member of "FEE International". The project consisted of building an online survey system for a well-known programme called BlueFlag. The architecture of the system is very complex, as it has to fit all their needs. The system has been released for all BlueFlag countries and has been running for two years now.

- In 2011 I participated in the ACM UPM (University Programming Marathon) contest, where my team achieved the title "Champion of the University of Primorska".

- In 2011 I also participated in and successfully completed the Google Summer of Code project Apertium-sl-es: machine translation between Slovene and Spanish.


As for my linguistic skills, my native language is Slovenian. I live close to the Italian border, in a bilingual territory where Italian has the status of an official language. My knowledge of Italian is as good as my Slovenian, since I have been studying Italian (advanced course) for 8 years. My girlfriend also has Italian roots and speaks Italian natively, so I believe my native-level knowledge of both languages can make a great contribution to this project.

List any non-Summer-of-Code plans you have for the summer, especially employment, if you are applying for

Google Summer of Code is my only plan for the summer. School finishes at the end of May, so I will easily be able to devote at least 30 hours a week to the GSoC project.


Coding Challenge

I am also working on the coding challenge for the Apertium-sl-it language pair. Further work will be done in the coming days.

Link: http://wiki.apertium.org/wiki/User:Shraier/GSoC2012-CodingChallenge


[1] Banerjee, S. and A. Lavie (2005): METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the ACL

[2] Vičič, Jernej (2010): Strojno prevajanje in slovenščina, Proceedings of the 13th International Multiconference Information Society - IS 2010

[3] Vičič, Jernej and Petr Homola (2010): Speeding up the Implementation Process of a Shallow Transfer Machine Translation System, Proceedings of the 14th EAMT Conference

[4] LDC (2005): Linguistic data annotation specification: Assessment of fluency and adequacy in translations

[5] Levenshtein, V. (1965): Binary codes capable of correcting deletions, insertions and reversals, Doklady Akademii Nauk

[6] Miklič, Tjaša and Martina Ožbot (2007): Teaching the uses of Italian verb forms to Slovene speakers. Linguistica, pages 65-76

[7] Miklič, Tjaša: Slovene and Italian contrastive grammar, unpublished student's script

[8] Ožbot, Martina (2009): Nekaj kontrastivnih beležk o italijanščini in slovenščini in nekaj opažanj o jezikovni produkciji pri govorcih slovenščine v Italiji. Jez. Slovst., pages 25-47