User:Shraier/Application

Apertium-sl-es: machine translation between Slovene and Spanish

Aleš Horvat

06.04.2011


Name: Aleš Horvat

E-mail address: /

Other information that may be useful to contact you:

  • Cell phone: /
  • IRC: /
  • Skype: /


Why is it you are interested in machine translation?

Giants such as Google and Microsoft have been developing algorithms for automatic machine translation for years. I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies and have been involved in translation for quite some time, so I think this would be a great opportunity to apply my development skills and my ability to complete a longer-term project. I have had a very strong interest in linguistics since childhood, and I think machine translation is one of the most exciting areas of technology. I love the idea of helping the open-source community by sharing my knowledge and making it available to everyone. It fascinates me how languages can express the same things in different ways, and how machine translation makes you notice these small differences by making cross-language variation explicit.


Why is it that you are interested in the Apertium project?

The idea of a free and open-source platform for developing rule-based machine translation systems, like Apertium, is great. From what I have seen of the Apertium architecture so far, I like its low entry requirements for creating new translation pairs.

I have been following Apertium for a long time. My professor at the university got me interested in its architecture and features, so I took a deeper look into it. One of the most appealing reasons for using a rule-based machine translation (RBMT) system like Apertium is that experts in the field can refine automatically produced data. All data is stored in XML, which is human-readable and editable. I believe it provides everything a developer needs, and that is why I would really love to be part of its community, even after this contest.


Which of the published tasks are you interested in? What do you plan to do?

Title

Apertium-sl-es: machine translation between Slovene and Spanish


Reasons why Google and Apertium should sponsor it

Currently Apertium does not have a release-quality translation system for the Slovenian-Spanish language pair. Moreover, there is no translation tool for this pair that uses a direct translation method. Google and Microsoft have developed translation tools which translate this language pair via another language (English) [2]. Consequently the quality is not as good as it could be with a direct translation approach. I think this language pair would be a great contribution to Apertium.


How and who it will benefit in society

The Republic of Slovenia has a population of approximately 2.05 million. It is located in Central Europe, touching the Alps and bordering the Mediterranean. Slovenia borders Italy, Croatia, Hungary and Austria, and also has a small stretch of coastline along the Adriatic Sea.

According to Wikipedia, Slovenia ranks among the top European countries in knowledge of foreign languages. The most commonly taught foreign languages are English, German, Italian, French and Spanish. Our educational institutions would like Spanish to be taught as an obligatory language in schools. Consequently I think it would be great for educational institutions to have access to an open-source system like Apertium with a translator between Slovene and Spanish in both directions.

On the other hand, tourism has grown tremendously in the past few years, and we have more and more tourists every year. Many people from foreign countries spend time in Slovenia, which is why I think a Slovene-Spanish translator working in both directions would be very useful for everyone, tourists and Slovenians alike. A mobile application could be developed later on to make the translator handy to use.

In my opinion, the Apertium-sl-es pair would benefit quite a large portion of the world's population.


Detailed work plan (including, if possible, a brief schedule with milestones and deliverables)

The apertium-sl-mk pair in Apertium SVN has a monolingual morphology for Slovenian that I will use in my project. I checked it out and found that it contains errors, which I will correct at the beginning of the project. I will also take the monolingual morphology for Spanish from an already existing dictionary.

For the POS tagger I will take the probabilities contained in the sl-mk system. After that I will spend some time building a good and reliable bilingual dictionary with Google Translate. I ran a small-scale test, translating randomly selected lemmas from the Slovenian side of the bilingual dictionary, and came to these conclusions:

I will use Google Translate for nouns and adjectives because it translates them very well, but for verbs and pronouns I will have to take a different approach, and I am not ruling out doing this manually. The other word classes will be done manually. For the time being, my research has not gone further than this.

Transfer rules will be constructed manually according to [6][7][8][9]. There is an ongoing project to build a Slovene-to-Italian Apertium translation system, and I will try to reuse its transfer rules.


Week plan

Community bonding period: In this period I will familiarize myself thoroughly with the Apertium system. I will initialize the new language pair and add it to the SVN repository. I will also search for aligned Slovene-Spanish text, which will help me later on. In this period I would also like to become more a part of the Apertium community and get to know other community members.

Week 1: Correction of the differences in source and target tag sets of the morphological dictionaries (possibly write rules for this part).

Deliverable #1: Changed morphological dictionaries that reflect the differences in tag sets.

Week 2: Correction of errors of the Slovenian monolingual morphology (manual).

Week 3: Correction of errors of the Slovenian monolingual morphology (manual).

Deliverable #2: Cleaned Slovenian morphology.

Week 4: Preparation of the automatic evaluation framework based on METEOR [1]. The framework will be used continually throughout the development of the translation system. Preparation of a set of test sentences as a “gold standard” translation set. The gold standard sentences will be used to discover problems in the initial system. The evaluation system will be used at later stages to monitor changes in translation quality (hopefully for the better).

Deliverable #3: Evaluation system.
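
As a rough illustration of the kind of score such a framework could track, the sketch below computes only the unigram F-mean component described in [1], using exact word matches. It is a simplification, not full METEOR: stemming, synonym matching and the fragmentation penalty are omitted, and the function name and sample sentences are assumptions made for this example.

```python
from collections import Counter


def unigram_fmean(hypothesis: str, reference: str) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplified sketch: exact-match unigrams only, no stemming,
    no synonyms, no fragmentation penalty (so this is NOT full METEOR).
    """
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    if not hyp_tokens or not ref_tokens:
        return 0.0

    # Clipped unigram matches: each reference word can be matched once.
    matches = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if matches == 0:
        return 0.0

    precision = matches / len(hyp_tokens)
    recall = matches / len(ref_tokens)
    # F-mean from [1]: recall is weighted nine times more than precision.
    return 10 * precision * recall / (recall + 9 * precision)


if __name__ == "__main__":
    # Hypothetical gold-standard pair, for illustration only.
    gold = "la casa es muy grande"
    output = "la casa es grande"
    print(round(unigram_fmean(output, gold), 3))
```

Running a function like this over the whole gold standard set after each change to the dictionaries or rules would give a single number to compare between revisions.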

Week 5: Two days to compile the bilingual translation dictionary using Google Translate, three days to compile it using the method presented in [3], and the rest of the week for manual revision of the dictionary entries.

Week 6: Manual revision of the bilingual translation dictionary entries.

Week 7: Manual refinement of the bilingual dictionary. Statistical methods can produce erroneous data and I am not aware of any non-manual method that could help me in this task.

Deliverable #4: Bilingual dictionary.
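
Apertium bilingual dictionaries are XML files in which each entry pairs a left-side and a right-side lemma, each followed by its tags. The sketch below shows one way candidate lemma pairs might be turned into such entries for later manual revision. The function name, the chosen tags and the sample pair are illustrative assumptions, not taken from an existing dictionary.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(left_lemma, right_lemma, left_tags, right_tags):
    """Build one <e><p><l>..</l><r>..</r></p></e> bilingual dictionary entry."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma, tags in (("l", left_lemma, left_tags),
                              ("r", right_lemma, right_tags)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            # Each morphological tag is an empty <s n="..."/> element.
            ET.SubElement(node, "s", n=tag)
    return entry


if __name__ == "__main__":
    # Hypothetical candidate pair, e.g. from an automatic translation step.
    entry = make_bidix_entry("hiša", "casa", ["n"], ["n", "f"])
    print(ET.tostring(entry, encoding="unicode"))
```

Generating entries programmatically keeps the automatic step cheap, so most of the three weeks can go into the manual revision that the plan relies on.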

Week 8: Compile transfer rules according to the contrastive grammar described in [6][7][8][9].

Week 9: Compile transfer rules according to the contrastive grammar described in [6][7][8][9].

Deliverable #5: Transfer rules.

Week 10: Iteratively find/solve problems (debug).

Week 11: Iteratively find/solve problems (debug).

Week 12: Final evaluation using modified Levenshtein distance [5] and LDC [4] guidelines.
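
The plan does not specify how the Levenshtein distance is modified. A common adaptation for translation evaluation is to count edits over words rather than characters, and the sketch below assumes that variant; the function names and the normalization by reference length are assumptions for illustration.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Levenshtein distance computed over whitespace-separated words."""
    hyp = hypothesis.split()
    ref = reference.split()
    # prev[j] holds the distance between the processed hypothesis
    # prefix and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the number of reference words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hypothesis, reference) / ref_len


if __name__ == "__main__":
    # Hypothetical system output and post-edited reference.
    print(word_error_rate("la casa es grande", "la casa es muy grande"))
```

A lower value means fewer word-level corrections are needed to turn the system output into the reference translation.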

Project completed


List your skills and give evidence of your qualifications.

As mentioned before, I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Slovenia. I am in my final year, and my supervisor encouraged me to take part in GSoC. Successful completion of the project will result in an almost complete bachelor's thesis.

This is just one more reason for my complete devotion to the project’s successful realization.

I consider myself a very competent student of the undergraduate computer science program, with a strong emphasis on data structures, algorithms, parallel programming and so on. In the past few years I have worked with the following programming languages: Java, C++, C#, ML, PHP, JavaScript, etc.

In 2008 I participated in the Google GHOP program, now called Google Code-in, and completed two tasks for two different projects.

In 2010 I participated in the ACM ICPC Queue programming contest and also in the faculty programming contest, where I took third place.

A few months ago three of my friends and I participated in a programming course where we made a game that will soon be released as an open-source project for a mobile platform.

Recently I finished a project for an international organization called BlueFlag. The project consisted of building an online survey system that allows every national coordinator to create their own custom survey, with features such as predefined validations. The architecture of the system is very complex, as it has to fit all their needs. The system will soon be released for all BlueFlag countries.

As for my linguistic skills, my native language is Slovenian. I live close to the Italian border, in a bilingual territory where Italian has the status of an official language. My knowledge of Italian is very good, and knowing a Romance language will be very helpful for this project. Some members of my family are native speakers of Spanish.

List any non-Summer-of-Code plans you have for the summer, especially employment, if you are applying for

Google Summer of Code is my only plan for the summer. School finishes at the end of May, so I will easily be able to dedicate at least 30 hours per week to the GSoC project.


[1] Banerjee, S. and A. Lavie (2005): METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the ACL

[2] Vičič, Jernej (2010): Strojno prevajanje in slovenščina, Proceedings of the 13th International Multiconference Information Society - IS 2010

[3] Vičič, Jernej and Petr Homola (2010): Speeding up the Implementation Process of a Shallow Transfer Machine Translation System, Proceedings of the 14th EAMT Conference

[4] LDC (2005): Linguistic data annotation specification: Assessment of fluency and adequacy in translations

[5] Levenshtein, V. (1965): Binary codes capable of correcting deletions, insertions and reversals, Doklady Akademii Nauk

[6] Markič, Jasmina and Barbara Pihler: Španska slovnica po naše

[7] Miklič, Tjaša and Martina Ožbot (2007): Teaching the uses of Italian verb forms to Slovene speakers. Linguistica, pages 65-76

[8] Miklič, Tjaša: Slovene and Italian contrastive grammar, unpublished student's script

[9] Ožbot, Martina (2009): Nekaj kontrastivnih beležk o italijanščini in slovenščini in nekaj opažanj o jezikovni produkciji pri govorcih slovenščine v Italiji. Jez. Slovst., pages 25-47