User:Shraier/GSoC2012-Application2
Apertium-sl-sh: machine translation between Slovene and Serbo-Croatian
Name: Aleš Horvat
Other information that may be useful to contact you: IRC: Shraier
Why is it you are interested in machine translation?
No matter how many years have passed since the first machine translation systems were developed, and no matter how many development hours big companies like Google and Microsoft have put into them, the perfect machine translation system is still far away. I find it fascinating how languages can express the same things in different ways, and how machine translation makes you notice these small differences by making cross-language variation explicit. I have been involved in translation for quite some time and I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies, so I believe this would be a great opportunity to use my development skills and my ability to complete a longer-term project. I have had a strong interest in linguistics since childhood, and it has grown further since I started using Apertium; I think this is one of the most exciting areas of technology. I love the idea of helping the open-source community by sharing my knowledge and time to make the results available to everyone.
Why is it that you are interested in the Apertium project?
The idea of a free and open-source platform for developing rule-based machine translation systems, like Apertium, is great. From what I have seen while working with Apertium, I like its architecture and its low entry requirements for building new translation pairs. In the past months I have also been working on my bachelor thesis and an article, both related to last year's GSoC project, Apertium-sl-es: machine translation between Slovene and Spanish.
I have been working with Apertium for more than a year now. At the very beginning, my professor at the university sparked my interest in its architecture and features, so I had to take a deeper look. One of the most appealing reasons for using a rule-based machine translation (RBMT) system like Apertium is that experts in the field can refine automatically produced data. All data is organized in XML files, which are human-readable and editable. I believe Apertium provides everything a developer needs to build a good translation system, and that is why I would really love to remain part of its community, even after this contest.
Which of the published tasks are you interested in? What do you plan to do?
Title
Apertium-sl-sh: machine translation between Slovene and Serbo-Croatian
Reasons why Google and Apertium should sponsor it
Apertium currently does not have a release-quality translation system for the Slovenian and Serbo-Croatian language pair. Moreover, there is no translation tool for this pair that uses a direct translation method. Consequently, the quality is not as good as it could be with a direct translation approach.
Another reason to sponsor this project is to finally develop a release-quality translation system for the Slovenian language. In the last GSoC program the Apertium-sh-mk language pair was successfully developed, and its resources would be reused in this project. They would contribute greatly to the development process, since the languages belong to the same group of South Slavic languages and are very similar. Another resource that would be reused is the unpublished Apertium-sl-sr language pair.
In my opinion, the development of the Slovene and Serbo-Croatian language pair would be a great contribution to Apertium and would at the same time help improve machine translation of the Slovenian, Croatian, Serbian and Bosnian standard languages.
How and who it will benefit in society
The Republic of Slovenia has a population of approximately 2.05 million. It is located in Central Europe, touching the Alps and bordering the Mediterranean. Slovenia borders Italy, Croatia, Hungary and Austria, and also has a short stretch of coastline along the Adriatic Sea. During most of the 20th century, Slovenia was part of Yugoslavia, a country in the western part of the Balkans established by the union of the State of Slovenes, Croats and Serbs and the Kingdom of Serbia.
Slovene, Serbian, Croatian and Bosnian belong to the South Slavic language group and were spoken mostly in the former Yugoslavia. These languages share common roots and a common historical environment. As mentioned before, they were once spoken and taught in the same country, but nowadays the younger generations, born after the breakup of Yugoslavia, have difficulties communicating with each other, so there is considerable interest in building such a translation system. All these languages are highly inflected and morphologically and derivationally rich, and they differ greatly from the languages most used in electronic materials, such as English, Arabic, Chinese, Spanish and French. In other words, most of the data and translation methods developed for those languages must be at least revised or, even worse, rewritten. The two languages of this pair are closely related lexically and syntactically, which simplifies most of the usual steps in producing a translation system.
According to Wikipedia, a significant part of the Slovenian population (around 113,000 people) speaks a variant of Serbo-Croatian (Serbian, Croatian, Bosnian or Montenegrin) as their native language. Overall, these languages are spoken by more than 17 million people, which is a large number of potential users of the translation system.
Detailed work plan (including, if possible, a brief schedule with milestones and deliverables)
The Apertium SVN has good and polished monolingual morphology files for both languages. I will take the monolingual morphology for Slovenian from the Apertium-sl-es pair and for Croatian from the Apertium-sh-mk pair. The monolingual morphology for Serbian will be taken from the unpublished language pair Apertium-sl-sr (it was listed in the incubator folder). I will rebuild the entire bilingual dictionary (sl-cr) from scratch, using the one from Apertium-sl-sr as a rough guideline and reusing the good translation entries. Further work will go into manual revision of the bilingual dictionary: I will add missing translations and fix mistakes and differences between the source and target tag sets (writing rules for this part if needed). Additional resources such as transfer rules will be taken from the Apertium-sl-sr language pair and modified. Additional work on the monolingual morphology of Serbian and on the bilingual dictionary for sl-sr will be done only if time permits. The sl-cr pair will have priority.
I repeated last year's small-scale test of translating randomly selected lemmas from the Slovenian side of the bilingual dictionary and came to the same conclusions:
I will use Google Translate for nouns and adjectives because it translates them very well, but for verbs and pronouns I will have to take a different approach. I am not ruling out manual work if it is needed.
For disambiguation of Slovenian, the “sh” Constraint Grammar from the Apertium-sh-mk language pair will be modified to suit our needs. I will devote as much time as possible to discovering the differences between the languages and writing appropriate transfer rules. Transfer rules for the sl-cr direction will be taken from the Apertium-sl-sr language pair and modified as needed; rules for the opposite direction will be written from scratch.
The Multext-east corpus will be used as a resource for the construction of the bilingual translation dictionary.
Week plan
Community bonding period: In this period I will set up a clean apertium-sl-sh system and correct the differences between the source and target tag sets of the morphological dictionaries. I will do a small survey to find aligned Slovene and Serbo-Croatian texts, which will help me later on. I will add a small set of entries to the bilingual dictionary to test the whole Apertium pipeline and also prepare a set of pending translation candidates. In this period I would also like to become more involved in the Apertium community and to renew connections with other community members.
Week 1: Correct the differences between the source and target tag sets of the morphological dictionaries (possibly writing rules for this part).
Deliverable #1: Changed morphological dictionaries that reflect the differences in tag sets.
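As an illustration of how the tag-set differences could be located before fixing them, here is a minimal sketch. It assumes that each monolingual dictionary declares its symbols as `<sdef n="..."/>` elements inside `<sdefs>`; the two XML fragments and the tag names in them are toy examples, not taken from the real apertium-sl-es or apertium-sh-mk files, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def symbol_set(dix_xml):
    """Return the set of symbol names declared as <sdef n="..."/> in a dictionary."""
    root = ET.fromstring(dix_xml)
    return {sdef.get("n") for sdef in root.iter("sdef") if sdef.get("n")}


def tag_differences(source_xml, target_xml):
    """Return (tags only in source, tags only in target), each sorted."""
    src = symbol_set(source_xml)
    tgt = symbol_set(target_xml)
    return sorted(src - tgt), sorted(tgt - src)


# Toy fragments (assumption): illustrative tags only.
SL_SAMPLE = '<dictionary><sdefs><sdef n="n"/><sdef n="sg"/><sdef n="du"/></sdefs></dictionary>'
SH_SAMPLE = '<dictionary><sdefs><sdef n="n"/><sdef n="sg"/><sdef n="voc"/></sdefs></dictionary>'

only_sl, only_sh = tag_differences(SL_SAMPLE, SH_SAMPLE)
print("only in source:", only_sl)
print("only in target:", only_sh)
```

Tags that appear on only one side are the candidates for renaming in the dictionaries or for handling with dedicated rules.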
Week 2, 3: Generation of the bilingual translation dictionary (sl-cr, sl-sr). Five days to compile the bilingual dictionary using Google Translate, three days to compile it using the method presented in [3], and the rest of the second week for a manual revision of the dictionary entries. As stated before, sl-cr will have priority.
Week 4: Manual refinement of the bilingual dictionary. Statistical methods can produce erroneous data and I am not aware of any non-manual method that could help me in this task.
Deliverable #2: Bilingual dictionary.
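Once translation candidates have been collected and revised, they have to be written out as bilingual dictionary entries. The sketch below shows one way this step could be scripted, assuming the usual `<e><p><l>…</l><r>…</r></p></e>` entry layout with `<s n="..."/>` tags. The candidate list, the single `n` tag, and the helper names are assumptions made for illustration; real entries would carry the full tag sets agreed on in week 1.

```python
import xml.etree.ElementTree as ET


def make_entry(left_lemma, right_lemma, tags):
    """Build one <e><p><l>..</l><r>..</r></p></e> element with the same tags on both sides."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", left_lemma), ("r", right_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            ET.SubElement(node, "s", {"n": tag})
    return entry


def build_entries(candidates):
    """Serialize candidate (source lemma, target lemma, tags) triples, skipping duplicates."""
    seen = set()
    entries = []
    for left, right, tags in candidates:
        key = (left, right, tuple(tags))
        if key in seen:
            continue
        seen.add(key)
        entries.append(ET.tostring(make_entry(left, right, tags), encoding="unicode"))
    return entries


# Hypothetical candidates; the repeated pair is dropped.
CANDIDATES = [
    ("hiša", "kuća", ["n"]),
    ("knjiga", "knjiga", ["n"]),
    ("hiša", "kuća", ["n"]),
]

entries = build_entries(CANDIDATES)
for line in entries:
    print(line)
```

Deduplicating on the full (lemma, lemma, tags) key keeps genuine alternatives, such as one source lemma with two different translations, while dropping exact repeats produced by merging several candidate sources.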
Week 5: Preparation of the automatic evaluation framework based on METEOR [1]. The framework will be used continually throughout the development of the translation systems. Preparation of a set of test sentences as a “gold standard” translation set. The gold standard sentences will be used to discover problems in the initial systems. The evaluation system will be used at later stages to monitor changes in translation quality (hopefully for the better).
Deliverable #3: Evaluation system
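To make the idea of scoring system output against gold standard sentences concrete, here is a deliberately simplified sketch. It is not METEOR itself: it uses exact unigram matching only and the recall-weighted harmonic mean Fmean = 10PR / (R + 9P) from [1], and it leaves out stemming, synonym matching and the fragmentation penalty. The function names and the sample sentence pairs are assumptions for illustration.

```python
from collections import Counter


def unigram_fmean(hypothesis, reference):
    """Recall-weighted harmonic mean of exact unigram precision and recall."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each reference word at most as often as it occurs.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


def evaluate(pairs):
    """Average the sentence scores over (system output, gold standard) pairs."""
    scores = [unigram_fmean(hyp, ref) for hyp, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical test set: system output paired with its gold standard translation.
TEST_SET = [
    ("ovo je kuća", "ovo je kuća"),
    ("ovo je kuća", "ovo je moja kuća"),
]

print(round(evaluate(TEST_SET), 3))
```

Running the same test set after every change to the dictionaries or transfer rules gives a single number whose movement over time indicates whether translation quality is improving.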
Week 6 and 7: Compile transfer rules for “sl-cr” direction.
Week 8, 9, 10: Iteratively construct the missing rules and compile transfer rules for “cr-sl” direction.
Deliverable #4: Transfer rules.
Week 11: Iteratively find/solve problems (debug).
Week 12: Final evaluation of the system.
Project completed
List your skills and give evidence of your qualifications.
I study computer science at the Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Slovenia. I am in my final year, currently writing my bachelor thesis and an article about last year's GSoC project, Apertium-sl-es: machine translation between Slovene and Spanish.
I consider myself a very competent undergraduate computer science student, with a strong emphasis on data structures, algorithms, parallel programming and so on. In the past years I have worked with many programming languages, including Java, C++, C#, ML, PHP and JavaScript.
I have also participated in many contests and projects:
- In 2008 I participated in the Google GHOP program, now called Google Code-in, and completed two tasks for two different projects.
- In 2010 I participated in the ACM ICPC Queue programming contest and also in the faculty programming contest, where I took third place.
- In 2010 I also completed a project for an organization called “Društvo DOVES”, a member of “FEE International”. The project consisted of building an online survey system for a well-known programme called BlueFlag. The architecture of the system is very complex, as it has to fit all their needs. The system has been released for all BlueFlag countries and has been running for two years now.
- In 2011 I participated in the ACM UPM (University Programming Marathon) contest, where my team won the title “Champion of the University of Primorska”.
- In 2011 I also participated in and successfully completed the Google Summer of Code project Apertium-sl-es: machine translation between Slovene and Spanish.
As for my linguistic skills, my native language is Slovenian. I live close to the Italian border, in a bilingual area where Italian has the status of an official language. My home is even closer to the Croatian border (about 500 m away), and I understand most Croatian and Serbian. My parents speak both Croatian and Serbian at a native level, since they lived in Yugoslavia. My close friends also speak Croatian and Serbian very well, so I believe a great translation system can be developed.
List any non-Summer-of-Code plans you have for the summer, especially employment, if you are applying for
Google Summer of Code is my only plan for the summer. School finishes at the end of May, so I will easily be able to devote at least 30 hours a week to the GSoC project.
Coding Challenge
I am also working on the coding challenge for the Apertium-sl-it language pair. Unfortunately, I did not have much time in the past two weeks, so I could not work on the coding challenge for both proposals. Most of the steps I took in the described challenge apply to adding a new language rather than to a specific language pair.
Link: http://wiki.apertium.org/wiki/User:Shraier/GSoC2012-CodingChallenge
[1] Banerjee, S. and A. Lavie (2005): METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the ACL
[2] Vičič, Jernej (2010): Strojno prevajanje in slovenščina, Proceedings of the 13th International Multiconference Information Society - IS 2010
[3] Vičič, Jernej and Petr Homola (2010): Speeding up the Implementation Process of a Shallow Transfer Machine Translation System, Proceedings of the 14th EAMT Conference
[4] LDC (2005): Linguistic data annotation specification: Assessment of fluency and adequacy in translations
[5] Levenshtein, V. (1965): Binary codes capable of correcting deletions, insertions and reversals, Doklady Akademii Nauk