User:Aha/GsocApplication

Name

Joanna Ruth

Contact information

E-mail: joannaruth1@gmail.com
Skype: joanna_ruth
IRC: Aha_(irc.freenode.net)

Why are you interested in machine translation?

Before I took up Computer Science, I had thought about becoming a language teacher, as I have always enjoyed learning foreign languages and exploring cultures. Eventually my interest in programming and technology took the lead, but it turned out that I can still expand my knowledge of linguistics through Natural Language Processing. Machine translation, being a sub-field of NLP, makes it possible to explore the grammar of a language and deal with it from a computational perspective. I really like the idea of automatic text translation, especially nowadays, when the Internet is growing so rapidly that it is impossible to translate all the information manually. MT produces translations that are (at present) less accurate than those made by human translators, but in many cases sufficient. I can't wait to see people of different nations communicating with one another without needing to know the languages used by their interlocutors. I'm sure the future has it in store.

Why are you interested in the Apertium project?

I'm strongly convinced that the only chance for a machine translation project to be successful is to realize it as open source. Only within a multinational, motivated community like Apertium's is it possible to cover so many language pairs. The project supports both widely spoken and minority languages. In the age of globalization this is a very important issue, as many languages are in danger of dying out.

Why Google and Apertium should sponsor it?

Adding new language pairs is Apertium's top priority. The introduction of each language might significantly increase the number of people using it. There is very little support for West Slavic languages in Apertium at present: none of the languages from this group appears among the language pairs in the release or stable versions of the project. Polish-English and Czech-Slovenian are currently under development, but a lot remains to be done to make them work. Developing the Polish-Czech pair would help improve the other pairs mentioned. Apertium has proven to be a very good platform for closely related languages like Polish and Czech; I therefore think that bringing this pair to Apertium will be very beneficial and should give high-quality results.

How and who it will benefit in society?

A great number of people use the Internet as their primary source of information. Because of the language barrier, the data that might be of use to them is limited to what is available in the languages they speak. Introducing the Polish-Czech language pair in Apertium might help a lot in this respect. Polish and Czech are very close languages, and thanks to that Polish people can usually understand Czech (and vice versa). Misunderstandings nevertheless occur relatively often because of so-called false friends: words that sound or look similar but differ in meaning. Development of the Polish-Czech language pair might help solve this problem. It will also bring other benefits: better software localisation and quicker text translation by human translators (who might use Apertium to obtain a preliminary translation).

Which of the published tasks are you interested in? What do you plan to do?

The project I'd like to work on is Polish-Czech language pair machine translation for Apertium.

Some work has already been done for this language pair. I consulted with Jimmy O'Regan and found out that most inflection rules for Polish are already covered and I should focus mainly on expanding the vocabulary. The Czech part is more or less at the same stage of development.

I intend to use the IPI PAN "Frequency dictionary of contemporary Polish" as a source of hand-tagged Polish training corpora (around 500 000 words). It needs, however, to be converted to the format used by apertium-tagger before supervised tagger training can be carried out. The morphological data might be retrieved from Morfologik (a morphological dictionary of Polish) and from Morfo (a Czech morphological analyzer).

Czech, like Polish, is a West Slavic language, and consequently they have a lot in common. Both are highly inflected, with 7 cases for nouns, pronouns, adjectives and numerals. Genders are almost the same; however, Polish has an additional masculine personal gender. Word order is more or less the same, but Czech allows more freedom, which may pose a challenge in translation. Nevertheless, the translation is expected to give good results due to the closeness of the two languages.

I have already become quite familiar with the Apertium framework. I added some words and paradigms to the dictionaries and updated the pending tests for the Polish-Czech language pair.

Currently the Polish monodix contains 199 paradigms/518 lemmas and the Czech monodix contains 216 paradigms/1148 lemmas. There are 502 entries in the bilingual dictionary, but only around 50-60 words can be translated correctly in each direction.

Work plan

Community Bonding Period

set up work environment (installation and configuration)
study Polish and Czech language rules thoroughly
check what has already been done (study monodices from Czech-Slovenian and Polish-English language pairs)
get monolingual and multilingual aligned corpora for further analysis (possibly from JRC Acquis)
prepare a list of words sorted by frequency of occurrence for both dictionaries (to acquire at least 80% coverage)
learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
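
The frequency-list step above could be sketched roughly as follows. This is only an illustration under my own assumptions: the corpus is taken to be plain text, tokens are simple runs of word characters, and the function names `frequency_list` and `words_for_coverage` are hypothetical helpers, not part of any Apertium tool.

```python
from collections import Counter
import re


def frequency_list(text):
    """Return (word, count) pairs sorted by descending frequency.

    Assumption: a token is any run of word characters, lowercased.
    """
    words = re.findall(r"\w+", text.lower())
    # Sort by count (descending), then alphabetically for a stable order.
    return sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))


def words_for_coverage(freq_list, target=0.8):
    """Return the shortest prefix of the frequency list whose
    occurrences account for at least `target` of all tokens."""
    total = sum(count for _, count in freq_list)
    covered = 0
    selected = []
    for word, count in freq_list:
        if total and covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected
```

The words returned by `words_for_coverage` would be the ones to add to a monodix first; the 80% target is the coverage goal stated in the plan.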

Week1

write test scripts (make use of the existing language-pair regression and corpus tests)
add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

work on Polish monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

work on Czech monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

add the rest of the words

Deliverable1: Desirable coverage acquired for both languages

Week5

gather translational data with the use of parallel corpora
add basic transfer rules for the purpose of testing, verify the tag definition files
work on bilingual dictionary

Week6

work further on bilingual dictionary
update the Polish-Czech page of the "False Friends of the Slavist" wikibook

Week7

prepare a list of word sequences that frequently appear together for both Polish and Czech (use the Apriori algorithm to find frequent sets)
add multiwords with translations to the dictionaries
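
A minimal sketch of how Apriori could be applied to this step is given below. It rests on assumptions of mine: each sentence is treated as one transaction (a set of word forms), `min_support` is an absolute number of sentences, and the function name `apriori` is just an illustrative helper. Because itemsets ignore word order and adjacency, the frequent sets found this way would still have to be checked against the corpus to keep only those that occur as contiguous sequences.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset_of_words: support} for every word set that
    occurs in at least `min_support` transactions (sentences)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single words.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-sets into k-set candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count the support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy input: tokenized Polish sentences.
sentences = [
    ["na", "przykład", "dom"],
    ["na", "przykład", "kot"],
    ["na", "pewno"],
    ["kot", "dom"],
]
frequent_sets = apriori(sentences, min_support=2)
```

With this toy input, the word set {"na", "przykład"} is the only pair that reaches the support threshold, which makes it a candidate multiword entry.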

Week8

bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

obtain hand-tagged training corpora
study the word order rules of Czech and Polish (identify restrictions)
work on tag definition files
carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

extract segments of the parallel corpora that are translated (more or less) literally
work on transfer rules

Week11

carry out thorough regression tests
check dictionaries manually to spot possible errors

Week12

clean up, evaluate results

Project completed

Throughout the work, the quality of translations will be monitored by means of regression and vocabulary tests. I will consult on the work at every stage and report progress on a dedicated wiki page.

List your skills and give evidence of your qualifications

I'm currently a first-year student of the Master's programme in Computer Science at Gdansk University of Technology, Poland. I follow an Individual Studies Program and have received a scholarship for high academic achievement. During my previous studies I did a lot of programming, mainly in C/C++, Java and C#. I have also completed courses in algorithms and data structures, logic, operating systems (shell scripting, regular expressions), data mining, automata theory and formal languages. I have learned how a compiler works and how to generate simple lexical, syntactic and semantic analyzers for Pascal and Ada using flex, bison and yacc. I also completed a course in artificial intelligence, where I learned about hidden Markov models and neural networks.

So far I haven't participated in an open-source project, but I've been involved in several research projects at my university concerning motion detection and tracking and hand gesture recognition. At present I'm working on a speech interface for a smart medical services system that will enable the user to communicate using a 3D avatar.

I have been working as an intern at the Speednet company for 1.5 years. During that time I was part of a team that developed an Electronic Health Card System. I was responsible for the mobile part of the system, written in .NET Compact Framework. I became familiar with software localisation and used MT to automate translation between Polish and English. Apart from that, I learned how to use TortoiseSVN and MantisBT.

In my projects I use the PostgreSQL and Microsoft SQL Server DBMSes. Recently I also started a course in Oracle. I know .NET technologies (Windows Forms, Windows Forms CE, WPF, WCF, Silverlight) and the basics of JEE (servlets, JSP/JSF, Facelets, JPA, JAAS, JMS). I'm also familiar with distributed and parallel programming concepts.

I really enjoy learning languages and I consider myself good at it. I know Polish (mother tongue), English (Cambridge CAE Certificate), German (pre-intermediate level) and some basic Croatian. Whenever I go abroad I always remember to take a language guide with me. Although I have never taken Czech lessons, I can understand it quite well because of its similarity to Polish. I strongly believe I can successfully build a translator for this language pair.

My non-Summer-of-Code plans for the Summer

I have no other plans for the summer than the GSoC program. I intended to apply for a job, but if my application is accepted I'll postpone that until the project is completed. The GSoC program begins before my academic year ends, so I would like to work on the project a bit longer than specified, perhaps until the end of August or even longer. During May and June I will have to combine my studies with developing the project; I will be able to focus fully on it once my summer break starts in July. I'm sure there won't be any problems with studying and working on the GSoC project simultaneously, as I have already managed to work during 3 semesters of my studies.