User:Bibaeva/proposal

Contact information

Name: Maria Bibaeva
E-mail: melisanushk@gmail.com
IRC: melisan
Phone: 89030131839
Place: Moscow, Russia (UTC+3)
Github: https://github.com/mbibaeva

Why is it you are interested in machine translation?

Machine translation is one of the major tasks of modern computational linguistics. The field is well developed but far from solved, and I want not only to learn more about it but also to see exactly how it works in practice. That experience would be extremely useful for me as a computational linguist.

Why is it that you are interested in Apertium?

As a computational linguist, I find it both interesting and rewarding to contribute to a language tool like Apertium. Another reason is that Apertium works with minority languages, which lets me use not only my programming skills but also my knowledge of languages such as Moksha and Hill Mari.

Reasons why Google and Apertium should sponsor it

Apertium currently works with only one Uralic language, and it is paired with a language from a different family. Adding other languages of this family would be valuable, so that at least several monolingual dictionaries of Uralic languages exist.

Skills

Programming and computer skills: Python 3, HTML, R, JS
Languages: Russian (native), English (advanced), German (intermediate), French (intermediate), Japanese (beginner), Moksha (as an object of research), Hill Mari (as an object of research)
Useful courses: Natural Language Processing, Theory of Computation, Language Diversity, Lexicography, Formal Semantics

The Task

I think the best task for me would be to adopt an unreleased language pair, specifically Moksha-Russian (mdf-rus), but I could also work with Erzya (myv) or Hill Mari (mrj).

Work plan

Post-application period:
Learn more about Apertium and its tools, install Linux and get used to it, and get acquainted with the code of other Apertium bilingual dictionaries.

Summer
Week 1: determine word frequencies for both Russian and Moksha using corpora, and adapt an existing Moksha dictionary for the work.
Week 2: check the Russian monolingual dictionary and start working on the Moksha monolingual dictionary.
Week 3-4: build the Moksha monolingual dictionary.

Deliverable 1: a proper monolingual dictionary.

Week 5-6: start working on the bilingual dictionary and on transfer for nouns.
Week 7-8: work on transfer for verbs and adjectives.

Deliverable 2: two proper monolingual dictionaries and part of the bilingual dictionary.

Week 9-10: work on other parts of speech and constructions.
Week 11: testing, fixing, adding whatever needs to be added.
Week 12: final debugging, documentation, and cleaning up the code.
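
The frequency step planned for Week 1 could be sketched roughly as follows. This is only a minimal illustration, not the final method: the tokenizer pattern, the function names `word_frequencies` and `top_words`, and the Russian sample sentence are all assumptions made for the example, and a real run would read the corpus from files and would likely need language-specific tokenization.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of Unicode letters, optionally joined by hyphens.
# Digits, punctuation and underscores are treated as separators.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def word_frequencies(text):
    """Return a Counter of lowercased word tokens found in `text`."""
    return Counter(token.lower() for token in TOKEN_RE.findall(text))


def top_words(text, n=10):
    """Return the `n` most frequent tokens as (word, count) pairs."""
    return word_frequencies(text).most_common(n)


# Placeholder corpus: a made-up Russian sample, not real project data.
sample = "Дом стоит у реки. У реки стоит старый дом."
freqs = word_frequencies(sample)

for word, count in top_words(sample, n=5):
    print(word, count)
```

Sorting the dictionary work by such counts would let the most frequent words be added first, so that corpus coverage grows as fast as possible in the early weeks.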

Non-Summer-of-Code plans for the Summer

I have several exams in June. I will try to take them early, but if that does not work out, they may take about 3 hours out of my daily working time. I am free for the rest of the summer, so I should be able to devote 45-50 hours per week to the task.