User:Ilnar.salimzyan/GSoC2014/Application
Contents
- 1 Contact information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 How it will benefit the society
- 6 Why should Google and Apertium sponsor it?
- 7 Major goals/deliverables
- 8 Available resources
- 9 Workplan
- 10 List your skills and give evidence of your qualifications
- 11 List any non-Summer-of-Code plans you have for the Summer
- 12 References
Contact information
Name: Ilnar Salimzyanov
E-mail address: ilnar.salimzyan@gmail.com
IRC: selimcan
Sourceforge account: selimcan
Timezone: UTC+4
Why is it you are interested in machine translation?
First, rule-based machine translation combines linguistics and programming, both of which interest me. Second, it is a complex problem, which makes it exciting to work on.
Why is it that you are interested in the Apertium project?
Apertium fascinates me as one of the few (the only?) open-source RBMT platforms that are still growing. I also appreciate its focus on minority languages.
Which of the published tasks are you interested in? What do you plan to do?
Task: Adopting a language pair
Title: Apertium-tat-rus -- machine translation system from Tatar to Russian
How it will benefit the society
Tatar is a Turkic language spoken by about 7 million people, primarily in and around Tatarstan, where it shares official status with Russian. A lot of translation work is done in Tatarstan: for example, all legislative documents are available in both Tatar and Russian, and news agencies publish in both languages. There is therefore demand for a Tatar-Russian machine translation system.
I understand that expecting the output of such a system to be post-edited and then published is too optimistic. However, I expect it to perform well on shorter segments of text and to be used, for example, in combination with a translation memory and a fuzzy-match repair system to support the human translator, as described in [1].
Another use case is gisting: allowing Russian speakers who do not speak Tatar to understand what a text is about. It might also be useful for learners of the language who want to look up words and phrases quickly.
Why should Google and Apertium sponsor it?
In my 2012 proposal for the Kazakh-Tatar pair I wrote that "Turkic languages represent a very good 'new working field' for Apertium -- they build a large group (more than 150 Mio speakers) of similar languages, but the most of the languages belong to non-central ones". Since then a lot has been done regarding Turkic languages in Apertium, and now there are several Turkic-to-Turkic pairs and transducers for many of them.
Like Tatar, most Turkic languages share official status with Russian, either in a region of Russia or in one of the post-Soviet republics. There are no released Turkic-to-Russian pairs in Apertium yet. The value of the Tatar-Russian pair will lie not only in the pair itself, but also in its being the first system in Apertium [2] for translating from a Turkic language to Russian (hopefully released by the end of the GSoC programme).
I will place high emphasis on making the code obvious and tested, so that the purpose of each disambiguation and transfer rule is clear. In particular, I will provide "Input (from the previous module)" and "Output (after this rule applied)" comments for the rules in a disciplined manner and try to keep rules independent of each other. This will make the rules easy to understand and to adapt to other Turkic languages. I will also try, as far as possible, to write disambiguation rules that are not tied to a particular wordform or lemma.
Another benefit for Apertium is that the Russian transducer would be extended and tested further. It is already used in many pairs and could potentially be used in many more.
Major goals/deliverables
- 10000 top stems in bidix and at least 80% trimmed coverage
- Clean testvoc
- Constraint grammar for Tatar (per-token ambiguity <= __ and precision >= __; full disambiguation of a particular hand-tagged corpus)
- Average WER (word error rate) on unseen texts below 50%
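The WER goal can be made concrete with a minimal sketch. It assumes that WER is the word-level edit distance between the system output and a reference (for example post-edited) translation, divided by the reference length and expressed as a percentage. The function names are illustrative and are not part of any Apertium tool.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of an extra output word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER in percent for one reference/hypothesis pair (whitespace tokens)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def average_wer(pairs):
    """Mean WER over an iterable of (reference, hypothesis) pairs."""
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)
```

For instance, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 50.0. Averaging per text, as done here, is only one possible reading of "average WER"; pooling all edits over all reference words would weight longer texts more heavily.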
Available resources
Morphological dictionaries
I will not have to start from scratch. Transducers for both Tatar and Russian are already available, with 90% coverage each (Russian as well?). Only a few phonology issues remain in apertium-tat; apertium-rus might require some evaluation and checking of the paradigms.
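Coverage figures like the one above, and the per-token ambiguity named in the goals, could be estimated roughly as follows. This is a sketch under stated assumptions: it reads analyser output in the Apertium stream format, where each lexical unit looks like `^surface/analysis1/analysis2$` and an unknown word is marked with a leading `*` in its analysis. It ignores escaped characters, and the sample analyses and tags are invented for illustration only.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$ (escaped characters are ignored)
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def parse_units(stream):
    """Return a list of (surface, [analyses]) pairs found in the stream."""
    return [(m.group(1), m.group(2).split("/"))
            for m in LEXICAL_UNIT.finditer(stream)]


def is_known(analyses):
    """A unit counts as known unless its analysis starts with '*'."""
    return not analyses[0].startswith("*")


def coverage(units):
    """Percentage of tokens that received at least one analysis."""
    known = sum(1 for _, analyses in units if is_known(analyses))
    return 100.0 * known / len(units)


def mean_ambiguity(units):
    """Average number of analyses per known token."""
    counts = [len(analyses) for _, analyses in units if is_known(analyses)]
    return sum(counts) / len(counts)


# Invented sample: two known tokens (one of them ambiguous) and one unknown token.
SAMPLE = "^kitap/kitap<n><nom>$ ^ukiy/uki<v><pres>/uki<v><fut>$ ^xyz/*xyz$"
UNITS = parse_units(SAMPLE)
```

On this sample, two of the three tokens are known and the known tokens carry 1.5 analyses on average. The sketch measures naive coverage of the analyser output; trimmed coverage would be measured the same way after the analyser has been restricted to the stems present in the bilingual dictionary.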
Bilingual dictionary
The bilingual dictionary has to be written from scratch. However, there are many dictionaries, both in print and online (e.g. at [3]), that can be used for reference.
Parallel corpora
A lot of Tatar prose has been translated into Russian, and some of it is available online [4]. Of more value for this project are the legislative documents and news on the government's site (tatarstan.ru), which are in the public domain. Being legal texts, they are not 'free' translations, as translations of fiction usually are.
Workplan
Overview
Month | Community bonding | Month 1 | Month 2 | Month 3 |
---|---|---|---|---|
Morning | Coverage | Tatar CG | Lexical selection | |
Afternoon | Morphology & Testvocing closed categories and open categories leaving only one word in the latter[1] | Transfer rules |
Schedule
See GSoC 2014 Timeline for complete timeline.
week | dates | goals | notes |
---|---|---|---|
post-application period | 22 March - 20 April | | |
community bonding period | 21 April - 19 May | | |
1 | 19 - 24 May | | |
2 | 25 - 31 May | | |
3 | 1 - 7 June | | |
4 | 8 - 14 June | | |
5 | 15 - 21 June | | |
6 | 22 - 28 June | | |
7 | 29 June - 5 July | | |
midterm eval | 6 July | | |
8 | 6 - 12 July | | |
9 | 13 - 19 July | | |
10 | 20 - 26 July | | |
11 | 27 July - 2 August | | |
12 | 3 - 10 August | | |
pencils-down week, final evaluation | 11 August - 18 August | | |
List your skills and give evidence of your qualifications
I studied German philology and Applied Linguistics at Kazan Federal University and have recently been accepted to the M.Sc. programme in Computational Linguistics offered by the Natural Language Processing Institute at the University of Stuttgart.
I already participated in GSoC in 2012, when I worked on the Kazakh-Tatar pair, which I currently maintain.
List any non-Summer-of-Code plans you have for the Summer
I have no other commitments for the entire period of the program.
References
[1] Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair
[2] There is a closed-source bidirectional Tatar-Russian machine translation system for Windows, available free of charge at tatar.com.ru. See this page for how it translates the "Mary and James" story in both directions.
[3] suzlek.ru
[4] http://rinatmuhamadiev.ru/produkts.html
- ↑ In the manner described here, with the difference that pronouns.exp or whatever it is will be first saved as a yaml file (with clitics removed) in apertium-tat/tests/morphotactics. That way I will be able to check whether the output of the Tatar transducer actually makes sense before trying to get it testvoc clean.