User:Ilnar.salimzyan/GSoC2014/Application


Contact information

Name: Ilnar Salimzyanov
E-mail address: ilnar.salimzyan@gmail.com
IRC: selimcan
Sourceforge account: selimcan
Timezone: UTC+4

Why is it you are interested in machine translation?

First, rule-based machine translation involves both linguistics and programming, and I am interested in both. Second, it is a complex problem, which makes it exciting to work on.

Why is it that you are interested in the Apertium project?

Apertium fascinates me as one of the few (the only?) open-source RBMT platforms that is showing growth. I also appreciate its focus on minority languages.

Which of the published tasks are you interested in? What do you plan to do?

Task: Adopting a language pair

Title: Apertium-tat-rus -- machine translation system from Tatar to Russian

How it will benefit the society

Tatar is a Turkic language spoken by about 7 million people, primarily in and around Tatarstan, where it shares official status with Russian. A lot of translation work is done in Tatarstan: for example, all legislative documents are available in both Tatar and Russian, and news agencies publish in both languages. There is therefore demand for a Tatar-Russian machine translation system.

I understand that it is too optimistic to expect the output of such a system to be post-edited and then published. However, I expect it to perform well on shorter segments of text and to be used, for example, in combination with a translation memory and a fuzzy-match repair system to support the human translator, as described in [1].

Another use case is gisting -- allowing Russian speakers who do not speak Tatar to understand what a text is about. It might also be useful for learners of the language who want to quickly look up words and phrases.

Why should Google and Apertium sponsor it?

In my 2012 proposal for the Kazakh-Tatar pair I wrote that "Turkic languages represent a very good 'new working field' for Apertium -- they build a large group (more than 150 Mio speakers) of similar languages, but the most of the languages belong to non-central ones". Since then a lot has been done regarding Turkic languages in Apertium, and now there are several Turkic-to-Turkic pairs and transducers for many of them.

Like Tatar, most Turkic languages share official status with Russian, either in a region of Russia or in one of the post-Soviet republics. There aren't any released Turkic-to-Russian pairs in Apertium yet. The value of the Tatar-Russian pair will lie not only in the pair itself, but also in its being the first system in Apertium [2] for translating from a Turkic language to Russian (hopefully released by the end of the GSoC programme).

I will place high emphasis on making the code obvious and tested, so that the purpose of each disambiguation and transfer rule is clear. In particular, I will provide Input (from the previous module) and Output (after this rule has applied) comments for the rules in a disciplined manner and try to keep the rules independent of each other. This will make them easy to understand and to adapt for other Turkic languages. I will also try, as much as possible, to write disambiguation rules that are not tied to a particular wordform or lemma.

Another benefit for Apertium is that the Russian transducer would be extended and tested further. It is already used in many pairs and can potentially be used in many more.

Major goals/deliverables

  • 10000 top stems in bidix and at least 80% trimmed coverage
  • Clean testvoc
  • Constraint grammar for Tatar (per-token ambiguity __ and precision __; full disambiguation of a particular hand-tagged corpus we are interested in?)
  • Average WER on unseen texts from the domain(s) I have been focusing on below 50%
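
To make the WER goal concrete, here is a minimal sketch of how word error rate could be computed for one translated segment against a post-edited reference. It assumes whitespace tokenisation and the usual definition (word-level edit distance divided by the number of reference words, in percent); the function name `word_error_rate` is only illustrative and is not part of Apertium.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length, in percent.

    Assumption: tokens are separated by whitespace; punctuation is not
    treated specially.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one missing word against a 4-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Averaging this value over a set of unseen segments would give the figure that the last goal refers to.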

Available resources

Morphological dictionaries

I will not have to start from scratch. Transducers for both Tatar and Russian are already available, with 90% coverage of each (Russian as well?). A few phonology issues remain in apertium-tat; apertium-rus might require some evaluation and checking of the paradigms.
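
As a rough illustration of what a coverage figure like this means, the sketch below computes naive coverage from analyser output. It assumes the output is in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function name `naive_coverage` and the sample input are made up for the example, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: no escaped "/" or "$" inside the unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Percentage of tokens for which the analyser returned an analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found in the input")
    # Unknown words are marked with "*" at the start of the analysis part.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


if __name__ == "__main__":
    # Made-up sample: two known tokens and two unknown ones.
    sample = "^a/a<n>$ ^b/*b$ ^c/c<v>/c<n>$ ^d/*d$"
    print(naive_coverage(sample))
```

Trimmed coverage would be measured the same way, but with the analyser restricted to the stems that are also in the bilingual dictionary.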

Bilingual dictionary

The bilingual dictionary has to be written from scratch. However, there are a lot of dictionaries, both in print and online (e.g. at [3]), which can be used for reference.

Parallel corpora

A lot of Tatar prose has been translated into Russian, and some of it is available online [4]. Of more value for this project are the legislative documents and news on the government's site [tatarstan.ru], which are in the public domain. Being legal texts, they are not 'free' translations, as translations of fiction usually are.

Workplan

Overview

Post-application:

  • improve WER of the 'James and Mary' story translation
  • read documentation on chunking-based transfer
  • read papers describing other Apertium pairs for distant languages

Community-bonding period:

  • collect test data for transfer (consider books on contrastive grammar of Tatar and Russian which I have)
  • set up regression, corpus and WER tests so that I can be sure that the translator improved rather than broke before I commit
  • write transfer rules for single-word chunks
  • set up a progress tracking page and scripts

Month | Month 1 | Month 2 | Month 3
Morning | Morphology & coverage | Tatar CG | Lexical selection
Afternoon | Transfer rules | Transfer rules | Transfer rules

List your skills and give evidence of your qualifications

I studied German philology and Applied Linguistics at Kazan Federal University and have recently been accepted to the M.Sc. programme in Computational Linguistics offered by the Natural Language Processing Institute at the University of Stuttgart.

I already participated in GSoC in 2012, when I worked on the Kazakh-Tatar pair, which I currently maintain.

List any non-Summer-of-Code plans you have for the Summer

I have no other commitments for the entire period of the program.

References

[1] Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair

[2] There is a closed-source bidirectional Tatar-Russian machine translation system for Windows available free of charge at tatar.com.ru. I haven't been able to install it yet and cannot evaluate its quality (but will try to do so later).

[3] suzlek.ru

[4] http://rinatmuhamadiev.ru/produkts.html