Difference between revisions of "User:Ilnar.salimzyan/GSoC2014/Application"
You can find my proposal for GSoC 2014 [http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/selimcan/5649050225344512 here].

== Contact information ==

[[Category:GSoC_2014_Student_proposals|Ilnar.salimzyan]]

'''Name:''' Ilnar Salimzyanov <br/>
'''E-mail address:''' ilnar.salimzyan@gmail.com <br/>
'''IRC:''' selimcan <br/>
'''Sourceforge account:''' selimcan <br/>
'''Timezone:''' UTC+4 <br/>

== Why is it you are interested in machine translation? ==

First, rule-based machine translation combines linguistics and programming, both of which interest me. Second, it is a complex problem, which makes it exciting to work on.

== Why is it that you are interested in the Apertium project? ==

Apertium fascinates me as one of the few (the only?) open-source RBMT platforms that are still growing. I also appreciate its focus on minority languages.

== Which of the published tasks are you interested in? What do you plan to do? ==

Task: Adopting a language pair

Title: '''Apertium-tat-rus -- machine translation system from Tatar to Russian'''

== How it will benefit society ==

Tatar is a Turkic language spoken by about 7 million people, primarily in and around Tatarstan, where it shares official status with Russian. A lot of translation work is done in Tatarstan: for example, all legislative documents are available in both Tatar and Russian, and news agencies publish in both languages. There is therefore demand for a Tatar-Russian machine translation system.

I understand that expecting the output of such a system to be post-edited and then published is too optimistic. However, I expect it to perform well on shorter segments of text and to be used, for example, in combination with a translation memory and a fuzzy-match repair system to support the human translator, as described in [1].

Another use case is gisting, that is, allowing Russian speakers who do not speak Tatar to understand what a text is about. It might also be useful for learners of the language who want to quickly look up words and phrases.

== Why should Google and Apertium sponsor it? ==

In my 2012 proposal for the Kazakh-Tatar pair I wrote that "Turkic languages represent a very good 'new working field' for Apertium -- they build a large group (more than 150 Mio speakers) of similar languages, but the most of the languages belong to non-central ones". Since then a lot has been done for Turkic languages in Apertium, and there are now several Turkic-to-Turkic pairs and transducers for many of these languages.

Like Tatar, most Turkic languages share official status with Russian, either in a region of Russia or in one of the post-Soviet republics. There are no released Turkic-to-Russian pairs in Apertium yet. The value of the Tatar-Russian pair will lie not only in the pair itself, but also in its being the first system in Apertium [2] for translating from a Turkic language to Russian (hopefully released by the end of the GSoC programme).

I will place high emphasis on making the code obvious and tested, so that the purpose of each disambiguation and transfer rule is clear. In particular, I will provide <code>Input (from the previous module)</code> and <code>Output (after this rule applied)</code> comments for the rules in a disciplined manner and try to keep the rules independent of each other. This will make them easy to understand and to adapt for other Turkic languages. I will also try, as far as possible, to write disambiguation rules that are not tied to a particular wordform or lemma.

Another benefit for Apertium is that the Russian transducer would be extended and tested further. It is already used in many pairs and can potentially be used in many more.

== Major goals/deliverables ==

* The 10,000 most frequent stems in the bidix and at least 80% trimmed coverage
* Clean testvoc
* Constraint grammar for Tatar (per-token ambiguity <= __ and precision >= __; full disambiguation of a particular hand-tagged corpus)
* Average WER on unseen texts below 50%

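The WER goal can be checked with a short script. The following is a minimal sketch, assuming that WER is the word-level edit distance between the system output and a post-edited reference, divided by the length of the reference. The function name <code>word_error_rate</code> and the whitespace tokenisation are illustrative assumptions and are not taken from Apertium's own evaluation tools.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Return WER in percent: word-level Levenshtein distance / reference length.

    Assumption: both texts are tokenised on whitespace only.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the hypothesis words seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because the distance is normalised by the reference length, the value can exceed 100 when the output is much longer than the reference.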
== Available resources ==

=== Morphological dictionaries ===

I will not have to start from scratch. Transducers for both Tatar and Russian are already available, with about 90% coverage each (the figure for Russian still needs to be confirmed). A few phonology issues remain in apertium-tat; apertium-rus might require some evaluation and checking of the paradigms.

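Coverage figures like the one above are commonly computed naively, as the share of corpus tokens that receive at least one analysis. The following is a minimal sketch, assuming the analyser output is in the Apertium stream format, where each token is written as <code>^surface/analysis1/analysis2$</code> and an unknown word is marked with an asterisk, as in <code>^surface/*surface$</code>. The function name <code>naive_coverage</code> is hypothetical, and escaped delimiters inside tokens are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis[/analysis...]$
# Assumption: no escaped '^', '/' or '$' characters occur inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found in the input")
    # Unknown words are the units whose analysis part starts with '*'.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return 100.0 * known / len(units)
```

The input would be the text produced by running a corpus through the morphological analyser; the script only counts tokens and does not judge whether the analyses are correct.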
=== Bilingual dictionary ===

The bilingual dictionary has to be written from scratch. However, there are many dictionaries, both in print and online (e.g. at [3]), which can be used for reference.

=== Parallel corpora ===

A lot of Tatar prose has been translated into Russian, and some of it is available online [4]. Of more value for this project are the legislative documents and news on the government's site [tatarstan.ru], which are in the public domain. Being legal texts, they are not 'free' translations, as translations of fiction usually are.

== Workplan ==

=== Overview ===

{|class="wikitable"
|-
! Month
! Community bonding !! Month 1 !! Month 2 !! Month 3
|-
! Morning || || Coverage || Tatar CG || Lexical selection
|-
! Afternoon || Morphology & "Single-word for each lexicon testvoc" || colspan="3"|Transfer rules
|}

=== Schedule ===

See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for the complete timeline.

{|class="wikitable"
! week
! dates
! goals
! notes
|-
!colspan="2" style="text-align: right"|post-application period<br />22 March - 20 April
|
* improve WER of the 'James and Mary' story translation
* read documentation on chunking-based transfer
* read papers describing other Apertium pairs for distant languages
|-
!colspan="2" style="text-align: right"|community bonding period<br />21 April - 19 May
|
* collect test data for transfer (consider the books on contrastive grammar of Tatar and Russian which I have)
* set up regression, corpus and WER tests so that, before each commit, I can be sure that a change has improved the translator rather than broken it
* write transfer rules for single-word chunks
* set up a progress tracking page and scripts
|-
! 1 !! 19 - 24 May
|
|-
! 2 !! 25 - 31 May
|
|-
! 3 !! 1 - 7 June
|
|-
! 4 !! 8 - 14 June
|
|-
! 5 !! 15 - 21 June
|
|-
! 6 !! 22 - 28 June
|
|-
! 7 !! 29 June - 5 July
|
|-
!colspan="2" style="text-align: right"|midterm eval<br />July 6
|
|-
! 8 !! 6 - 12 July
|
|-
! 9 !! 13 - 19 July
|
|-
! 10 !! 20 - 26 July
|
|-
! 11 !! 27 July - 2 August
|
|-
! 12 !! 3 - 10 August
|
|-
!colspan="2" style="text-align: right"|pencils-down week<br />final evaluation<br />11 August - 18 August
|
* document installation and usage for end-users
|}

== List your skills and give evidence of your qualifications ==

I studied German philology and Applied Linguistics at Kazan Federal University and have recently been accepted to the M.Sc. programme in Computational Linguistics offered by the Natural Language Processing Institute at the University of Stuttgart.

I already participated in GSoC in 2012, working on the Kazakh-Tatar pair, which I currently maintain.

== List any non-Summer-of-Code plans you have for the Summer ==

I have no other commitments for the entire period of the program.

== References ==

[1] [[Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair]]

[2] There is a closed-source bidirectional Tatar-Russian machine translation system for Windows, available free of charge at [http://tatar.com.ru/trans.php tatar.com.ru]. See [http://bpaste.net/show/GOYSg1ibXlAlVFl1ymus/ this] page for how it translates the "Mary and James" story in both directions.

[3] [http://suzlek.ru suzlek.ru]

[4] [http://rinatmuhamadiev.ru/produkts.html http://rinatmuhamadiev.ru/produkts.html]

[[Category:GSoC_2014_Student_proposals]] |
Latest revision as of 13:17, 14 May 2014