Difference between revisions of "User:Ilnar.salimzyan/GSoC2014/Application"

From Apertium
(Replaced content with 'You can find my proposal on [http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/selimcan/1# Google-Melange].')
You can find my proposal on [http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/selimcan/1# Google-Melange].
Remember that this is only a preview :)
 
 
== GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar ==
 
'''Name:''' Ilnar Salimzyanov
 
 
'''E-mail address:''' ilnar.salimzyan@gmail.com
 
 
''Other information that may be useful to contact you:''
 
 
'''IRC:''' selimcan '''Sourceforge account:''' selimcan '''Cellphone:''' +79625617985 '''Timezone:''' UTC+04:00
 
 
==Why is it you are interested in machine translation?==
 
Since my school years I have had a passion for math and languages (this probably has something to do with my parents being teachers: my father teaches math and my mother German). Back then, choosing what to study was a dilemma for me. Although I chose linguistics, computational linguistics seems to be a good compromise, since it involves both to some degree. And I really enjoy working in this field.
 
 
==Why is it that you are interested in the Apertium project?==
 
Apertium fascinates me as one of the few (only?) open-source RBMT platforms showing rapid growth. I like it even more for prioritizing work on so-called non-central<ref>http://165.134.12.12/pub/mt.pdf</ref>, or, in other words, under-resourced or marginalized languages, to which my native language (Tatar) unfortunately also belongs.
 
 
==Which of the published tasks are you interested in? What do you plan to do?==
 
 
'''Task:'''
 
''Adopting a language pair''
 
 
'''Title:'''
 
''Apertium-kaz-tat — machine translation between Kazakh and Tatar''
 
 
==Why should Google and Apertium sponsor it?==
 
 
I think that Turkic languages are a very promising new field of work for Apertium: they form a large group (more than 150 million speakers) of similar and mutually intelligible languages, yet most of them are under-resourced. Kazakh and Tatar are in the "Top 7" of the Turkic languages by number of speakers.
 
 
==Work plan==
 
 
For the Kazakh-Tatar language pair I will not have to start from absolute scratch. The transducers for both languages perform quite well, with coverage of 76% and 71% <ref>See the ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki</ref>. Given that, I think the most effective way to benefit from these separate transducers with the least work is to write a bidix file that translates the words in the Kazakh lexc file into Tatar.
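
As a rough illustration of what "coverage" means here, the sketch below computes naive coverage: the share of tokens in an analysed text that received at least one analysis. It assumes the usual Apertium stream format, in which an unknown word is marked with a leading asterisk. The function name, the sample words and the tags are made up for the example and are not taken from the actual transducers.

```python
import re

# Apertium stream format: each lexical unit looks like ^surface/analysis1/analysis2$;
# an unknown word is marked with a leading asterisk: ^surface/*surface$
# (escaped characters inside units are ignored in this simplified sketch).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the share of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output: three known tokens and one unknown token.
sample = "^бала/бала<n><nom>$ ^кітап/кітап<n><nom>$ ^ххх/*ххх$ ^үй/үй<n><nom>$"
print(naive_coverage(sample))  # 0.75
```

Run over the analysed version of a whole corpus, the same ratio gives the kind of figure quoted above.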
 
 
; Bilingual dictionary
 
All words in kazakh.lexc <ref>See /branches/apertium-kaz/</ref> had been commented with English glosses (thanks to whoever did this!). Using a simple sed one-liner, I prepared bidix entries with the Kazakh words on the left side and the English glosses carried over as comments. In a few hours of work, I translated ~500 nouns (excluding proper nouns) and most of the adjectives into Tatar <ref>See /branches/apertium-kaz-tat/words</ref>. For Kazakh words that look very similar to their Tatar equivalents and have the same meaning, this can be done very quickly. For the others I also consulted Kazakh-Russian dictionaries; even so, translating all the remaining words in kazakh.lexc should take no more than a few days of focused work.
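
The preparation step can be sketched roughly as follows. This is a Python stand-in, not the sed one-liner that was actually used. The layout of a glossed lexc line (<code>stem:stem CONT_CLASS ; ! "gloss"</code>), the continuation class names and their mapping to bidix tags are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Assumed layout of a glossed lexc entry: stem:stem CONT_CLASS ; ! "gloss"
LEXC_ENTRY = re.compile(
    r'^(?P<lemma>[^:!\s]+):\S*\s+(?P<cont>\S+)\s*;\s*!\s*"(?P<gloss>[^"]*)"'
)

# Hypothetical mapping from continuation classes to bidix part-of-speech tags.
POS_TAGS = {"N1": "n", "A1": "adj"}


def bidix_skeleton(lexc_line):
    """Build a bidix entry with the left side filled in and the gloss as a comment.

    Returns None for lines that are not glossed entries of a known class.
    """
    match = LEXC_ENTRY.match(lexc_line.strip())
    if match is None or match.group("cont") not in POS_TAGS:
        return None
    tag = POS_TAGS[match.group("cont")]
    lemma = escape(match.group("lemma"))
    # "--" is not allowed inside an XML comment, so collapse it.
    gloss = match.group("gloss").replace("--", "-")
    return (f'<e><p><l>{lemma}<s n="{tag}"/></l>'
            f'<r><s n="{tag}"/></r></p></e> <!-- {gloss} -->')


print(bidix_skeleton('бала:бала N1 ; ! "child"'))
```

The right side is deliberately left without a lemma: that is the part filled in by hand while translating, with the English gloss in the comment as a hint.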
 
 
; Parallel corpora
 
Bible and Quran translations can serve as a source of parallel corpora<ref>See e.g. tanzil.net; kkitap.net; http://www.ibt.org.ru/english/bible/ttr.htm and kuran.kz</ref>. There are also Kazakh and Tatar localization teams for several FOSS desktop environments (LXDE, XFCE, Gnome and KDE), but the localizations are far from complete. Some sentences are also available from the Tatoeba project<ref>www.tatoeba.org</ref>.
 
 
; Monolingual corpora and frequency lists
 
For Kazakh there is a corpus built from Radio Free Europe/Radio Liberty materials, and some of the stems in the Kazakh lexc are the most frequent words taken from it. This is certainly good news. As for Tatar, corpus.tatfolk.ru (a project aiming to create a web-crawled corpus of Tatar, similar in functionality to the Wortschatz project) provided me with a frequency list of Tatar wordforms <ref>See branches/apertium-tat/words</ref>, after I shared with them the preprocessed pages I had collected earlier and they combined these with their own data.
 
 
===Before 21 May===
 
* Improve the Tatar lexc and twol files as part of Apertium Tatar-Bashkir. The goal is to increase coverage to 80% (see below for a description of my work on apertium-tt-ba);
 
** Read the documentation where necessary; finish reading FSMBook.
 
* Create a bilingual dictionary by:
 
** translating stems from the Kazakh transducer into Tatar;
 
** translating into Kazakh the Tatar stems that are in apertium-tt-ba/tt.lexc but not yet in the bilingual dictionary (as they are the most frequent Tatar words).
 
* Collect sentences that are difficult to translate (put them on the wiki under something like Kazakh-Tatar/Pending tests<ref>Yes, I read the talk about regression tests! Here it is only about difficult cases.</ref>).
 
 
'''Deliverable #I:''' A bidix file containing all the stems from branches/apertium-kaz/kaz.lexc and nursery/apertium-tt-ba/tt.lexc
 
'''Deliverable #II:''' A morphological transducer for Tatar with 80% coverage (as part of apertium-tt-ba)
 
 
===The coding period. General description===
 
 
* Make the transducers really compatible. I have already started doing this [see apertium-tat and apertium-kaz in branches], following the tag-choosing conventions described in [Turkic Languages]<ref>http://wiki.apertium.org/wiki/Turkic_languages</ref> and the general structure of continuation classes and lexicons implemented in [branches tur kir] (as the most complete transducers for Turkic languages).
 
 
* Add to the Kazakh lexc file the Kazakh stems that entered the bilingual dictionary when the Tatar words were translated as described above. Expand the Tatar transducer with stems from the bilingual dictionary if it doesn’t "recognize" them. In other words, check that both lexc files are up to date with the bilingual dictionary (which in our case contains the largest set of stems).
 
 
* Evaluate the transducers: write some basic transfer rules and run testvoc. The rules from the tt-ba pair can be reused: since these languages are *very* close, few transfer rules are actually needed. What we need are just rules that take a lexical unit and output a lexical unit with the same tags.
 
 
* If the Kazakh transducer doesn't reach 80% monolingual coverage after that, expand it with the unrecognized words taken from the RFE/RL corpus. Add these words to the bilingual dictionary and the corresponding translations to the Tatar lexc file.
 
 
* Work on constraint grammars and transfer rules. Since Turkic languages usually share POS ambiguities, using the same tags and the same morphotactic logic (e.g. “syntactic” categories like ‘subst’, ‘attr’ etc.) should, I guess, let the translators perform quite well even without many constraint grammar rules. So I will invest more time in transfer rules.
 
** Write rules so that the sentences from "Pending tests" collected earlier are translated correctly.
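
The consistency check between the bilingual dictionary and the lexc files mentioned in the list above could look roughly like the sketch below. It assumes the standard bidix layout (<code>&lt;e&gt;&lt;p&gt;&lt;l&gt;…&lt;/l&gt;&lt;r&gt;…&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;</code>) and single-word lemmas; the function names and the sample entries are hypothetical, and the set of lexc lemmas is assumed to have been extracted separately.

```python
import xml.etree.ElementTree as ET


def bidix_lemmas(bidix_xml, side):
    """Collect lemmas from the <l> (side='l') or <r> (side='r') of each entry."""
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for element in root.iter(side):
        # element.text is the text before the first <s .../> child, i.e. the lemma.
        # Multiword lemmas containing <b/> are not handled in this sketch.
        if element.text and element.text.strip():
            lemmas.add(element.text.strip())
    return lemmas


def missing_from_lexc(bidix_xml, side, lexc_lemmas):
    """Return bidix lemmas on the given side that the lexc file does not contain."""
    return sorted(bidix_lemmas(bidix_xml, side) - set(lexc_lemmas))


sample_bidix = """<dictionary><section id="main" type="standard">
<e><p><l>бала<s n="n"/></l><r>бала<s n="n"/></r></p></e>
<e><p><l>кітап<s n="n"/></l><r>китап<s n="n"/></r></p></e>
</section></dictionary>"""

# Tatar side: 'китап' is in the bidix but not in the (toy) Tatar lexc lemma set.
print(missing_from_lexc(sample_bidix, "r", {"бала"}))  # ['китап']
```

An empty list for both sides would mean that every stem in the bidix is also present in the corresponding lexc file, which is the precondition for a clean testvoc run.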
 
 
==Week plan==
 
 
* Weeks 1-3: Make the Kazakh transducer consistent with the Tatar one. Tweak the twol file for Kazakh if necessary
 
* Week 4: Check whether the lexc files contain all the stems from the bidix. Run testvoc
 
 
* '''Deliverable #1''': Testvoc-clean morphological transducers with 80% monolingual coverage
 
 
* Week 5: Write CG rules for cases where Kazakh and Tatar surface forms don't share ambiguity
 
 
* '''Deliverable #2''': Minimal constraint grammars
 
 
* Weeks 6-8: Continue work on disambiguation rules. Write transfer rules
 
 
* '''Deliverable #3''': Transfer rules
 
 
* Weeks 9-11: Testing. Write lexical selection rules. Improve transfer rules
 
* '''Deliverable #4''': Lexical selection rules. Updated versions of everything delivered before
 
 
* Week 12: Evaluation
 
'''Release'''
 
 
==List your skills and give evidence of your qualifications==
 
 
I am a first-year master’s student at Kazan Federal University, studying Applied Linguistics <ref>A not-so-clear term that has caused many debates. What we study is a mix of computational linguistics, lexicography and several other courses.</ref>.
 
 
I first got to know Apertium in 2009, while writing a small university paper comparing available machine translation systems. Apertium fascinated me then by being open source, showing rapid growth and being a good potential starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then and no other Turkic languages were involved yet). I even managed to model noun morphotactics with it!
 
 
Back in 2009 I translated part of the Official Documentation into Russian <ref> See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt</ref> (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009 I translated the Apertium New language pair Howto into Russian.
 
 
I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.
 
 
I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. Having a transducer for my native language (and for the language closest to it) is very useful for learning the semantics and structure of lexc and twol files, along with reading the famous FSMBook. I wasn’t really familiar with these files, since using HFST with Apertium is relatively new and is not mentioned in the Official Documentation.
 
 
I have been involved in the work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester”<ref>See the accepted but not-yet-published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper</ref>. Together with a colleague from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations to find out where Apertium-tt-ba performs poorly, describing them on the wiki <ref>http://wiki.apertium.org/wiki/Morphology_of_Tatar_language</ref> and committing to svn from time to time.
 
 
==Non-GSoC activities==
 
I have an exam in the first week of June; after that I have no other commitments.
 
 
==References==
 
<references/>
 
 
[[Category:GSoC 2012 Student Proposals]]
 

Revision as of 16:14, 6 April 2012

You can find my proposal on Google-Melange.