Difference between revisions of "User:Ilnar.salimzyan/GSoC2014/Application"

You can find my proposal for GSoC 2014 [http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/selimcan/5649050225344512 here].
Remember that this is only a preview :)


[[Category:GSoC_2014_Student_proposals|Ilnar.salimzyan]]
== GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar ==
'''Name:''' Ilnar Salimzyanov

'''E-mail address:''' ilnar.salimzyan@gmail.com

''Other information that may be useful to contact you:''

'''IRC:''' selimcan '''Sourceforge account:''' selimcan '''Cellphone:''' +79625617985 '''Timezone:''' UTC+04.00

=Why is it you are interested in machine translation?=

=Why is it that you are interested in the Apertium project?=

=Which of the published tasks are you interested in? What do you plan to do?=
'''Task:'''
''Adopting a language pair''

'''Title:'''
''Apertium-kaz-tat — machine translation between Kazakh and Tatar''

==Why should Google and Apertium sponsor it?==

==How and whom will it benefit in society?==

=Work plan=

For the Kazakh-Tatar language pair I will not have to start from scratch. The transducers for both languages already perform quite well, with coverage of 76% and 71% <ref> Consult ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki</ref>. Given that, I think the way to get the most out of these separate transducers with the least work is to write the bidix file, translating the words in the Kazakh lexc file into Tatar.
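To make the coverage figures above concrete, here is a minimal sketch of naive coverage, understood as the share of tokens that receive at least one analysis. It assumes the analyser output is in the Apertium stream format, where each token appears as <code>^surface/analysis$</code> and an unknown word is marked with a leading <code>*</code>; the function name <code>naive_coverage</code> is hypothetical and is not how the figures on the wiki pages were necessarily computed.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def naive_coverage(analysed):
    """Return the fraction of tokens that received at least one analysis.

    `analysed` is assumed to be analyser output in Apertium stream format,
    where unknown words look like ^word/*word$.
    """
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)
```

For example, an output of two tokens where one is unknown gives a coverage of 0.5.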

; Bilingual dictionary
All words in kazakh.lexc <ref>/branches/apertium-kaz/</ref> were commented with English glosses (thanks to whoever did this!). Using a simple sed one-liner, I prepared bidix entries with the Kazakh words on the left side, again putting the English glosses into comments. In a few hours of work, I translated ~500 nouns (not proper nouns) and most of the adjectives into Tatar <ref>/branches/apertium-kaz-tat/words</ref>. For Kazakh words which look very similar to Tatar ones and have the same meaning as their Tatar equivalents, this can be done very quickly. For the others I also consulted Kazakh-Russian dictionaries, but even so, translating all the remaining words from kazakh.lexc will take no more than a few days of focused work.
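The conversion described above can be sketched as follows. This is not the sed one-liner that was actually used, only an illustration in Python. It assumes that a lexc entry has the shape <code>lemma:stem CONTINUATION ; ! "gloss"</code>; the mapping <code>POS_TAGS</code> from continuation lexicons to bidix tags is hypothetical, as is the function name.

```python
import re
from xml.sax.saxutils import escape

# Assumed shape of a lexc entry:  lemma:stem CONTINUATION ; ! "gloss"
LEXC_ENTRY = re.compile(
    r'^(?P<lemma>[^:!\s]+):\S*\s+(?P<cont>\S+)\s*;\s*!\s*"(?P<gloss>[^"]*)"'
)

# Hypothetical mapping from continuation lexicon names to bidix tags.
POS_TAGS = {"N1": "n", "A1": "adj"}


def lexc_to_bidix(line):
    """Turn one glossed lexc entry into a bidix entry with an empty right side.

    Returns None for lines that are not entries or have an unmapped category.
    """
    match = LEXC_ENTRY.match(line.strip())
    if match is None:
        return None
    tag = POS_TAGS.get(match.group("cont"))
    if tag is None:
        return None
    lemma = escape(match.group("lemma"))
    # "--" is not allowed inside an XML comment, so collapse it.
    gloss = match.group("gloss").replace("--", "-")
    return (f'<e><p><l>{lemma}<s n="{tag}"/></l>'
            f'<r><s n="{tag}"/></r></p></e> <!-- {gloss} -->')
```

The right side is left with only the tag, so that the Tatar equivalent can be typed in by hand while the English gloss stays visible in the comment.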

; Parallel Corpora
Some sentences are available from the Tatoeba project<ref>www.tatoeba.org</ref>. Bible and Quran translations can also serve as a source of parallel corpora<ref>See e.g. tanzil.net; kkitap.net; http://www.ibt.org.ru/english/bible/ttr.htm and kuran.kz</ref>. There are also Kazakh and Tatar localization teams for several FOSS desktop environments <ref>?links to lxde, xfce, gnome and kde localization teams?</ref>, but the localizations are far from complete.

; Frequency lists
According to Francis Tyers, the stems in the Kazakh transducer were taken from a frequency list (obtained from the Kazakh RLFE corpus), which is certainly good. As for Tatar, I shared the preprocessed pages I had collected earlier with corpus.tatfolk.ru (a project aiming to create a web-crawled corpus of Tatar, similar in functionality to the Wortschatz project). After combining them with their own data, they provided me with a frequency list of Tatar wordforms <ref>See branches/apertium-tat/words</ref>.
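For illustration, a frequency list of wordforms like the ones mentioned above can be built from plain text roughly as follows. This is a minimal sketch, not the procedure used by corpus.tatfolk.ru or for the Kazakh list; the tokenisation rule (runs of letters, optionally joined by hyphens) and the function name are assumptions.

```python
import re
from collections import Counter

# A wordform is taken to be a run of letters, optionally joined by hyphens.
WORDFORM = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def frequency_list(text):
    """Return (wordform, count) pairs, most frequent first."""
    tokens = WORDFORM.findall(text.lower())
    return Counter(tokens).most_common()
```

Stems for the transducer can then be added in order of frequency, so that the most common words are covered first.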

=Work To do=
==Before the coding period:==

==The coding period:==

==Non-GSoC activities==

==List your skills and give evidence of your qualifications==

I am a first-year master’s student at Kazan Federal University, studying Applied Linguistics.<ref>A not-so-clear term, which has caused many debates. What we study is a mix of computational linguistics, lexicography and several other courses.</ref>

I first got to know Apertium in 2009, while writing a small paper at the university comparing available machine translation systems. Apertium fascinated me then by being open source, growing rapidly and being a good potential starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then and there weren’t any other Turkic languages involved). I even managed to model noun morphotactics with it!

Back in 2009 I translated part of the Official Documentation into Russian <ref> See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt</ref> (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009 I translated the Apertium New language pair HOWTO into Russian.

I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.

I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. Having a transducer for my native language (and for the language closest to it) is very useful for learning the semantics and structure of lexc and twol files, together with reading the famous FSMBook. I wasn’t really familiar with these files, since using HFST with Apertium is relatively new and is not mentioned in the Official Documentation.

I have been involved in the work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester”<ref>See accepted, but not-yet-published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper</ref>. Together with a colleague from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations, finding out where Apertium-tt-ba did not perform well, describing them on the wiki <ref>The Morphology of Tatar Language</ref> and committing to svn from time to time.

==References==
<references/>

[[Category:GSoC 2012 Student Proposals]]

Latest revision as of 13:17, 14 May 2014
