User:Ilnar.salimzyan/GSoC2014/Application

GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar

Name: Ilnar Salimzyanov

E-mail address: ilnar.salimzyan@gmail.com

Other information that may be useful to contact you:

IRC: selimcan
SourceForge account: selimcan
Cellphone: +79625617985
Timezone: UTC+04:00

Why is it you are interested in machine translation?

Why is it that you are interested in the Apertium project?

Which of the published tasks are you interested in? What do you plan to do?

Task: Adopting a language pair

Title: Apertium-kaz-tat — machine translation between Kazakh and Tatar

Why should Google and Apertium sponsor it?

How and whom it will benefit in society?

Work plan

Work to do

Before the coding period:

The coding period:

Non-GSoC activities

List your skills and give evidence of your qualifications

I am a first-year master’s student at Kazan Federal University, studying Applied Linguistics [1].

I first got to know Apertium in 2009, while writing a short university paper comparing the available machine translation systems. Apertium fascinated me then because it was open source, was growing rapidly, and looked like a good starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then, and no other Turkic languages were involved yet). I even managed to model noun morphotactics with it!

Back in 2009 I translated part of the Official Documentation into Russian [2] (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009, I translated the Apertium New language pair HOWTO into Russian.

I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.

I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. Having a transducer for my native language (and the language closest to it) is very useful for learning the semantics and structure of lexc and twol files (which I wasn’t really familiar with, since using HFST with Apertium is a relatively new thing and is not mentioned in the Official Documentation), along with reading the famous FSMBook.

I have been involved in work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester” [3]. Together with a colleague from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations, finding out where Apertium-tt-ba performed poorly, describing them on the wiki [4], and committing to svn from time to time.

References

  1. A not-so-clear term that has caused much debate. What we study is a mix of computational linguistics, lexicography and several other subjects.
  2. See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt
  3. See the accepted but not-yet-published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper
  4. The Morphology of Tatar Language