User:Ilnar.salimzyan/GSoC2014/Application



GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar

Name: Ilnar Salimzyanov

E-mail address: ilnar.salimzyan@gmail.com

Other information that may be useful to contact you:

IRC: selimcan
SourceForge account: selimcan
Cellphone: +79625617985
Timezone: UTC+04:00

Why is it you are interested in machine translation?

Why is it that you are interested in the Apertium project?

Which of the published tasks are you interested in? What do you plan to do?

Task: Adopting a language pair

Title: Apertium-kaz-tat — machine translation between Kazakh and Tatar

Why should Google and Apertium sponsor it?

How and whom it will benefit in society?

Work plan

For the Kazakh-Tatar language pair I will not have to start from scratch. The transducers for both languages already perform quite well, with coverage of 76% and 71% [1]. Given that, I think the way to get the most out of these separate transducers with the least work is to write the bidix file, translating the words in the Kazakh lexc file into Tatar.
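To make the coverage figures concrete, here is a minimal sketch of how such a number can be computed. It assumes that coverage means the share of tokens receiving at least one analysis, and that the analyser output is in the Apertium stream format, where an unknown token appears as `^form/*form$`. The function name `naive_coverage` is hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped "^", "/" and "$" are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(stream):
    """Return the fraction of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    # An unknown word is marked by an analysis that starts with "*".
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)
```

For example, a stream with one analysed token and one unknown token gives a coverage of 0.5.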

Bilingual dictionary

All words in kazakh.lexc [2] have English glosses in comments (thanks to whoever did this!). Using a simple sed one-liner, I prepared bidix entries with the Kazakh words on the left side, carrying the English glosses over as comments. In a few hours of work, I translated ~500 nouns (excluding proper nouns) and most of the adjectives into Tatar [3]. For Kazakh words that look very similar to their Tatar equivalents and have the same meaning, this can be done very quickly. For the others I also consulted Kazakh-Russian dictionaries; even so, translating all the remaining words from kazakh.lexc will take no more than a few days of focused work.
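The conversion step can be sketched roughly as follows. This is not the sed one-liner itself but a Python approximation of the same idea. It assumes lexc entries of the form `lemma:stem ContinuationClass ; ! "gloss"`; the continuation class names in `CONT_TO_TAG` and the function name `lexc_to_bidix` are hypothetical. The right side of each entry is left empty for the Tatar translation to be filled in by hand.

```python
import re
from xml.sax.saxutils import escape

# Assumed lexc entry shape: lemma:stem ContinuationClass ; ! "gloss"
LEXC_ENTRY = re.compile(
    r'^(?P<lemma>[^\s:!]+)(?::\S+)?\s+(?P<cont>\S+)\s*;'
    r'\s*(?:!\s*"?(?P<gloss>[^"]*)"?)?\s*$'
)

# Hypothetical mapping from continuation classes to bidix part-of-speech tags.
CONT_TO_TAG = {"N1": "n", "A1": "adj"}


def lexc_to_bidix(line):
    """Turn one lexc entry into a bidix skeleton, or None if it is not an entry."""
    match = LEXC_ENTRY.match(line.strip())
    if match is None:
        return None
    tag = CONT_TO_TAG.get(match.group("cont"))
    if tag is None:
        return None
    lemma = escape(match.group("lemma"))
    gloss = (match.group("gloss") or "").strip()
    # Left side: Kazakh lemma; right side: empty, to be translated into Tatar.
    entry = '<e><p><l>{0}<s n="{1}"/></l><r><s n="{1}"/></r></p></e>'.format(lemma, tag)
    if gloss:
        # "--" is not allowed inside an XML comment.
        entry += " <!-- {} -->".format(gloss.replace("--", "-"))
    return entry
```

Lines that are not entries (lexicon headers, comment-only lines, unmapped continuation classes) yield `None` and can simply be skipped.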

Parallel Corpora

Some sentences are available from the Tatoeba project [4]. Bible or Quran translations can serve as a source of parallel corpora [5]. There are also Kazakh and Tatar localization teams for several FOSS desktop environments [6], but the localizations are far from complete.

Frequency lists

According to Francis Tyers, the stems in the Kazakh transducer were taken from a frequency list (obtained from the Kazakh RLFE corpus), which is certainly good. As for Tatar, corpus.tatfolk.ru (a project aiming to create a web-crawled corpus of Tatar, similar in functionality to the Wortschatz project) provided me with a frequency list of Tatar wordforms [7]. The list was built after I shared with them the preprocessed pages I had collected earlier, which were concatenated with what corpus.tatfolk.ru already had.
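For illustration, a wordform frequency list of this kind can be derived from raw text along the following lines. This is only a sketch under simple assumptions (lowercasing, tokens made of letters with optional internal hyphens); it is not how corpus.tatfolk.ru built its list, and the function name `frequency_list` is hypothetical.

```python
import re
from collections import Counter

# A token is a run of Unicode letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def frequency_list(lines):
    """Count lowercased wordforms; return (form, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD.findall(line))
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The most frequent forms in such a list are the ones worth adding to the transducer and the bidix first.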

Work to do

Before the coding period:

The coding period:

Non-GSoC activities

List your skills and give evidence of your qualifications

I am a first-year master’s student at Kazan Federal University, studying Applied Linguistics [8].

I first got to know Apertium in 2009, while writing a small university paper comparing the available machine translation systems. Apertium fascinated me then by being open source, growing rapidly and being a good potential starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then, and no other Turkic languages were involved yet). I even managed to model noun morphotactics using it!

Back in 2009 I translated part of the Official Documentation into Russian [9] (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009 I translated the Apertium New language pair Howto into Russian.

I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.

I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. Having a transducer for my native language (and for the language closest to it) is very useful for learning the semantics and structure of lexc and twol files (which I wasn’t really familiar with, since using HFST with Apertium is a relatively new thing and is not mentioned in the Official Documentation), along with reading the famous FSMBook.

I have been involved in the work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester” [10]. Together with a colleague from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations, finding out where Apertium-tt-ba performed not so well, describing them on the wiki [11] and committing to svn from time to time.

References

  1. Consult ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki
  2. /branches/apertium-kaz/
  3. /branches/apertium-kaz-tat/words
  4. www.tatoeba.org
  5. See e.g. tanzil.net; kkitap.net; http://www.ibt.org.ru/english/bible/ttr.htm and kuran.kz
  6. ?links to lxde, xfce, gnome and kde localization teams?
  7. See branches/apertium-tat/words
  8. A not-so-clear term, which caused many debates. What we study is a mix of computational linguistics, lexicography and several other courses.
  9. See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt
  10. See the accepted but not-yet-published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper
  11. The Morphology of Tatar Language