User:Ilnar.salimzyan/GSoC2014/Application



GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar

Name: Ilnar Salimzyanov

E-mail address: ilnar.salimzyan@gmail.com

Other information that may be useful to contact you:

IRC: selimcan

Sourceforge account: selimcan

Cellphone: +79625617985

Timezone: UTC+04.00

Why is it you are interested in machine translation?

Why is it that you are interested in the Apertium project?

Which of the published tasks are you interested in? What do you plan to do?

Task: Adopting a language pair

Title: Apertium-kaz-tat — machine translation between Kazakh and Tatar

Why should Google and Apertium sponsor it?

How and whom it will benefit in society?

Work plan

For the Kazakh-Tatar language pair I will not have to start from scratch. The transducers for both languages already perform quite well, with 76% and 71% coverage respectively [1]. Given that, I think the most effective way to benefit from these separate transducers with the least work is to write the bidix file, translating the words from the Kazakh lexc file into Tatar.
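As a rough illustration of what these coverage figures mean, the sketch below computes naive coverage (the share of tokens that receive at least one analysis) from analyser output in the Apertium stream format, where an unknown word is returned as `^word/*word$`. The function name, the sample sentence and the tags in it are illustrative assumptions, not output of the actual Kazakh or Tatar transducers, and escaped characters in the stream are not handled.

```python
import re

# Apertium stream format: each lexical unit is ^surface/analysis1/analysis2$;
# an unknown surface form comes back as ^surface/*surface$.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of tokens that received at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known tokens and one unknown token.
sample = (
    "^алма/алма<n><nom>$ ^қызыл/қызыл<adj>$ "
    "^фубар/*фубар$ ^бар/бар<v><iv>/бар<adj>$"
)
print(f"{naive_coverage(sample):.2f}")
```

An ambiguous token (several analyses separated by `/`) still counts once, so the figure says nothing about whether the analyses are correct.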

Bilingual dictionary

All words in kazakh.lexc [2] were commented with English glosses (thanks to whoever did this!). Using a simple sed one-liner, I prepared bidix entries with the Kazakh words on the left side, again putting the English glosses into comments. In a few hours’ work, I translated ~500 nouns (not proper nouns) and most of the adjectives into Tatar [3]. For Kazakh words that look very similar to their Tatar equivalents and have the same meaning, this can be done very quickly. For the others I also consulted Kazakh-Russian dictionaries, but even so, translating all the remaining words from kazakh.lexc will take no more than a few days of focused work.
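The conversion step can be sketched as follows. This is a Python illustration of the idea, not the sed one-liner actually used. It assumes lexc lines of the form `lemma:stem ContinuationClass ; ! "gloss"`; the continuation class names and their mapping to bidix tags (`CONT_TO_TAG`) are hypothetical, and lemmas containing spaces are not handled. The right side is left empty so that the Tatar translation can be filled in by hand.

```python
import re
from xml.sax.saxutils import escape

# Assumed lexc entry shape:  алма:алма N1 ; ! "apple"
LEXC_ENTRY = re.compile(
    r'^(?P<lemma>[^:!\s]+):(?P<stem>\S+)\s+(?P<cont>\S+)\s*;\s*!\s*"(?P<gloss>[^"]*)"'
)

# Hypothetical mapping from continuation classes to bidix part-of-speech tags.
CONT_TO_TAG = {"N1": "n", "A1": "adj"}


def lexc_to_bidix(line):
    """Turn one glossed lexc line into a bidix entry with an empty right side."""
    match = LEXC_ENTRY.match(line.strip())
    if match is None or match["cont"] not in CONT_TO_TAG:
        return None  # not an entry line, or a class we do not map
    tag = CONT_TO_TAG[match["cont"]]
    lemma = escape(match["lemma"])
    # "--" is not allowed inside an XML comment, so collapse runs of hyphens.
    gloss = re.sub(r"-{2,}", "-", match["gloss"])
    return (
        f'<e><p><l>{lemma}<s n="{tag}"/></l>'
        f'<r><s n="{tag}"/></r></p></e> <!-- {gloss} -->'
    )


print(lexc_to_bidix('алма:алма N1 ; ! "apple"'))
```

Keeping the gloss in a comment next to each entry means the translator never has to look back at the lexc file while filling in the right side.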

Parallel Corpora

Bible and Quran translations can serve as a source of parallel corpora [4]. There are also both Kazakh and Tatar localization teams for several FOSS desktop environments (LXDE, XFCE, Gnome and KDE), but the localizations are far from complete. Some sentences are available from the Tatoeba project [5].

Frequency lists

According to Francis Tyers, the stems in the Kazakh transducer were taken from a frequency list (obtained from the Kazakh RLFE corpus), which is certainly good. As for Tatar, I shared pages that I had collected and preprocessed earlier with corpus.tatfolk.ru (a project aiming to create a web-crawled corpus of Tatar, similar in functionality to the Wortschatz project); after these were merged with what corpus.tatfolk.ru already had, the project provided me with a frequency list of Tatar wordforms [6].
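For illustration only, a wordform frequency list of this kind can be built from plain text in a few lines. This is a minimal sketch with an assumed tokenization (runs of letters, optionally joined by hyphens, lowercased); it does not describe how corpus.tatfolk.ru actually produced its list.

```python
import re
from collections import Counter

# Assumed tokenization: runs of word characters, optionally joined by hyphens.
WORD = re.compile(r"\w+(?:-\w+)*")


def frequency_list(text):
    """Return (wordform, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(word.lower() for word in WORD.findall(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = "Китап укыйм. Китап яхшы, китап кирәк."
for wordform, count in frequency_list(sample):
    print(count, wordform)
```

Sorting by descending count makes it easy to take the top of the list as candidate stems for the lexc file.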

Before the coding period 21 May
  • Work on improving the Tatar lexc and twol files as part of Apertium Tatar-Bashkir. The goal is to increase coverage to 80% (see the description of my work on apertium-tt-ba below);
    • Read the documentation where necessary; finish reading the FSMBook;
  • Create the bilingual dictionary by:
    • translating stems from the Kazakh transducer into Tatar;
    • translating Tatar stems from tt.lexc in apertium-tt-ba into Kazakh (as they are the most frequent Tatar words).
  1. Make the transducers really compatible. I have already started doing this [see apertium-tat and apertium-kaz in branches], following the tag-choosing conventions described in [1] and the general structure of continuation classes and lexicons implemented in [branches tur kir] (as the most complete transducers for a Turkic language). This task also includes remedying the known shortcomings of the Tatar transducer mentioned below.
  2. Expand the Tatar transducer with stems from the bilingual dictionary where it does not recognize them.
  3. Work on constraint grammars and transfer rules. Since Turkic languages usually share POS ambiguities, and the transducers use the same tags and the same morphotactic logic (with “syntactic” categories like ‘subst’, ‘attr’, etc.), I expect the translators to perform quite well even without many Constraint Grammar rules, so I will invest more time in transfer rules.

The coding period

Week plan

List your skills and give evidence of your qualifications

I am a first-year master’s student at Kazan Federal University, studying Applied Linguistics [7].

I first got to know Apertium in 2009, while writing a small university paper comparing the available machine translation systems. Apertium fascinated me then by being open source, growing rapidly and being a good potential starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then, and no other Turkic languages were involved yet). I even managed to model noun morphotactics with it!

Back in 2009 I translated part of the Official Documentation into Russian [8] (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009 I translated the Apertium New language pair Howto into Russian.

I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.

I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. It is very useful to have a transducer for my native language (and the language closest to it) for learning the semantics and structure of lexc and twol files (which I wasn’t really familiar with, since using HFST with Apertium is a relatively new thing and is not mentioned in the Official Documentation), along with reading the famous FSMBook.

I have been involved in the work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester” [9]. Together with another fellow from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations, finding out where apertium-tt-ba performed poorly, describing this on the wiki [10] and committing to svn from time to time.

Non-GSoC activities

References

  1. Consult ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki
  2. See /branches/apertium-kaz/
  3. See /branches/apertium-kaz-tat/words
  4. See e.g. tanzil.net; kkitap.net; http://www.ibt.org.ru/english/bible/ttr.htm and kuran.kz
  5. www.tatoeba.org
  6. See branches/apertium-tat/words
  7. A not-so-clear term, which caused many debates. What we study is a mix of computational linguistics, lexicography and several other courses.
  8. See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt
  9. See the accepted but not yet published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper
  10. http://wiki.apertium.org/wiki/Morphology_of_Tatar_language