Difference between revisions of "User:Ilnar.salimzyan/GSoC2014/Application"
Revision as of 07:29, 6 April 2012
Contents
- 1 GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Why should Google and Apertium sponsor it?
- 6 How and whom it will benefit in society?
- 7 Work plan
- 8 Week plan
- 9 List your skills and give evidence of your qualifications
- 10 Non-GSoC activities
- 11 References
GSoC application: Apertium-kaz-tat: machine translation between Kazakh and Tatar
Name: Ilnar Salimzyanov
E-mail address: ilnar.salimzyan@gmail.com
Other information that may be useful to contact you:
IRC: selimcan
Sourceforge account: selimcan
Cellphone: +79625617985
Timezone: UTC+04.00
Why is it you are interested in machine translation?
Why is it that you are interested in the Apertium project?
Which of the published tasks are you interested in? What do you plan to do?
Task: Adopting a language pair
Title: Apertium-kaz-tat — machine translation between Kazakh and Tatar
Why should Google and Apertium sponsor it?
How and whom it will benefit in society?
Work plan
For the Kazakh-Tatar language pair I will not have to start from absolute scratch. Transducers for both languages already perform quite well, with 76% and 71% coverage [1]. Given that, I think the quickest way to benefit from these separate transducers is to write the bidix file first, translating words from the Kazakh lexc file into Tatar.
- Bilingual dictionary
All words in kazakh.lexc [2] were commented with English glosses (thanks to whoever did this!). Using a simple sed one-liner, I prepared bidix entries with Kazakh words as the left side, again putting the English glosses into comments. In a few hours' work, I translated ~500 nouns (not proper nouns) and most of the adjectives into Tatar [3]. For Kazakh words which look very similar to Tatar ones and have the same meaning as their Tatar equivalents, this can be done very quickly. For the others I also consulted Kazakh-Russian dictionaries, but even so, translating all the remaining words from kazakh.lexc will take no more than a few days of focused work.
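The conversion step described above can be sketched in Python rather than sed. This is only an illustration, not the one-liner actually used: the lexc line layout (`lemma:stem Class ; ! "gloss"`), the continuation-class names in `POS_BY_CLASS` and the function name `lexc_to_bidix` are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from lexc continuation classes to bidix POS tags;
# the real class names in the Kazakh lexc file may differ.
POS_BY_CLASS = {"N1": "n", "A1": "adj"}

# Assumed line layout: lemma:stem ContinuationClass ; ! "English gloss"
ENTRY = re.compile(
    r'^(?P<lemma>[^:\s!]+):\S+\s+(?P<cls>\S+)\s*;\s*'
    r'(?:!\s*"?(?P<gloss>[^"]*)"?)?\s*$'
)

def lexc_to_bidix(line):
    """Return a bidix <e> skeleton for one lexc entry, or None if the line
    is not a stem entry with a known continuation class."""
    m = ENTRY.match(line.strip())
    if m is None or m.group("cls") not in POS_BY_CLASS:
        return None
    pos = POS_BY_CLASS[m.group("cls")]
    lemma = escape(m.group("lemma"))
    # "--" is not allowed inside an XML comment, so collapse it.
    gloss = (m.group("gloss") or "").strip().replace("--", "-")
    comment = f" <!-- {gloss} -->" if gloss else ""
    # The right side is left without a stem, to be filled in by the translator.
    return (f'<e><p><l>{lemma}<s n="{pos}"/></l>'
            f'<r><s n="{pos}"/></r></p></e>{comment}')

if __name__ == "__main__":
    print(lexc_to_bidix('алма:алма N1 ; ! "apple"'))
```

Keeping the gloss as an XML comment next to each entry means the translator sees the meaning without opening the lexc file.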
- Parallel Corpora
Bible and Quran translations can serve as a source of parallel corpora [4]. There are also both Kazakh and Tatar localization teams for several FOSS desktop environments (LXDE, XFCE, Gnome and KDE), but the localizations are far from complete. Some sentences are available at the Tatoeba project [5].
- Monolingual corpora and frequency lists
For Kazakh there is a corpus made of materials from Radio Free Europe/Radio Liberty, and part of the stems in the Kazakh lexc are the most frequent words taken from it. This is certainly good news. As for Tatar, there is corpus.tatfolk.ru, a project aiming to create a web-crawled corpus of Tatar, similar in functionality to the Wortschatz project. I shared with them the preprocessed pages I had collected earlier; after concatenating these with their own data, they provided me with a frequency list of Tatar wordforms [6].
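For readers unfamiliar with wordform frequency lists, the idea can be sketched as follows. This is a minimal illustration and not the pipeline used by corpus.tatfolk.ru; the tokenisation rule (runs of letters, optionally joined by hyphens) and the function name `frequency_list` are assumptions.

```python
import re
from collections import Counter

# Runs of Unicode letters, optionally joined by single hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def frequency_list(text):
    """Return (wordform, count) pairs, most frequent first;
    ties are broken alphabetically."""
    counts = Counter(w.lower() for w in WORD.findall(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    for form, n in frequency_list("Мин бардым. Мин кайттым."):
        print(n, form)
```

Stems for the lexc file can then be added in frequency order, so that each new entry buys as much coverage as possible.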
Before 21 May
- Improve Tatar lexc and twol files as part of Apertium Tatar-Bashkir. The goal is to increase the coverage to 80% (see the description of work on apertium-tt-ba below);
- Read documentation where necessary; finish reading FSMBook.
- Create bilingual dictionary, by:
- translating stems from Kazakh transducer into Tatar;
- translating Tatar stems which are in apertium-tt-ba/tt.lexc but not yet in the bilingual dictionary into Kazakh (as they are the most frequent Tatar words).
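- Collect sentences that are difficult to translate (put them on the wiki under something like Kazakh-Tatar/Pending tests; yes, I have read the discussion about regression tests, this page is only for difficult cases).
Deliverable #I: A bidix file containing all the stems from branches/apertium-kaz/kaz.lexc and nursery/apertium-tt-ba/tt.lexc
Deliverable #II: A morphological transducer for Tatar with 80% coverage (as part of apertium-tt-ba)
The coding period. General description
- Make transducers really compatible. I already started doing this [see apertium-tat and apertium-kaz in branches], following the tag-choosing conventions described in [Turkic Languages] (http://wiki.apertium.org/wiki/Turkic_languages) and the general structure of continuation classes and lexicons implemented in [branches tur kir] (as the most complete transducers for a Turkic language).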
- Add Kazakh stems from the bilingual dictionary, where they appeared after translating Tatar words as described above, to the lexc-file for Kazakh. Expand the transducer for Tatar with stems from the bilingual dictionary if it doesn’t "recognize" them. In other words, check that both lexc-files are up-to-date with the bilingual dictionary (which contains the most stems in our case).
- Evaluate transducers: write some basic transfer rules and run testvoc. The rules from the tt-ba pair can be reused: since these languages are *very* close, few transfer rules are actually needed. What we need is just rules which take a lexical unit and output a lexical unit with the same tags.
- If the Kazakh transducer doesn't reach 80% monolingual coverage after that, expand it with unrecognized words taken from the RFE/RL corpus. Add these words to the bilingual dictionary and the corresponding translations to the Tatar lexc-file.
- Work on constraint grammars and transfer rules. Since Turkic languages usually share POS ambiguities, I guess that, by using the same tags and the same logic of morphotactics (e.g. using “syntactic” categories like ‘subst’, ‘attr’ etc.), the translators would perform quite well even without many Constraint Grammar rules. So I will invest more time in transfer rules.
- Write rules so that sentences from "Pending tests" collected earlier are translated correctly.
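The coverage figures used in this plan can be estimated naively as the share of tokens that receive at least one analysis. Below is a minimal sketch that assumes the analyser output is in the Apertium stream format, where an unknown word is printed as `^word/*word$`. The function name `coverage` and the tags in the sample input are illustrative only, and escaped characters in the stream are not handled.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")

def coverage(analysed):
    """Return (share of tokens with an analysis, Counter of unknown forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LU.findall(analysed):
        total += 1
        if analyses.startswith("*"):   # '*' marks an unrecognized word
            unknown[surface] += 1
    known = total - sum(unknown.values())
    return (known / total if total else 0.0), unknown

if __name__ == "__main__":
    sample = "^алма/алма<n><nom>$ ^фывап/*фывап$"
    ratio, missing = coverage(sample)
    print(f"{ratio:.0%}", missing.most_common(10))
```

The counter of unknown forms doubles as a to-do list: the most frequent unrecognized words are the ones worth adding to the lexc-file first.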
Week plan
- Weeks 1-3: Make the Kazakh transducer consistent with the Tatar one. Tweak the twol-file for Kazakh if necessary
- Week 4: Check whether lexc-files contain all the stems from the bidix. Run testvoc
- Deliverable #1: Testvoc-clean, 80% monolingual coverage morphological transducers
- Week 5: Write CG rules for cases where Kazakh and Tatar surface forms don't share ambiguity
- Deliverable #2: Minimalistic Constraint Grammars
- Weeks 6-8: Continue on disambiguation rules. Write transfer rules
- Deliverable #3: Transfer rules
- Weeks 9-11: Testing. Write lexical selection rules. Improve transfer rules
- Deliverable #4: Lexical selection rules. Updated versions of everything delivered before
- Week 12: Evaluation
Release
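The week 4 check (do the lexc-files contain all the stems from the bidix?) could be automated along the following lines. This is a rough sketch under simplifying assumptions: single-word bidix entries without `<b/>`, lexc entries written as `lemma:stem Class ;` on one line, and no escaped `!` characters. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def bidix_stems(bidix_xml):
    """Collect left-side and right-side stems from a bidix document."""
    left, right = set(), set()
    for pair in ET.fromstring(bidix_xml).iter("p"):
        for tag, stems in (("l", left), ("r", right)):
            side = pair.find(tag)
            if side is not None and side.text:
                stems.add(side.text.strip())
    return left, right

def lexc_lemmas(lexc_text):
    """Collect the lemma (the part before ':') of every lexc entry line."""
    lemmas = set()
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()   # drop trailing comments
        if ":" in line and line.endswith(";"):
            lemmas.add(line.split(":", 1)[0])
    return lemmas

def missing_stems(bidix_xml, left_lexc, right_lexc):
    """Return bidix stems that are absent from the respective lexc files."""
    left, right = bidix_stems(bidix_xml)
    return (sorted(left - lexc_lemmas(left_lexc)),
            sorted(right - lexc_lemmas(right_lexc)))
```

Testvoc remains the authoritative check, since it also exercises inflected forms; a set difference like this only catches stems that are missing outright.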
List your skills and give evidence of your qualifications
I am a first-year master’s student at the Kazan Federal University, studying Applied Linguistics [7].
I first got to know Apertium in 2009, while writing a short university paper comparing available machine translation systems. Apertium fascinated me by being open source, growing rapidly and being a good potential starting point for Tatar and other Turkic languages (yes, I have thought about them too). I played around with an lttoolbox dictionary for Tatar (a bad idea, I know, but I didn’t know about "X/S/HFST"s then, and no other Turkic languages were involved yet). I even managed to model noun morphotactics with it!
Back in 2009 I translated part of the Official Documentation into Russian [8] (up to chapter 3.2.3; besides someone willing to finish it, the translation needs a good editor). Also in 2009 I translated the Apertium New language pair Howto into Russian.
I was one of the participants of the Šupaškar Apertium Workshop, held in January this year, where Francis Tyers, Hector Alos-i-Font, Jonathan Washington and Trond Trosterud were instructors.
I was very fortunate to see Jonathan and Francis work on the Tatar-Bashkir pair as an example pair for the Šupaškar Workshop and move it to nursery. Having a transducer for my native language (and for the language closest to it) is very useful for learning the semantics and structure of lexc and twol files, along with reading the famous FSMBook. I wasn’t really familiar with these files, since using HFST with Apertium is relatively new and is not covered in the Official Documentation.
I have been involved in work on the Tatar-Bashkir pair as, let’s say, a “language consultant” and “tester” [9]. Together with a colleague from Ufa, I have been translating the top-5000 wordlist of the Russian National Corpus into Tatar and Bashkir. These translations were then added to the translator files. I have also been analyzing errors in the translations, finding out where Apertium-tt-ba did not perform well, describing them on the wiki [10] and committing to svn from time to time.
Non-GSoC activities
I have an exam in the first week of June; after that I have no other commitments.
References
- ↑ Consult ‘Kazmorph’ and ‘Tatmorph’ pages on the wiki
- ↑ See /branches/apertium-kaz/
- ↑ See /branches/apertium-kaz-tat/words
- ↑ See e.g. tanzil.net; kkitap.net; http://www.ibt.org.ru/english/bible/ttr.htm and kuran.kz
- ↑ www.tatoeba.org
- ↑ See branches/apertium-tat/words
- ↑ A not-so-clear term, which caused many debates. What we study is a mix of computational linguistics, lexicography and several other courses.
- ↑ See /apertium/trunk/apertium-documentation/apertium-2.0/ru/apertium_docu.odt
- ↑ See accepted, but not-yet-published paper here: https://www.softconf.com/lrec2012/TurkicLanguage2012/cgi-bin/scmd.cgi?scmd=getFinal&passcode=18X-P9A6A3D6H8&_lDoc=Paper
- ↑ http://wiki.apertium.org/wiki/Morphology_of_Tatar_language