Difference between revisions of "Narimann/GSOC 2019 proposal: Kazakh-Turkish and Turkish-Kazakh"
(12 intermediate revisions by 3 users not shown) | |||
Line 75: | Line 75: | ||
===Title=== |
===Title=== |
||
− | Turkish |
+ | Turkish and Tatar to Kazakh MT |
===Why Google and Apertium should sponsor it? How and who it will benefit in society?=== |
===Why Google and Apertium should sponsor it? How and who it will benefit in society?=== |
||
Line 81: | Line 81: | ||
There are a lot of people who speak these languages: Kazakh(around 11 million), Turkish(75 million), Tatar(5 million). |
There are a lot of people who speak these languages: Kazakh(around 11 million), Turkish(75 million), Tatar(5 million). |
||
− | Turkish-Kazakh & Tatar-Kazakh pairs work stably, but only in one direction. So making them work in both direction will make these pairs more valuable and will lead to further development of Turkic languages in machine translation. In addition, it will help people to communicate with each other or at least translate the texts needed. |
+ | Turkish-to-Kazakh & Tatar-to-Kazakh pairs work stably, but only in one direction. So making them work in both direction will make these pairs more valuable and will lead to further development of Turkic languages in machine translation. In addition, it will help people to communicate with each other or at least translate the texts needed. |
===Work Plan=== |
===Work Plan=== |
||
+ | '''Turkish-Kazakh & Tatar-Kazakh''' |
||
− | ====Post Application Period==== |
||
− | |||
− | Reading Wiki and Documentation |
||
+ | * Add at least 1000 bidix stems each week |
||
− | ====Community Bonding Period==== |
||
+ | * Write new transfer rules |
||
− | |||
+ | * Write constraint-based lexical selection |
||
− | Discuss details and get acquainted with all aspects of these pairs as far as possible, |
||
+ | {|class="wikitable" |
||
− | Identification of morphological and syntax differences |
||
+ | ! style="width: 10%" | Week |
||
+ | ! style="width: 15%" | Dates |
||
+ | ! style="width: 45%" | Goals |
||
+ | ! style="width: 5%" | Bidix |
||
+ | ! style="width: 15%" | WER / PER |
||
+ | ! style="width: 10%" | Coverage |
||
+ | |- |
||
+ | ! Post-application period |
||
+ | | style="text-align:center" | 9 April - 27 May |
||
+ | | |
||
+ | * Identification of morphological and syntax differences |
||
+ | * Begin fixing broken bidix entries |
||
+ | | style="text-align:center" | kaz-tur: ~8000, kaz-tat: ~10000 |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 1 |
||
+ | | style="text-align:center" | 27 May - 3 June |
||
+ | | |
||
+ | * Expand bilingual dictionary(nouns) |
||
+ | * Write transfer rules (Turkish > Kazakh) |
||
+ | * Write lexical selection rules (Turkish > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 2 |
||
+ | | style="text-align:center" | 4 June - 11 June |
||
+ | | |
||
+ | * Expand bilingual dictionary(verbs) |
||
+ | * Write transfer rules (Turkish > Kazakh) |
||
+ | * Write lexical selection rules (Turkish > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 3 |
||
+ | | style="text-align:center" | 12 June - 19 June |
||
+ | | |
||
+ | * Expand bilingual dictionary(adj) |
||
+ | * Write transfer rules (Turkish > Kazakh) |
||
+ | * Write lexical selection rules (Turkish > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 4 |
||
+ | | style="text-align:center" | 20 June - 27 June |
||
+ | | |
||
+ | * Expand bilingual dictionary |
||
+ | * Write transfer rules (Turkish > Kazakh) |
||
+ | * Write lexical selection rules (Turkish > Kazakh) |
||
+ | * Documentation |
||
+ | '''First evaluation''' |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 5 |
||
+ | | style="text-align:center" | 28 June - 4 July |
||
+ | | |
||
+ | * Expand bilingual dictionary(nouns) |
||
+ | * Write transfer rules (Tatar > Kazakh) |
||
+ | * Write lexical selection rules (Tatar > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 6 |
||
+ | | style="text-align:center" | 4 July - 11 July |
||
+ | | |
||
+ | * Expand bilingual dictionary(verbs) |
||
+ | * Write transfer rules (Tatar > Kazakh) |
||
+ | * Write lexical selection rules (Tatar > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 7 |
||
+ | | style="text-align:center" | 12 July - 19 July |
||
+ | | |
||
+ | * Expand bilingual dictionary(adj) |
||
+ | * Write transfer rules (Tatar > Kazakh) |
||
+ | * Write lexical selection rules (Tatar > Kazakh) |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 8 |
||
+ | | style="text-align:center" | 20 July - 27 July |
||
+ | | |
||
+ | * Expand bilingual dictionary |
||
+ | * Write transfer rules (Tatar > Kazakh) |
||
+ | * Write lexical selection rules (Tatar > Kazakh) |
||
+ | * Documentation |
||
+ | '''Second evaluation''' |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 9 |
||
+ | | style="text-align:center" | 28 July - 3 August |
||
+ | | |
||
+ | * Testing, Evaluation, Correction, Documentation |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | | style="text-align:center" | |
||
+ | |- |
||
+ | ! 10 |
||
+ | | style="text-align:center" | 4 August - 11 August |
||
+ | | |
||
+ | * Testing, Evaluation, Correction, Documentation |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | |- |
||
+ | ! 11 |
||
+ | | style="text-align:center" | 12 August - 19 August |
||
+ | | |
||
+ | * Testing, Evaluation, Correction, Documentation |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | |- |
||
+ | ! 12 |
||
+ | | style="text-align:center" | 20 August - 27 August |
||
+ | | |
||
+ | * Testing, Evaluation, Correction, Documentation |
||
+ | '''Final evaluation''' |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | ! style="text-align:center" | |
||
+ | |} |
||
+ | == Coding Challenge == |
||
− | ====Week 1==== |
||
+ | https://github.com/nariman9119/apertium-kaz-tur-1 |
||
+ | There I added some transfer rules for verbs, based on tur-kir pair. The coding challenge is not complete as I couldn't spend much time on it, due to other projects in my university. |
||
− | Check the Turkish monodix |
||
+ | Sorry for being late. |
||
− | Check the Kazakh monodix |
||
− | |||
− | Expand the Kazakh-Turkish bidix |
||
− | |||
− | ====Week 2==== |
||
− | |||
− | Expand the Kazakh-Turkish bidix |
||
− | |||
− | Supplement and test the Constraint Grammar rules from Turkish to Kazakh |
||
− | |||
− | Design and preliminary testing of transfer rules from Turkish to Kazakh |
||
− | |||
− | ====Week 3==== |
||
− | |||
− | Expand the Kazakh-Turkish bidix |
||
− | |||
− | Supplement and test the Constraint Grammar rules from Turkish to Kazakh |
||
− | |||
− | Design and preliminary testing of transfer rules from Turkish to Kazakh |
||
− | |||
− | ====Week 4==== |
||
− | |||
− | Expand the Kazakh-Turkish bidix |
||
− | |||
− | Test and debugging of constraint grammar and transfer rules. |
||
− | |||
− | ====Deliverable 1==== |
||
− | |||
− | Complete set of constraint grammar rules for the tur-kaz direction |
||
− | |||
− | Complete set of transfer rules for the tur-kaz direction |
||
− | |||
− | ====Week 5==== |
||
− | |||
− | Expand the Kazakh-Tatar bidix |
||
− | |||
− | Start extending the Tatar monodix |
||
− | |||
− | Start extending the Kazakh monodix |
||
− | |||
− | ====Week 6==== |
||
− | |||
− | Expand the Kazakh-Tatar bidix |
||
− | |||
− | Supplement and test the Constraint Grammar rules from Tatar to Kazakh |
||
− | |||
− | Design and preliminary testing of transfer rules from Tatar to Kazakh. |
||
− | |||
− | ====Week 7==== |
||
− | |||
− | Expand the Kazakh-Tatar bidix |
||
− | |||
− | Supplement and test the Constraint Grammar rules from Tatar to Kazakh |
||
− | |||
− | Design and preliminary testing of transfer rules from Tatar to Kazakh. |
||
− | |||
− | ====Week 8==== |
||
− | |||
− | Expand the Kazakh-Tatar bidix |
||
− | |||
− | Test and debugging of constraint grammar and transfer rules. |
||
− | |||
− | ====Deliverable 2==== |
||
− | |||
− | Complete set of constraint grammar rules for the tat-kaz direction |
||
− | |||
− | Complete set of transfer rules for the tat-kaz direction |
||
− | |||
− | ====Week 9==== |
||
− | |||
− | Testing |
||
− | |||
− | ====Week 10==== |
||
− | |||
− | Testing |
||
− | |||
− | ====Week 11==== |
||
− | |||
− | Testing |
||
− | |||
− | ====Week 12==== |
||
− | |||
− | Testing |
||
− | |||
− | ====Final deliverable==== |
||
− | |||
− | Final evaluation of pairs |
||
− | |||
− | == Coding Challenge == |
||
− | To be updated |
||
== List any non-Summer-of-Code plans you have for the Summer == |
== List any non-Summer-of-Code plans you have for the Summer == |
||
Line 195: | Line 238: | ||
I am planning to visit my parents in Kazakhstan in the period between 1-10 August. In this period I will be able to work at least 30 hours a week. During this time I will change timezone from GMT+3 to GMT+6. |
I am planning to visit my parents in Kazakhstan in the period between 1-10 August. In this period I will be able to work at least 30 hours a week. During this time I will change timezone from GMT+3 to GMT+6. |
||
+ | |||
+ | [[Category:GSoC 2019 student proposals]] |
Latest revision as of 13:10, 14 April 2019
Contact Information
Name: Daniyar Nariman
Location: Kazan, Tatarstan
E-mail: n.daniyar@innopolis.ru, nariman9119@gmail.com
IRC: nariman
Github: https://github.com/nariman9119
Gitlab: https://gitlab.com/users/nariman9119
Telegram: nariman9119
Timezone: GMT+3, GMT+6
Skills
I am a third-year undergraduate student at Innopolis University (Tatarstan, Russia).
Major: Computer Science
Track: Data Science
Programming Skills: Python, Java, C, C++, XML
Languages
Kazakh - upper-intermediate
Russian - fluent
English - upper-intermediate (IELTS 6.0 - 2015)
Turkish - intermediate (5 years of studying in a Kazakh-Turkish school)
NLP Related Projects
Word Sense Disambiguation for WordNet corpora
LSTM for Text classification
Russian-Tatar text classification
Tweet analysis on different preprocessing approaches
Keyboard layout and associated misspellings analysis
Dynamic Language Interpreter implementation
My current field of study is closely related to Natural Language Processing. For the last 3 months, I have worked at a company developing an AutoLabeller system, which processes text and extracts only the information needed.
Since I will graduate from university next year, I am planning to choose a topic related to machine translation for my diploma, and this internship will help me a lot in studying how RBMT works in more depth and detail.
Why is it that you are interested in Apertium?
I am studying Computer Science at my university, on the Data Science track. I am very interested in machine translation and other areas of NLP.
I am interested in Apertium because it pays attention not only to widely spoken languages with many speakers around the world but also to minority languages, which are less popular and sometimes do not even have enough data to build a useful translator.
Nowadays statistical machine translation (SMT) is far more popular than rule-based machine translation (RBMT). The problem is that SMT depends heavily on data in the form of parallel corpora, which many languages do not have. RBMT does not require as much data, but it does require a lot of manual effort. From this we can conclude that Apertium is a good approach to machine translation for small languages. Another point is that, with a good and complete implementation of a specific pair, Apertium can reach accuracy comparable to the giants in this field, such as Google or Yandex.
Which of the published tasks are you interested in? What do you plan to do?
Turkish to Kazakh and Tatar to Kazakh MT.
As we discussed on the Apertium-stuff mailing list, I will make these two pairs work stably in that direction, drawing on my linguistic knowledge. Among the published tasks, this one is most closely related to 1.2 and 1.4.
Proposal
Title
Turkish and Tatar to Kazakh MT
Why Google and Apertium should sponsor it? How and who it will benefit in society?
There are a lot of people who speak these languages: Kazakh (around 11 million), Turkish (75 million), Tatar (5 million).
Turkish-to-Kazakh & Tatar-to-Kazakh pairs work stably, but only in one direction. So making them work in both directions will make these pairs more valuable and will lead to further development of Turkic languages in machine translation. In addition, it will help people communicate with each other, or at least translate the texts they need.
Work Plan
Turkish-Kazakh & Tatar-Kazakh
- Add at least 1000 bidix stems each week
- Write new transfer rules
- Write constraint-based lexical selection
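Week | Dates | Goals | Bidix | WER / PER | Coverage |
---|---|---|---|---|---|
Post-application period | 9 April - 27 May | Identification of morphological and syntax differences; begin fixing broken bidix entries | kaz-tur: ~8000, kaz-tat: ~10000 | | |
1 | 27 May - 3 June | Expand bilingual dictionary (nouns); write transfer rules (Turkish > Kazakh); write lexical selection rules (Turkish > Kazakh) | | | |
2 | 4 June - 11 June | Expand bilingual dictionary (verbs); write transfer rules (Turkish > Kazakh); write lexical selection rules (Turkish > Kazakh) | | | |
3 | 12 June - 19 June | Expand bilingual dictionary (adj); write transfer rules (Turkish > Kazakh); write lexical selection rules (Turkish > Kazakh) | | | |
4 | 20 June - 27 June | Expand bilingual dictionary; write transfer rules (Turkish > Kazakh); write lexical selection rules (Turkish > Kazakh); documentation. First evaluation | | | |
5 | 28 June - 4 July | Expand bilingual dictionary (nouns); write transfer rules (Tatar > Kazakh); write lexical selection rules (Tatar > Kazakh) | | | |
6 | 4 July - 11 July | Expand bilingual dictionary (verbs); write transfer rules (Tatar > Kazakh); write lexical selection rules (Tatar > Kazakh) | | | |
7 | 12 July - 19 July | Expand bilingual dictionary (adj); write transfer rules (Tatar > Kazakh); write lexical selection rules (Tatar > Kazakh) | | | |
8 | 20 July - 27 July | Expand bilingual dictionary; write transfer rules (Tatar > Kazakh); write lexical selection rules (Tatar > Kazakh); documentation. Second evaluation | | | |
9 | 28 July - 3 August | Testing, evaluation, correction, documentation | | | |
10 | 4 August - 11 August | Testing, evaluation, correction, documentation | | | |
11 | 12 August - 19 August | Testing, evaluation, correction, documentation | | | |
12 | 20 August - 27 August | Testing, evaluation, correction, documentation. Final evaluation | | | |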
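The WER / PER and Coverage columns of the work plan are to be filled in as the work progresses. A minimal sketch of how such figures might be computed is given here. It is only an illustration, not Apertium's own evaluation tooling: the function names (`wer`, `per`, `coverage`) are hypothetical, the example sentences are placeholders, and the coverage estimate assumes that unknown words appear in the translated text with a leading `*`.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from the hypothesis
                            curr[j - 1] + 1,      # extra word in the hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignores word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


def coverage(translated):
    """Share of tokens not marked as unknown (assumed marker: leading '*')."""
    tokens = translated.split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    known = [t for t in tokens if not t.startswith("*")]
    return len(known) / len(tokens)


if __name__ == "__main__":
    reference = "a b c d"       # placeholder for a post-edited sentence
    hypothesis = "a x c d"      # placeholder for raw MT output
    print(wer(reference, hypothesis))   # one substitution out of four words
    print(per(reference, hypothesis))
    print(coverage("*foo bar baz qux"))
```

Comparing WER with PER for the same sentence gives a rough idea of how many errors come from word order rather than from word choice, which is useful when deciding whether to spend time on transfer rules or on the bidix.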
Coding Challenge
https://github.com/nariman9119/apertium-kaz-tur-1
There I added some transfer rules for verbs, based on the tur-kir pair. The coding challenge is not complete, as I could not spend much time on it due to other projects at my university.
Sorry for being late.
List any non-Summer-of-Code plans you have for the Summer
I consider GSoC a full-time job and I will not have other commitments during this time. Also, I am planning to start work on the project during the community bonding period to get acquainted with all aspects of these pairs as far as possible (20 hours a week).
I am planning to visit my parents in Kazakhstan between 1 and 10 August. During this period I will be able to work at least 30 hours a week, and my timezone will change from GMT+3 to GMT+6.