Develop a prototype MT system for Kazakh - Uzbek language pair
Name: Barno Kutlimuratova (@Kamush)
Location: Galicia, Spain
University: Universidade da Coruña
Degree/Field of Study: MSc in Advanced English Studies and its Applications
Short Description of the proposal
Having seen the benefits of Apertium, the open-source rule-based machine translation platform, as an alternative to other free or commercial online translation systems, especially for many low-resource language pairs, I decided to contribute to the platform by extending the list of language pairs available for my native language, Uzbek.
As a master's student in philology with some experience in creating language resources, I propose to implement a new language pair for Apertium: Kazakh - Uzbek. Both are low-resource Turkic languages and the official languages of two Central Asian countries with close economic and cultural ties, yet this language pair still lacks an open-source machine translation system.
My proposal is to fill this gap as much as possible during the GSoC 2021 program.
Since Uzbek and Kazakh belong to the same language family, they are closely related in grammar, word order, and vocabulary, so I will try to build a bidirectional translator, with more focus on the Kazakh -> Uzbek direction, as Uzbek is my native language and I have only very basic knowledge of Kazakh.
Why is it that you are interested in Apertium?
My field of research is the creation of NLP resources, and I want to contribute to my native language as well, rather than only to English. Apertium is a free and open-source platform that hosts both RBMT systems and monolingual language packages, so I am interested in adding more resources there to support my native language.
Which of the published tasks are you interested in? What do you plan to do?
Title: Apertium translation pair for Kazakh - Uzbek
Besides what the proposal title says, I can also offer flexibility in working on language data, be it monolingual or in pairs where Uzbek is the target language (since it is my native one).
The major points of my proposal are as follows:
• Spending some time on the Uzbek lexicon to achieve a high-accuracy morphological analyser;
• Initializing the Kazakh-Uzbek pair (kaz-uzb);
• Adding dictionary words to the Kazakh-Uzbek pair, increasing the coverage to above 80%;
• Reducing WER on the Kazakh-Uzbek pair (goal: below 30%);
• Integrating apertium-separable into the kaz-uzb pair;
• Writing lexical selection rules for better translation accuracy;
• Creating testvoc for testing;
• Introducing apertium-recursive.
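The two numeric goals above (coverage above 80%, WER below 30%) can be made concrete with a minimal sketch. It is not the project's own evaluation tooling, and the function names are hypothetical. It assumes that coverage is measured as the share of tokens that receive at least one analysis in Apertium stream format (where an unknown word appears as `^word/*word$`), and that WER is the word-level edit distance between a reference translation and the system output, divided by the reference length. The sample strings are illustrative only.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Share of lexical units with at least one analysis.

    Assumption: unknown words are marked with a leading '*' in the
    analysis part, as in ^xyz/*xyz$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative analyser output: one known word, one unknown word.
    sample = "^kitob/kitob<n><nom>$ ^xyz/*xyz$"
    print(f"coverage: {coverage(sample):.0%}")
    # Illustrative reference and system output differing in one word.
    print(f"WER: {word_error_rate('men kitob oldim', 'men kitobni oldim'):.0%}")
```

Under these definitions, the coverage goal means fewer than one in five tokens of a test corpus may come back starred, and the WER goal means fewer than three word-level edits per ten reference words.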
This part is still being created...
Community Bonding Period
May 17-June 5
Make Uzbek better
Expand bilingual dictionary
More on .dix and .lrx
June 27-July 3
More on .dix and .lrx
Test translator and expand more
Test the kaz-uzb translator
Focus on transfer rules
Focus on testvoc
Skills and qualifications
• Academic skills: I am currently a first-year master's student in Advanced English Studies in Spain.
• Language skills: Uzbek (native); English (advanced); Russian, Kazakh, Kyrgyz (basic).
• Programming skills: I have a basic understanding of XML and other markup languages in general, I can work with bash scripts, and I can easily get help from people close to me when actual coding is needed.
Declaration of Honour
I declare that I can spend the required number of hours working with Apertium during the Community Bonding Period and the actual working period over the summer. In case of unexpected changes in my personal life that might affect my working hours, I will immediately inform the mentors and seek their permission, on the condition that I fulfill the requirements even if the official end date has passed.