User:Kamush/GSoC2021Proposal

Develop a prototype MT system for Kazakh - Uzbek language pair

Contact Information

Name: Barno Kutlimuratova (@Kamush)

Nationality: Uzbekistan

Location: Galicia, Spain

University: Universidade da Coruña

Email: kutlimuratovab0712@gmail.com

Degree/Field of Study: MSc in Advanced English Studies and its Applications

IRC: Kamush

Timezone: GMT+2

Github: kamush901

Short Description of the proposal

Having seen the benefits of Apertium, an open-source rule-based machine translation (RBMT) platform, as an alternative to other free and commercial online translation systems, especially for low-resource language pairs, I decided to contribute to the platform by extending the list of language pairs available for my native language, Uzbek.

As a master's student in philology with some experience in creating language resources, I propose to implement a new language pair for Apertium: Kazakh - Uzbek. Both are low-resource Turkic languages and the official languages of two Central Asian countries with close economic and cultural ties, yet this language pair still lacks an open-source machine translation system.

My proposal is to fill this gap as much as possible during GSoC 2021.

Since Uzbek and Kazakh belong to the same language family, they are closely related in grammar, word order, and vocabulary, so I will aim for bidirectional translation, with more focus on the Kazakh -> Uzbek direction, as Uzbek is my native language and I have only a very basic knowledge of Kazakh.

Why is it that you are interested in Apertium?

My field of research is the creation of NLP resources, and I want to contribute to my native language, not only to English. Apertium is a free and open-source platform that hosts both RBMT language pairs and monolingual language packages, and I am interested in adding more resources there to support my native language.

Which of the published tasks are you interested in? What do you plan to do?

Title: Apertium translation pair for Kazakh - Uzbek

Beyond what the proposal title says, I can also be flexible about working on other language data, whether monolingual or in pairs where Uzbek is the target language (since it is my native language).

Major goals

The major points of my proposal are as follows:

   • Spending some time on the Uzbek lexicon to achieve a high-accuracy morphological analyser;
   • Initializing the Kazakh-Uzbek pair (kaz-uzb);
   • Adding dictionary words to the Kazakh-Uzbek pair, increasing coverage to above 80%;
   • Reducing WER (word error rate) on the Kazakh-Uzbek pair (goal: below 30%);
   • Integrating apertium-separable into the kaz-uzb pair;
   • Writing lexical selection rules for better translation accuracy;
   • Creating a testvoc for testing;
   • Introducing apertium-recursive.
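
The coverage and WER targets above could be tracked with a small script along these lines. This is only a minimal sketch under stated assumptions, not the project's actual evaluation tooling. It assumes that unknown words in the translator's output are marked with a leading `*` (Apertium's usual convention for unknown words), that coverage is the naive share of tokens without that mark, and that WER is the word-level edit distance divided by the length of a post-edited reference. The function names and sample sentences are hypothetical.

```python
def naive_coverage(translated_text: str) -> float:
    """Share of whitespace-separated tokens NOT marked as unknown.

    Assumption: unknown words carry a leading '*' in the output.
    """
    tokens = translated_text.split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    known = sum(1 for token in tokens if not token.startswith("*"))
    return known / len(tokens)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sample strings, for illustration only.
    print(naive_coverage("bu *kitob juda yaxshi"))
    print(word_error_rate("bu kitob juda yaxshi", "bu kitob yaxshi"))
```

With these definitions, the goals would correspond to `naive_coverage(...) > 0.80` on a corpus and `word_error_rate(...) < 0.30` against post-edited reference translations.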

Workplan

This part is still being created...

Time Period	Goal	Details
Community Bonding Period May 17-June 5	Install Apertium; Initialize kaz-uzb pair; Collect data in both languages	Install Apertium and necessary tools; Send the first PR that can translate a small sample text; Extract Uzbek and Kazakh wiki corpora; Collect Uzbek and Kazakh web (non-wiki) corpora; Collect a Kazakh-Uzbek dictionary and parallel corpora;
Week 1 June 6-12	Make Uzbek better	Go through all Uzbek stems in uzb.lexc; Clean (deduplicate) and correct uzb stems; Improve Uzbek lexicon;
Week 2 June 13-19	Expand bilingual dictionary	Start adding bilingual dictionary elements;
Week 3 June 20-26	More on .dix and .lrx	Expand bilingual dictionary; Lexical selection rules;
Week 4 June 27-July 3	Focus on transfer rules	Expand bilingual dictionary; Lexical selection rules;
Week 5 July 4-10	Test translator and expand more	Test the kaz-uzb translator; Expand the Uzbek lexicon with missing words; Expand bilingual dictionary; Expand lexical selection rules;
Week 6 July 11-17	Focus more on transfer rules	Work more on transfer rules; More bilingual dictionary; More lexical selection rules;
Week 7 July 18-24	Test the kaz-uzb translator	Test the kaz-uzb translator; Extend the Uzbek lexicon with missing words; Extend the Kazakh lexicon with missing words; Extend bilingual dictionary; Add more lexical selection rules;
Week 8 July 25-31	Focus on transfer rules	Add words, rules; Work on transfer rules; Start the testvoc;
Week 9 August 1-7	Focus on testvoc	Add words, rules; Transfer rules kaz-uzb; Testvoc kaz-uzb
Week 10 August 8-14	Finalize work	Test the kaz-uzb translator; Check the transfer rules; Check the testvoc; Write the final report;
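
The Week 1 task of deduplicating stems in uzb.lexc could be approached roughly as follows. This is a sketch under simplifying assumptions, not a description of the actual file: it treats each line as one entry, strips everything after `!` as a lexc comment (ignoring escaped `%!`), compares entries after collapsing whitespace, and only drops repeats within the same `LEXICON` block. The function name, the sample entries, and the continuation class `N1` are hypothetical.

```python
def dedupe_lexc_entries(lines):
    """Drop repeated entries inside each LEXICON block, keeping the first."""
    seen = set()
    kept = []
    for line in lines:
        # Everything after '!' is treated as a comment (simplification).
        entry = line.split("!", 1)[0]
        # Collapse whitespace so spacing differences do not hide duplicates.
        key = " ".join(entry.split())
        if key.startswith("LEXICON"):
            seen = set()          # a new lexicon starts: reset the memory
            kept.append(line)
            continue
        if key and key in seen:
            continue              # same entry already kept in this lexicon
        if key:
            seen.add(key)
        kept.append(line)         # blank and comment-only lines are kept
    return kept


if __name__ == "__main__":
    # Hypothetical sample content, for illustration only.
    sample = [
        "LEXICON Nouns",
        "kitob:kitob N1 ; ! book",
        "kitob:kitob  N1 ;",
        "uy:uy N1 ;",
    ]
    for line in dedupe_lexc_entries(sample):
        print(line)
```

A script like this only finds exact repeats. Stems that differ in spelling or continuation class would still need manual review.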

Skills and qualifications

   • Academic skills: I am currently a first-year master's student in Advanced English Studies in Spain.
   • Language skills: Uzbek (native); English (advanced); Russian, Kazakh, Kyrgyz (basic).
   • Programming skills: I have a basic understanding of XML and other markup languages, I can work with bash scripts, and I can readily get help from people close to me when actual coding is needed.

Declaration of Honour

I declare that I can spend the required number of hours working with Apertium during the community bonding period and the actual working period over the summer. If sudden changes in my personal life affect my working hours, I will immediately inform the mentors and seek their permission, on the condition that I fulfil the requirements even after the official end date.
