User:Kamush/GSoC2021Proposal

From Apertium
Revision as of 16:28, 15 April 2021 by Kamush (talk | contribs) (Workplan about to finish)

Develop a prototype MT system for the Kazakh - Uzbek language pair

Contact Information

Name: Barno Kutlimuratova (@Kamush)

Nationality: Uzbekistan

Location: Galicia, Spain

University: Universidade da Coruña

Email: kutlimuratovab0712@gmail.com

Degree/Field of Study: MSc in Advanced English Studies and its Applications

IRC: Kamush

Timezone: GMT+2

Github: kamush901


Short Description of the proposal

Having seen the benefits of Apertium, the open-source rule-based machine translation platform, as an alternative to other free or commercial online translation systems, especially for many low-resource language pairs, I decided to contribute to the platform by extending the list of language pairs available for my native language, Uzbek.

As a master's student in philology with some experience in creating language resources, I propose to implement a new language pair for Apertium: Kazakh - Uzbek. Both are low-resource Turkic languages and the official languages of two Central Asian countries with many economic and cultural ties, yet this language pair still lacks an open-source machine translation system.

My proposal is to fill this gap as much as possible during this GSoC2021 program.

Since Uzbek and Kazakh belong to the same language family, they are closely related in grammar, word order, and vocabulary, so I will try to make the translation bidirectional, with more focus on the Kazakh -> Uzbek direction, as Uzbek is my native language and I have only very basic knowledge of Kazakh.


Why is it that you are interested in Apertium?

Creating NLP resources is my field of research, and I want to contribute to my native language as well, rather than only to English. Since Apertium is a free and open-source platform for both RBMT and monolingual language packages, I am interested in adding more resources there to support my native language.


Which of the published tasks are you interested in? What do you plan to do?

Title: Apertium translation pair for Kazakh - Uzbek

Besides what the proposal title says, I can also offer flexibility in working on language data, whether monolingual or in pairs where Uzbek is the target language (since it is my native language).

Major goals

The major points of my proposal are as follows:

   • Spending a little time on the Uzbek lexicon to achieve a high-accuracy morphological analyser;
   • Initializing the Kazakh-Uzbek pair (kaz-uzb);
   • Adding dictionary words to the Kazakh-Uzbek pair, increasing the coverage to above 80%;
   • Reducing WER on the Kazakh-Uzbek pair (goal: below 30%);
   • Adding apertium-separable to the kaz-uzb pair;
   • Writing lexical selection rules for better translation accuracy;
   • Creating testvoc for testing;
   • Introducing apertium-recursive.
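
The two numeric targets above (coverage above 80%, WER below 30%) can be checked with a small script. The following is only a minimal sketch, not the tooling Apertium itself provides: it assumes the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token has an analysis starting with `*`. The function names `naive_coverage` and `word_error_rate` and the sample tokens are made up for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Share of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # An unknown token is marked by the analyser with a leading '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real Kazakh or Uzbek data.
    sample = "^a/a<n>$ ^b/*b$ ^c/c<v>$ ^d/d<adj>$"
    print(naive_coverage(sample))                # 0.75
    print(word_error_rate("a b c d", "a x c"))   # 0.5
```

With these definitions, the goals correspond to `naive_coverage(...) > 0.8` on an analysed corpus and `word_error_rate(...) < 0.3` against a post-edited reference translation.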


Workplan

This part is still being created...

Time Period | Goal | Details
Community Bonding Period

May 17-June 5

  • Installing Apertium
  • Initialize kaz-uzb pair
  • Collect data in both languages
  • Installing Apertium and necessary tools;
  • Send the first PR that can translate a small sample text;
  • Extract Uzbek and Kazakh wiki corpus;
  • Collect Uzbek and Kazakh web (non-wiki) corpus;
  • Collect Kazakh-Uzbek dictionary and parallel corpora;
Week 1

June 6-12

Make Uzbek better
  • Go through all Uzbek stems in uzb.lexc;
  • Clean (deduplicate) and correct uzb stems;
  • Improve Uzbek lexicon;
Week 2

June 13-19

Expand bilingual dictionary
  • Start adding bilingual dictionary elements;
Week 3

June 20-26

More on .dix and .lrx
  • Expand bilingual dictionary;
  • Lexical selection rules;
Week 4

June 27-July 3

More on .dix and .lrx
  • Expand bilingual dictionary;
  • Lexical selection rules;
Week 5

July 4-10

Test translator and expand more
  • Test the kaz-uzb translator;
  • Expand the Uzbek lexicon with missing words;
  • Expand bilingual dictionary;
  • Expand lexical selection rules;
Week 6

July 11-17

Focus on transfer rules
  • Start working on transfer rules;
  • More bilingual dictionary;
  • More lexical selection rules;
Week 7

July 18-24

Test the kaz-uzb translator
  • Test the kaz-uzb translator;
  • Extend the Uzbek lexicon with missing words;
  • Extend the Kazakh lexicon with missing words;
  • Extend bilingual dictionary;
  • Add more lexical selection rules;
Week 8

July 25-31

Focus on transfer rules
  • Add words, rules;
  • Work on transfer rules;
  • Start the testvoc;
Week 9

August 1-7

Focus on testvoc
  • Add words, rules;
  • Transfer rules kaz-uzb;
  • Testvoc kaz-uzb;
Week 10

August 8-14

Finalize work
  • Test the kaz-uzb translator;
  • Check the transfer rules;
  • Check the testvoc;
  • Write the final report;

Skills and qualifications

   • Academic skills: I am currently a first-year master's student in Advanced English Studies in Spain.
   • Language skills: Uzbek (native); English (advanced); Russian, Kazakh, Kyrgyz (basic).
   • Programming skills: I have a basic understanding of XML and other markup languages in general, and I can work with bash scripts. I can also easily get help from people close to me when actual coding is needed.

Declaration of Honour

I declare that I can spend the required number of hours working with Apertium during the community bonding period and the actual working period over the summer. Should sudden changes in my personal life affect my working hours, I will immediately inform the mentors and seek their permission, on the condition that I fulfill the requirements even if the official end date has passed.