User:Elmurod1202/GSoC2020 Proposal

GSoC 2020: State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr...

Progress can be seen here

The Final Report can be seen here


Contact Information

Name: Elmurod Kuriyozov

Nationality: Uzbekistan

Location: A Coruña, Spain

University: Universidade da Coruña

Email: elmurod1202@gmail.com

IRC: elmurod1202

Timezone: GMT+2

GitHub: elmurod1202


Why is it you are interested in machine translation?

My interest in machine translation began during my master's degree, when I was partially involved in my supervisor's project to create NLP tools for the Uzbek language and wanted to improve the quality of translation between my native language (Uzbek) and other languages. I am now doing a Ph.D. in Computational Linguistics, so machine translation is part of my doctoral work.

Why is it that you are interested in the Apertium project?

  • Apertium is free and open-source;
  • Apertium focuses mainly on machine translation for low-resource languages, which fits exactly what I am currently working on;
  • Apertium has a large community where I can easily find people who can help and support me.

Which of the published tasks are you interested in? What do you plan to do?

Contributing to the language resources and enhancing language pairs’ translation quality.

Title

State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr...


Major goals

With a background in Natural Language Processing (NLP), I have decided to conduct my research on creating NLP resources for low-resource Turkic languages, with a special focus on my native language, Uzbek. Uzbek has more than 30 million native speakers, yet there are almost no reliable NLP resources for it, and those that exist are only commercially available. In short, my proposal is the following:

   • Creating a high-accuracy morphological analyser for Uzbek by contributing to the currently existing one;
   • Reducing WER (word error rate) on the tur-uzb pair (goal: below 20%);
   • Increasing naïve coverage of the tur-uzb pair (goal: up to 90%);
   • Cleaning testvoc, introducing apertium-recursive.
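
The two numeric goals above can be made concrete with a small sketch. It assumes that WER is the word-level edit distance between a reference translation and the system output, divided by the reference length, and that naïve coverage is the share of corpus tokens that receive at least one analysis. The names `word_error_rate`, `naive_coverage` and `has_analysis` are hypothetical and chosen for illustration only; this is not Apertium's own evaluation tooling.

```python
from typing import Callable, Iterable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far (before the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def naive_coverage(tokens: Iterable[str],
                   has_analysis: Callable[[str], bool]) -> float:
    """Fraction of tokens for which the analyser returns any analysis.

    `has_analysis` is a hypothetical stand-in for a lookup in the
    morphological analyser; here it is just a predicate on a token.
    """
    token_list = list(tokens)
    if not token_list:
        raise ValueError("corpus must contain at least one token")
    covered = sum(1 for token in token_list if has_analysis(token))
    return covered / len(token_list)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    print(word_error_rate("a b c d e", "a x c d e"))
    known = {"a", "b", "c"}
    print(naive_coverage(["a", "b", "c", "d"], lambda t: t in known))
```

Under these definitions, the goals correspond to `word_error_rate` averaging below 0.20 on a test set and `naive_coverage` reaching 0.90 on a representative corpus.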

Reasons why Google and Apertium should sponsor it

Uzbek has more than 30 million native speakers and is the official language of Uzbekistan. It is also spoken in neighbouring countries in Central Asia, in some parts of the Russian Federation, and by a minority in China. Despite having so many speakers, for whom language resources are crucial, Uzbek is considered a heavily under-resourced language. My aim is therefore to create free and open-source NLP resources for Uzbek. The Apertium project suits this well, because it already has resources that I can contribute to and bring closer to the community in need. My main goal is to raise Apertium to the level of an official recommendation in Uzbekistan for translating Uzbek documents into other related languages.

Work plan

Community bonding period (May 4 - June 1):

  • Getting familiar with the Apertium tools and community;
  • Finding out the current state of Uzbek language;
  • Finding out which Uzbek resources are available;
  • Learning more about HFST;
  • Doing the coding challenge;
  • Finding out initial WER and naïve coverage of tur-uzb pair.

Work Period (June 1 - August 31):

Week 1:

  • Introducing apertium-separable to the tur-uzb pair

Week 2,3:

  • Adding more stems to bilingual dictionary;
  • Transfer rules refactoring;
  • Reducing WER;

Week 4:

  • Running tests
  • Updating documentation
  • Preparing for the first evaluation

Deliverable 1: Reduced WER of the tur-uzb pair (goal: down to 20%)

Week 5,6,7:

  • More work on apertium-separable
  • Extending bilingual dictionary
  • Increasing naïve coverage

Week 8:

  • Running tests
  • Updating documentation
  • Preparing for the second evaluation

Deliverable 2: Increased naïve coverage of the tur-uzb pair (goal up to 90%)

Week 9,10,11:

  • Extending bilingual dictionary, adding more multiwords
  • Work more on transfer rules
  • Cleaning testvoc

Week 12:

  • Running final tests, fixing issues
  • Revising the entire documentation and final check-ups
  • Making the project ready for final evaluation

Deliverable 3: Achieving clean translation output

List your skills and give evidence of your qualifications

Educational qualifications:

   • BSc in Applied Mathematics and Informatics, UrSU, Uzbekistan;
   • Master's degree in Applied Mathematics and Information Technologies, SamSU, Uzbekistan;
   • Currently pursuing a Ph.D. in Computational Linguistics at the University of A Coruña, A Coruña, Spain.

I have been carrying out my PhD research since 2018 on the topic “Creating NLP Resources for low-resource Turkic languages, with a specific focus on Uzbek”. So far, my papers include:

   • “Deep Learning and Machine Learning methods for Sentiment Analysis in the Uzbek Language” (LTC2019, Best Student work award);
   • “Cross-Lingual Word Embeddings for Turkic Languages” (Accepted, LREC2020);
   • “Unsupervised and semi-supervised morphological segmentation analysis for Uzbek language” (Under process).

I am a native speaker of Uzbek and have a basic understanding of Kazakh, Kyrgyz, Karakalpak and Uyghur. I speak English fluently and have a good command of Russian. I have been studying NLP for more than a year and have a good knowledge of machine translation.

Coding Challenge

As a bachelor's student between 2010 and 2014, I actively participated in the ACM ICPC (International Collegiate Programming Contest), twice won the quarter-final and was able to attend the pre-finals. As a master's student I learned web programming, acquired Java, PHP, JavaScript, and MySQL skills, and created some websites. As a PhD researcher, I do my research computations mainly in Python.

List any non-Summer-of-Code plans you have for the Summer

I can devote my full time to this project, meaning at least 30 hours per week, since working with Apertium is my highest priority during the summer. It will also be part of my thesis work. I am not planning to take any summer classes, I have no trips planned, and I am currently unemployed. There is only one thing: I will have to travel from Spain back to Uzbekistan in the summer, but it will not take more than 3-4 days to fly and settle in. My return home will not affect my productivity, since I have my own room at Urgench State University, Uzbekistan, where I can go every day and continue working.