User:Elmurod1202/GSoC2020 Proposal


GSoC 2020: State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr...

Contact Information

Name: Elmurod Kuriyozov

Nationality: Uzbekistan

Location: A Coruña, Spain

University: Universidade da Coruña

Email: elmurod1202@gmail.com

IRC: elmurod1202

Timezone: GMT+2

Github: elmurod1202


Why is it you are interested in machine translation?

My interest began during my master's degree, when my supervisor ran a project to create NLP tools for the Uzbek language and I was partially involved in it. Since then I have wanted to improve the quality of translation between my native language (Uzbek) and other languages. I am now doing a Ph.D. in Computational Linguistics, and machine translation is part of that work.

Why is it that you are interested in the Apertium project?

  • Apertium is free and open-source;
  • Apertium focuses on machine translation for low-resource languages, which fits exactly what I am currently working on;
  • Apertium has a large community where I can easily find people who can help and support me.

Which of the published tasks are you interested in? What do you plan to do?

Contributing to the language resources and enhancing language pairs’ translation quality.

Title

State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr...


Major goals

With a background in Natural Language Processing (NLP), I have decided to focus my research on creating NLP resources for low-resource Turkic languages, with a special focus on my native language, Uzbek. Uzbek has more than 30 million native speakers, yet the NLP resources for it are either unreliable or only commercially available. In short, my proposal is the following:

   • Creating a high-accuracy morphological analyser for Uzbek by contributing to the currently existing one;
   • Reducing WER (word error rate) on the tur-uzb pair (goal: below 20%);
   • Increasing naïve coverage of the tur-uzb pair (goal: up to 90%);
   • Cleaning testvoc, introducing apertium-recursive.
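
The two metrics named in the goals above can be illustrated with a minimal sketch. It assumes WER is the word-level edit distance between a reference translation and the system output, divided by the reference length, and that naïve coverage is the share of tokens that receive at least one analysis, with unknown tokens marked by a leading `*` in the analyser's stream output. The function names and the sample strings are hypothetical and are not part of any Apertium tool.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One lexical unit in stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Share of lexical units whose analysis is not marked unknown (*)."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return known / len(units)


if __name__ == "__main__":
    # Hypothetical reference and system output: one substituted word of three.
    print(word_error_rate("a b c", "a b d"))
    # Hypothetical analyser output: one known token, one unknown token.
    print(naive_coverage("^kitob/kitob<n><nom>$ ^xyz/*xyz$"))
```

Under these definitions, the goals correspond to `word_error_rate` below 0.20 on a test set and `naive_coverage` of up to 0.90 on a corpus.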

Reasons why Google and Apertium should sponsor it

Uzbek has more than 30 million native speakers and is the official language of Uzbekistan. It is also spoken in neighbouring countries in Central Asia, in some parts of the Russian Federation, and by a minority in China. Despite this large number of speakers and the clear need for language resources, Uzbek is still considered a heavily under-resourced language. My aim is therefore to create free and open-source NLP resources for Uzbek. The Apertium project suits this well, because it already has resources that I can contribute to and bring closer to the community that needs them. One of my main goals is for Apertium to become an officially recommended tool in Uzbekistan for translating documents from Uzbek into related languages.

Work plan

Community bonding period (May 4 - June 1):

  • Getting familiar with the Apertium tools and community
  • Finding out the current state of Uzbek in Apertium
  • Finding out which Uzbek resources are available
  • Learning more about HFST
  • Doing the coding challenge
  • Beginning to interact with Apertium's core system

Work Period (June 1 - August 31):

  • This part will be updated soon.


List your skills and give evidence of your qualifications

Educational qualifications:

   • Graduated BSc in Applied Mathematics and Informatics, UrSU, Uzbekistan;
   • Graduated Master in Applied Mathematics and Information Technologies, SamSU, Uzbekistan;
   • Currently studying for a Ph.D. in Computational Linguistics at the University of A Coruña, A Coruña, Spain.

I have been carrying out my PhD research since 2018 in the topic: “Creating NLP Resources for low-resource Turkic languages, with a specific focus on Uzbek”. So far, my published papers include:

   • “Deep Learning and Machine Learning methods for Sentiment Analysis in the Uzbek Language”(LTC2019, Best Student work award) 
   • “Cross-Lingual Word Embeddings for Turkic Languages”(Accepted, LREC2020)
   • “Unsupervised and semi-supervised morphological segmentation analysis for Uzbek language” (in progress).

I am a native speaker of Uzbek and have a basic understanding of Kazakh, Kyrgyz, Karakalpak and Uyghur. I speak English fluently and have a good command of Russian. I have been studying NLP for more than a year and have a good knowledge of machine translation.

Coding Challenge

As a bachelor's student between 2010 and 2014, I actively participated in the ACM ICPC (International Collegiate Programming Contest), twice won the quarter-final, and qualified for the pre-finals. As a master's student I learned web programming, acquired Java, PHP, JavaScript, and MySQL skills, and built several websites. As a PhD researcher, I do my research mainly in Python.

List any non-Summer-of-Code plans you have for the Summer

I can work on this project full time, meaning at least 30 hours per week, since working with Apertium is my highest priority during the summer. The work is going to be part of my thesis. I am not planning to take any summer classes, I have no trips planned, and I am currently unemployed. There is only one thing: I will have to travel from Spain back to Uzbekistan in the summer, but it won’t take more than 3-4 days to fly and settle in. My return home won’t affect my productivity, since I have my own room at Urgench State University, Uzbekistan, where I can go every day and continue working.