Difference between revisions of "User:Elmurod1202/GSoC2020 Proposal"
Latest revision as of 22:42, 29 August 2020
GSoC 2020: State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr,..
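Progress can be seen here: https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020Progress
The Final Report can be seen here: https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Final_Report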
Contents
- 1 Contact Information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Title
- 6 Major goals
- 7 Reasons why Google and Apertium should sponsor it
- 8 Work plan
- 9 List your skills and give evidence of your qualifications
- 10 Coding Challenge
- 11 List any non-Summer-of-Code plans you have for the Summer
Contact Information
Name: Elmurod Kuriyozov
Nationality: Uzbekistan
Location: A Coruña, Spain
University: Universidade da Coruña
Email: elmurod1202@gmail.com
IRC: elmurod1202
Timezone: GMT+2
Github: elmurod1202
Why is it you are interested in machine translation?
My interest in machine translation began during my master's degree, when I was partially involved in my supervisor's project to create NLP tools for Uzbek, my native language, and wanted to improve the quality of translation between Uzbek and other languages. I am now doing a Ph.D. in Computational Linguistics, so machine translation is part of my Ph.D. work.
Why is it that you are interested in the Apertium project?
- Apertium is free and open-source;
- Apertium focuses on machine translation mainly for low-resource languages, which matches what I am currently working on;
- Apertium has a large community where I can easily find people who can help and support me.
Which of the published tasks are you interested in? What do you plan to do?
Contributing to the language resources and enhancing language pairs’ translation quality.
Title
State-of-the-art Morphological Analyser for Uzbek language and improved language pairs: uz-kk, uz-ky, uz-tr...
Major goals
With a background in Natural Language Processing (NLP), I have decided to focus my research on creating NLP resources for low-resource Turkic languages, with a special focus on my native language, Uzbek. Although Uzbek has more than 30 million native speakers, there are almost no reliable NLP resources for it, and the few that exist are only commercially available. In short, my proposal is the following:
• Creating a high-accuracy morphological analyser for Uzbek by contributing to the currently existing one;
• Reducing WER (word error rate) on the tur-uzb pair (goal: below 20%);
• Increasing naïve coverage of the tur-uzb pair (goal of up to 90%);
• Cleaning testvoc, introducing apertium-recursive.
Reasons why Google and Apertium should sponsor it
Uzbek has more than 30 million native speakers and is the official language of Uzbekistan. It is also spoken in neighbouring countries in Central Asia, in some parts of the Russian Federation, and by a minority in China. Despite having so many speakers, for whom language resources are crucial, Uzbek is considered a heavily under-resourced language. My aim is therefore to create free and open-source NLP resources for Uzbek. The Apertium project suits this well, because it already has resources that I can contribute to and bring closer to the community that needs them. My main goal includes making Apertium the officially recommended tool in Uzbekistan for translating documents from Uzbek into other related languages.
Work plan
Community bonding period (May 4 - June 1):
- Getting familiar with Apertium tools and community;
- Finding out the current state of Uzbek language;
- Surveying the available Uzbek resources;
- Learning more about HFST;
- Completing the coding challenge;
- Measuring the initial WER and naïve coverage of the tur-uzb pair.
Work Period (June 1 - August 31):
Week 1:
- Introducing apertium-separable to the tur-uzb pair
Week 2,3:
- Adding more stems to bilingual dictionary;
- Transfer rules refactoring;
- Reducing WER;
Week 4:
- Running tests
- Updating documentation
- Preparing for the first evaluation
Deliverable 1: Reduced WER of the tur-uzb pair (goal: down to 20%)
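For reference, WER (word error rate) is a measure where lower is better: it is the word-level edit distance between the system translation and a reference translation, divided by the number of words in the reference. The following is only a minimal sketch of that definition; the function name and the placeholder sentences are assumptions made for illustration, and a real evaluation would normally rely on Apertium's own evaluation tooling and a post-edited reference text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; not part of Apertium.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real tur-uzb data: one substitution in four words.
print(word_error_rate("w1 w2 w3 w4", "w1 xx w3 w4"))
```

Under this definition, the 20% goal means that on average at most one word in five of the output would need to be changed to match the reference.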
Week 5,6,7:
- More work on apertium-separable
- Extending bilingual dictionary
- Increasing naïve coverage
Week 8:
- Running tests
- Updating documentation
- Preparing for the second evaluation
Deliverable 2: Increased naïve coverage of the tur-uzb pair (goal up to 90%)
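As a point of reference, naïve coverage is the share of corpus tokens for which the morphological analyser returns at least one analysis, whether or not that analysis is correct. The sketch below assumes the analyser output is in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token has a single analysis starting with an asterisk. The function name and the sample analyses are made up for illustration, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis[/analysis...]$
# Simplification (assumption): escaped characters such as \/ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Fraction of tokens that received at least one analysis.

    Hypothetical helper for illustration; not part of Apertium.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found")
    # Unknown tokens are marked by an analysis beginning with '*'.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


# Made-up analyser output: two known tokens and two unknown ones.
sample = "^aaa/aaa<n>$ ^bbb/*bbb$ ^ccc/ccc<v>/ccc<n>$ ^ddd/*ddd$"
print(naive_coverage(sample))
```

A 90% goal therefore means that at most one token in ten of the test corpus would be left unanalysed.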
Week 9,10,11:
- Extending bilingual dictionary, adding more multiwords
- Work more on transfer rules
- Cleaning testvoc
Week 12:
- Running final tests, fixing issues
- Revising the entire documentation and final check-ups
- Making the project ready for final evaluation
Deliverable 3: Achieving clean translation output
List your skills and give evidence of your qualifications
Educational qualifications:
• BSc in Applied Mathematics and Informatics, UrSU, Uzbekistan;
• Master's degree in Applied Mathematics and Information Technologies, SamSU, Uzbekistan;
• Currently studying for a Ph.D. in Computational Linguistics at the University of A Coruña, A Coruña, Spain.
I have been carrying out my PhD research since 2018 on the topic: “Creating NLP Resources for low-resource Turkic languages, with a specific focus on Uzbek”. So far, my papers include:
• “Deep Learning and Machine Learning methods for Sentiment Analysis in the Uzbek Language” (LTC2019, Best Student work award);
• “Cross-Lingual Word Embeddings for Turkic Languages” (Accepted, LREC2020);
• “Unsupervised and semi-supervised morphological segmentation analysis for Uzbek language” (in progress).
I am a native speaker of Uzbek and have a basic understanding of Kazakh, Kyrgyz, Karakalpak and Uyghur. I speak English fluently and have a good command of Russian. I have been studying NLP for more than a year and have a good knowledge of machine translation.
Coding Challenge
As a bachelor's student between 2010 and 2014, I actively participated in the ACM ICPC (International Collegiate Programming Contest), twice winning the quarter-final and qualifying for the pre-finals. As a master's student, I learned web programming, acquired Java, PHP, JavaScript and MySQL skills, and created some websites. As a PhD researcher, I do my research mainly in Python.
List any non-Summer-of-Code plans you have for the Summer
I can devote my full time to this project, meaning at least 30 hours per week, since working with Apertium is my highest priority this summer. It will also be part of my thesis work. I am not planning to take any summer classes, I have no trips planned, and I am currently unemployed. The only exception is that I will have to travel from Spain back to Uzbekistan in the summer, but it will not take more than 3-4 days to fly and settle. My return home will not affect my productivity, since I have my own room at Urgench State University, Uzbekistan, where I can go every day and continue working.