User:Ogabek

GSOC 2019 : Extend weighted transfer rules[1]

Personal Details

Contact Information

Name : Ogabek Yusupov
Location : Tashkent, Uzbekistan
Phone number : +998941155873
Email : ogabekyusupov@gmail.com
IRC : ogabek
Github : ogabek96
Timezone : GMT + 5

Education

Fourth-year bachelor's student at the Software Engineering Faculty of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi.


Technical skills

Programming languages: C++, Java, JavaScript, PHP
Databases: MySQL, PostgreSQL
Frameworks: Express.js
Operating systems: Linux, Windows


Related projects

Open-source Uzbek-Korean language dictionary


Related work experience

Volunteered for Google Translate: translated sentences from English into Uzbek.
Participated in Lionbridge Language Research: recorded my voice reading sentences written in Uzbek and submitted the audio files.


Languages

Uzbek (native), English, Russian


Why is it that you are interested in machine translation?

I have always been fascinated by machine translation, and I am an active user of it. Machine translation is in greater demand now than ever: people travel more than before, and it breaks down language barriers. Although translation quality has improved significantly in recent years, we still cannot fully rely on it because translations contain errors. As a computer science student, I feel it is my responsibility to help make it better.


Why is it that you are interested in the Apertium project?

The first attribute of the Apertium platform that drew my attention is that it is open source. Most existing platforms today are not free, and users cannot freely use them in their own projects. Since I am a supporter of open source, I find this project interesting. Another thing I like about this project is that many members actively contribute to Turkic languages. Since I am a native speaker of Uzbek, I want to improve translation for my native language too. My contribution to this project will be improving the Turkish<->Uzbek language pair, because it has not been updated for four years.


Which of the published tasks are you interested in? What do you plan to do?

Title

Bring a released Turkish<->Uzbek language pair up to state-of-the-art quality. I am also ready to fix technical errors, since I have some experience in software development.

Reasons why Google and Apertium should sponsor it

Although Uzbek and Turkish belong to the same language group, there is no adequate translation platform for them on the internet. Moreover, although Uzbek has 33 million native speakers, it is not widely used on the internet, and the information available there is very limited. I believe that my contribution to this platform will raise the popularity of the Uzbek language.


A description of how and who it will benefit in society

First, it will benefit app developers: since Apertium is open source, anyone can use it in their projects. Second, relations between Uzbekistan and Turkey are improving, and many visitors travel from Turkey to Uzbekistan for business or tourism. Releasing the Turkish<->Uzbek language pair will help break down language barriers between the two nations.


Working plan

Coding challenge (until May 1)

Installing Apertium
Creating a wiki page on Apertium
Forking an existing language pair and setting up Apertium to add data to it.
Preliminary evaluation: translate the story and improve the translation as much as possible.
Learn as much as possible about the Apertium platform.

Community Bonding Period (May 6 - May 27)

Get to know the Apertium community
Learn more about machine translation
Read the Apertium documentation and explore the .dix, lexc, and other formats used in apertium-uzb to understand how they work
Collecting resources in Turkish and Uzbek

Work Period (May 27 - August 26)

Week 1:

Edit apertium-uzb.uzb.lexc and correct existing translation errors.
Write test scripts.
Add transfer rules for nouns and pronouns.
Start working on pronouns, adverbs, and adjectives.
Add appropriate rules/stems.
Achieve a WER < 20% on one basic text.
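
The WER targets in this plan can be checked with a small script. The following is only a minimal sketch, assuming WER is taken as the word-level edit distance between the machine translation and its post-edited version, divided by the length of the post-edited reference. The function name `wer` and the whitespace tokenisation are illustrative assumptions, not part of any Apertium tool.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Tokenisation is a plain whitespace split (an assumption made for this
    sketch; a real evaluation would likely normalise punctuation and case).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(f"WER = {wer('a b c d', 'a x c d'):.0%}")
```

Because the denominator is the reference length, the value can exceed 100% when the hypothesis is much longer than the reference.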

Week 2:

Add transfer rules for adjectives and adverbs.
Take another 500-word story.
Target: WER < 50%.
Post-edit the translated texts, look for recurring patterns, and add rules for them.

Week 3:

Finish with lexical selection rules and chunking.
Start working on disambiguation and its solutions
Refactoring and documentation.

Week 4:

Run corpus testing to analyze the improvement.
Improve the morphological analyzer

Deliverable #1

Week 5:

Find good parallel corpora and add words to apertium-uzb in order of decreasing frequency.
Coverage ~45%
In parallel, start working on the tur-uzb bilingual dictionary
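
Coverage and the frequency-ordered list of missing words can both be derived from analysed text. The sketch below assumes the corpus has already been run through the morphological analyser and is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis starts with `*`. The function name `coverage`, the sample sentence, and its tags are illustrative assumptions, and escaped characters in the stream are ignored for simplicity.

```python
import re
from collections import Counter

# One lexical unit in the stream: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed: str):
    """Return (coverage ratio, unknown words sorted by decreasing frequency)."""
    known = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        if analyses.startswith("*"):
            # The analyser had no entry for this surface form.
            unknown[surface.lower()] += 1
        else:
            known += 1
    total = known + sum(unknown.values())
    ratio = known / total if total else 0.0
    return ratio, unknown.most_common()


if __name__ == "__main__":
    # Hypothetical analyser output, made up for this example.
    sample = "^kitob/kitob<n><nom>$ ^xyz/*xyz$ ^xyz/*xyz$ ^uy/uy<n><nom>$"
    ratio, missing = coverage(sample)
    print(f"coverage: {ratio:.0%}")
    for word, count in missing:
        print(count, word)
```

The unknown-word list gives the order in which to add stems: the most frequent missing words raise coverage the most.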

Week 6:

Work on a ~ 700-word story
Calculate WER, PER, and document
Target WER <=40%
Even up nouns and pronouns
Even up verbs, adjectives, and adverbs
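
PER (position-independent error rate) ignores word order, so comparing it with the word error rate shows how many errors come from reordering rather than from wrong words. The following is a minimal sketch of one common definition, treating both sentences as bags of words. The function name `per` and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate under a bag-of-words comparison.

    PER = (max(len(ref), len(hyp)) - matches) / len(ref), where matches is
    the number of words the two sentences share, counted with multiplicity.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Counter intersection keeps the minimum count of each shared word.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    # Same words in a different order: no position-independent errors.
    print(f"PER = {per('a b c d', 'd c b a'):.0%}")
    # One wrong word out of four.
    print(f"PER = {per('a b c d', 'a b x d'):.0%}")
```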

Week 7:

Testvoc clean for all classes
Work on transfer rules (t1x) using the common patterns found in the post-edit analysis
WER <=30%
Bidix-coverage ~45%

Week 8:

Continue working on the tur-uzb pair:
Add transfer rules for nouns, pronouns
Add transfer rules for verbs, adjectives, adverbs.
Start working on CG and disambiguation

Deliverable #2

Week 9:

Continue working on disambiguation and its solutions.
Add required transfer/lexical selection rules to improve WER, PER.
Begin with chunking and t3x

Week 10:

Get another ~700 token story for tur-uzb and improve WER.
Target WER <=25%
Regression testing for tur-uzb pair
Evaluate test results, make the required changes, run tests again
User acceptance testing and trial evaluation.

Week 11:

Regression testing for two pairs
Achieve WER < 10% on all previous advanced texts and 3 new advanced texts

Week 12:

Discuss with the mentor any final changes that must be made.
Detailed analysis of what further improvements could be made to the pairs
Evaluation of results and documentation.

Final evaluation

List any non-Summer-of-Code plans you have for the Summer.

I do not have any non-GSoC plans for the summer. I have university exams in July, which last two weeks; during that period I will spend 20 hours a week on this project. The rest of the time I can dedicate 40 hours a week.