User:Ogabek

From Apertium

GSOC 2019 : Extend weighted transfer rules[1]

Personal Details

Contact Information

Name : Ogabek Yusupov
Location : Tashkent, Uzbekistan
Phone number : +998941155873
Email : ogabekyusupov@gmail.com
IRC : ogabek
Github : ogabek96
Timezone : GMT + 5

Education

Fourth-year bachelor's student in the Software Engineering Faculty at Tashkent University of Information Technologies named after Muhammad al-Khwarizmi.


Technical skills

Programming languages: C++, Java, JavaScript, PHP
Databases: MySQL, PostgreSQL
Frameworks: Express.js
Operating systems: Linux, Windows


Related projects

Open-source Uzbek-Korean language dictionary


Related work experience

Volunteered on Google Translate: translated sentences from English into Uzbek.
Participated in Lionbridge language research: recorded my voice reading sentences written in Uzbek and submitted the audio files.


Languages

Uzbek (native), English, Russian


Why is it you are interested in machine translation?

I have always been fascinated by machine translation, and I am an active user of it. Machine translation is in greater demand than ever because people travel more than before, and it takes down language barriers. Although translation quality has improved significantly in recent years, we cannot fully rely on it because translations still contain errors. As a computer science student, I think it is my responsibility to make it better.


Why is it that you are interested in the Apertium project?

The first attribute of the Apertium platform that drew my attention is that it is open-source. Most existing platforms are not free, and users cannot use them freely in their own projects. As a supporter of open source, I find this project interesting. Another thing I like about this project is that many members actively contribute to Turkic languages. As a native speaker of Uzbek, I want to improve translation for my native language too. My contribution to this project will be improving the Turkish<->Uzbek language pair, because it has not been updated for four years.


Which of the published tasks are you interested in? What do you plan to do?

Title

Bring a released Turkish<->Uzbek language pair up to state-of-the-art quality. I am also ready to fix technical errors, since I have some experience in software development.

Reasons why Google and Apertium should sponsor it

Although Uzbek and Turkish belong to the same language group, there is no adequate translation platform for this pair on the internet. Also, although Uzbek has 33 million native speakers, it is not widely used on the internet, and the information available online in Uzbek is very limited. I believe that my contribution to this platform will raise the popularity of the Uzbek language.


A description of how and who it will benefit in society

Firstly, it will benefit app developers: since Apertium is open-source, anyone can use it in their projects. Secondly, relations between Uzbekistan and Turkey are improving, and many visitors come from Turkey to Uzbekistan for business or tourism. Releasing the Turkish<->Uzbek language pair will help take down the language barrier between these nations.


Working plan

Coding challenge (until May 1)

Install Apertium
Create a wiki page on Apertium
Fork an existing language pair and set up Apertium to add data to it
Preliminary evaluation: translate the story and try to improve the translation as much as possible
Learn as much as possible about the Apertium platform

Community Bonding Period (May 6 - May 27)

Get to know the Apertium community
Learn more about machine translation
Read the Apertium documentation and explore the .dix, lexc and other formats used in apertium-uzb to understand how they work
Collect resources in Turkish and Uzbek

Work Period (May 27 - August 26)

Week 1:

Edit apertium-uzb.uzb.lexc and correct existing translation errors.
Add transfer rules for nouns and pronouns.
Start working on pronouns, adverbs, and adjectives; add appropriate rules/stems.

Week 2:

Add transfer rules for adjectives and adverbs.
Take another 500-word story.

Week 3:

Finish with lexical selection rules and chunking.
Start working on disambiguation and its solutions
Refactoring and documentation.

Week 4:

Run corpus testing to analyze the improvement.
Improve morphological analyzer

Week 5:

Find good parallel corpora and add words to apertium-uzb in decreasing order of frequency. Coverage ~45%
In parallel, start working on the tur-uzb bilingual dictionary
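
The frequency-driven approach for Week 5 can be illustrated with a small sketch. This is not part of the Apertium toolchain: the `known_forms` set is a hypothetical stand-in for whatever the apertium-uzb analyser already recognises, and the tokenizer is deliberately naive (it splits on non-word characters, so Uzbek words written with an ASCII apostrophe would be broken up). It only shows how naive coverage and a "most frequent unknown words first" list could be computed from a corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, for illustration)."""
    return re.findall(r"\w+", text.lower())


def coverage(tokens, known_forms):
    """Fraction of corpus tokens that appear in the set of known word forms."""
    if not tokens:
        raise ValueError("corpus is empty")
    return sum(1 for t in tokens if t in known_forms) / len(tokens)


def unknown_by_frequency(tokens, known_forms):
    """Unknown word forms with their counts, most frequent first."""
    counts = Counter(t for t in tokens if t not in known_forms)
    return counts.most_common()


if __name__ == "__main__":
    # Toy corpus and a hypothetical set of forms the analyser already knows.
    corpus = "kitob kitob uy maktab maktab maktab"
    known_forms = {"kitob"}
    tokens = tokenize(corpus)
    print(f"coverage: {coverage(tokens, known_forms):.2%}")
    for word, count in unknown_by_frequency(tokens, known_forms):
        print(count, word)
```

Working through the unknown list from the top means each added stem raises coverage as much as possible for the effort spent.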

Week 6:

Work on a ~ 700-word story
Calculate WER, PER, and document
Even up nouns, pronouns
Even up for verbs, adjectives, adverbs
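
As a reminder of what the two metrics in Week 6 measure, here is a minimal sketch, assuming whitespace tokenization and a single reference translation per sentence. WER is the word-level edit distance between the machine output and the post-edited reference, divided by the reference length; PER is computed here in one common formulation that ignores word order. The function names are illustrative only; for the actual evaluation, Apertium's own evaluation tooling would be used rather than this code.

```python
from collections import Counter


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # Multiset intersection counts words present in both sentences.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "a b c d"
    hypothesis = "a c b d"
    print("WER:", wer(reference, hypothesis))
    print("PER:", per(reference, hypothesis))
```

A gap between the two numbers is informative: a high WER with a low PER suggests the right words are being produced in the wrong order, which points at transfer rules rather than the dictionaries.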

Week 7:

Testvoc clean for all classes
Working on transfer grammar rules (t1x) using the common rules generated from post-edit analysis

Week 8:

Continue working on the tur-uzb pair:
Add transfer rules for nouns, pronouns
Add transfer rules for verbs, adjectives, adverbs
Start working on CG and disambiguation

Week 9:

Continue working on disambiguation and its solutions.
Add required transfer/lexical selection rules to improve WER, PER.
Begin with chunking and t3x

Week 10:

Week 11:

Week 12:

List any non-Summer-of-Code plans you have for the Summer.

I don't have any non-GSoC plans for the summer. I have university exams in July, which last two weeks; during this period I will spend 20 hours a week on the project. The rest of the time I can dedicate 40 hours a week.