User:Qareken


GSoC 2019: Adopt an unreleased language pair [1]

Contact information

Name: Kalabaev Sharapat

Location: Tashkent, Uzbekistan

E-mail address: kalabaevshj@gmail.com

Tel number: +998911341226

IRC: qareken

SourceForge: qareken

Github: sharapat

Timezone: GMT+5

Why is it you are interested in machine translation?

In today's world, many minority languages are under threat of extinction because too little information about them is available. Modern technology can have a significant impact on preserving them. One way to achieve this is to make machine translation platforms widely available, which ensures that these languages stay in broad use.

Why is it that you are interested in the Apertium project?

The primary reason is that I regard Apertium as one of the best open-source projects in the field of machine translation. I intend to develop a Karakalpak-Uzbek translation system on the Apertium platform. Such a system would serve as a bridge of communication between the Karakalpak people and the Uzbek government.

Which of the published tasks are you interested in? What do you plan to do?

Title

Adopting the unreleased language pair uzb-kaa.

In this project I am going to develop the uzb-kaa language pair. I have already built rus->kaa and kaa->eng dictionary applications (google play link, github link), so I have access to the largest Karakalpak dictionary, which I am going to use here. I therefore believe I can readily build a transducer for Karakalpak. In addition, I have analyzed the existing repository for the uzb-kaa language pair (https://github.com/apertium/apertium-uzb-kaa/pull/2) and found some linguistic errors that distort the true meaning of words. As a native speaker of Karakalpak, I am able to fix these mistakes.

Reasons why Google and Apertium should sponsor it

Although these languages are closely related, no translator or dictionary between them has been created to date. In addition, this project would open new ways for Karakalpak to connect with other languages, since its current level of coverage is very low.

A description of how and who it will benefit in society

It would be of great help to speakers of Karakalpak, which UNESCO classifies as a vulnerable language. This status is largely due to the scarcity of resources for Karakalpak and how little the language has been studied, so considerable effort should be directed toward it. The main beneficiaries of the project are native Karakalpak speakers, who will gain convenient access to a much wider body of knowledge.

Work plan

Community bonding period (May 6 - 27):

  • Get familiar with the Apertium tools and community
  • Find language resources for Karakalpak and Uzbek
  • Begin editing the Uzbek-Karakalpak dictionary

Work Period (May 27 - August 19):

Week 1:

  • Begin building the Karakalpak monodix, using the Uzbek monodix as a reference for its target size
  • Check the kaa monodix and fix existing translation errors
  • Add nouns and verbs to the kaa monodix

Week 2:

  • Add adjectives, pronouns, adverbs, conjunctions and prepositions to the kaa monodix

Week 3:

  • Check the transducer
  • Add transfer rules for adjectives and adverbs

Week 4:

  • Run tests
  • Discuss shortcomings of the completed work with the mentor and fix them

Deliverable #1: updated kaa monodix

Week 5:

  • Add verbs to the uzb-kaa bidix and add the necessary uzb-kaa transfer rules

Week 6:

  • Add pronouns, adverbs and other word classes to the uzb-kaa bidix and add the necessary uzb-kaa transfer rules

Week 7:

  • Add determiners and more adjectives to the uzb-kaa bidix
  • Test on a ~500-word story (achieve WER < 20%)
  • Add rules for agreement between verbs and pronouns

Week 8:

  • Work on transfer rules in the .t2x and .t3x files
  • Test the uzb-kaa bidix
  • Discuss shortcomings of the completed work with the mentor and fix them

Deliverable #2: updated kaa monodix, uzb-kaa bidix and uzb-kaa transfer rules

Week 9:

  • Check the kaa and uzb monodix

Week 10:

  • Test on a ~1000-word story and achieve WER < 10% on it

Deliverable #3: finished kaa monodix, updated uzb monodix, uzb-kaa bidix and uzb-kaa transfer rules

Week 11:

  • Try to achieve WER < 10% on longer stories
  • Discuss the completed work with the mentor

Week 12:

  • Evaluate the results and write documentation

Project completion:

  • Tidying up, releasing
  • Final evaluation
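
The WER targets in the work plan (below 20% in week 7, below 10% from week 10) refer to word error rate: the word-level edit distance between the system output and a reference translation, divided by the number of words in the reference. The following is a minimal sketch of that computation, assuming whitespace tokenization and equal cost for substitutions, insertions and deletions. The function name `word_error_rate` is illustrative and not part of any Apertium tool; in practice an existing evaluation script such as apertium-eval-translator would likely be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and every edit
    (substitution, insertion, deletion) costs 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty output
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens, not real Uzbek or Karakalpak text.
    reference = "a b c d"
    hypothesis = "a x c"
    print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

With this definition, a WER below 10% on a ~1000-word story means fewer than about 100 word-level edits are needed to turn the system output into the reference translation.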

List your skills and give evidence of your qualifications

I am in the 4th year of a Bachelor's degree in the Software Engineering faculty at the Tashkent University of Information Technology named after Al-Khwarizmi. My native language is Karakalpak [kaa], and I also know Uzbek [uzb] well, mainly because the two languages are similar and because I live and study in Tashkent, Uzbekistan. Programming skills: C, C++, Java, Kotlin, Python, git and XML.

List any non-Summer-of-Code plans you have for the Summer

I have no non-GSoC plans for the summer and can contribute 30 to 40 hours a week. However, my classes finish in the middle of June. Therefore, if that is acceptable, I would like to work about 20 hours per week in the first month and about 40-50 hours per week in the second and third months to compensate.