User:Aidana/Proposal

Contact information


Name: Aidana Karibayeva
E-mail address: a.s.karibayeva@gmail.com
Nick on the IRC #apertium channel: Aidana
Username at SourceForge: aidana1

Why is it you are interested in machine translation?

Automatic translation matters today because it lets people all over the world understand and obtain information in a foreign language quickly and easily. Building a machine translation system is an interesting process that draws on knowledge of programming, linguistics and languages. Combining these produces a system that can be useful to many people.

Why is it that you are interested in the Apertium project?

I have worked on the Apertium platform since 2014, when I started building a Kazakh-English machine translation system. I chose this platform because it is free and open source, which let me reuse previously developed resources such as the Kazakh transducer and the English-Kazakh bilingual dictionary.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task “Adopt an unreleased language pair”. The language pair I have chosen is Kazakh-English, in the Kazakh-to-English translation direction. I plan to expand the dictionary and write more transfer and lexical selection rules.

Title

Adopt the unreleased Kazakh-English language pair

Reasons why Google and Apertium should sponsor it

I want to improve the current state of the Kazakh-English pair on the Apertium platform so that it produces higher-quality translations, with a larger vocabulary and more transfer rules. The resources could be reused for other language pairs in which Kazakh is the source language.

A description of how and who it will benefit in society

As noted above, the resources developed for the Kazakh-English machine translation system could be reused in other systems; for instance, its transfer rules are already used in Kazakh-Russian MT systems. In addition, English speakers who are interested in Kazakh culture could use it to understand Kazakh texts, songs and news.

List your skills and give evidence of your qualifications

I am currently in the second year of a master's degree in Information Systems at Al-Farabi Kazakh National University in Kazakhstan. My programming skills include C, C++, C# and shell scripting. I speak Kazakh, Russian and English fluently.

List any non-Summer-of-Code plans you have for the Summer

I have no non-GSoC plans for the summer and can spend about 40-50 hours per week on the project.

Work plan

Major goals

I plan to work on the transfer rules and the vocabulary. More precisely, my goals are:

  • Create testvoc
  • Clean testvoc
  • ~25000 stems in bidix (~600 stems per week, or ~100 per day) and ~32000 stems in lexc (~100 stems per week, or ~20 per day)
  • Additional rules: transfer, lexical selection, constraint grammar (~10 per week, or ~2 per day)

Detailed work plan

April – 1 June

  1. Create files with morphological analyses for each POS, to be used as testvoc input
  2. Run the first testvoc with these files
  3. Write ≥3 lexical selection rules
  4. Write ≥3 transfer rules
  5. Write ≥1 disambiguation rule
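
Preparing one testvoc input file per part of speech amounts to grouping analyses by their POS tag. The following is a minimal sketch of that grouping, assuming each analysis is a string in Apertium's lemma-plus-tags notation and that the first tag is the POS. The function name `group_by_pos` and the sample analyses are hypothetical and are not taken from the actual Kazakh transducer.

```python
import re
from collections import defaultdict

# Matches the first <tag> in an analysis such as "кітап<n><nom>".
# Assumption: the first tag names the part of speech.
FIRST_TAG = re.compile(r"<([^<>]+)>")


def group_by_pos(analyses):
    """Group analysis strings by their first tag; untagged lines go to 'untagged'."""
    groups = defaultdict(list)
    for line in analyses:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = FIRST_TAG.search(line)
        pos = match.group(1) if match else "untagged"
        groups[pos].append(line)
    return dict(groups)


# Illustrative analyses only (hypothetical tag sequences).
sample = [
    "үй<n><pl><nom>",
    "кел<v><iv><past><p3><sg>",
    "кітап<n><nom>",
    "жақсы<adj>",
]

by_pos = group_by_pos(sample)
for pos, items in sorted(by_pos.items()):
    print(pos, len(items))
```

Each group could then be written to its own file and passed through the translation pipeline separately, so that problems can be tracked per POS.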

1 June – 23 June

  1. total 21250 stems in dix and 31500 stems in lexc
  2. adding words that are still unknown when translating from Kazakh to English
  3. adding transfer rules

24 June – 1 July

  1. total 21750 stems in dix and 31600 stems in lexc
  2. adding unknown words marked with *
  3. adding transfer rules
  4. adding lexical selection rules

2 July – 10 July

  1. total 22250 stems in dix and 31700 stems in lexc
  2. clean testvoc for #
  3. adding transfer rules
  4. correcting existing transfer rules
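
Progress on cleaning testvoc can be followed by counting how many words in the translated output still carry an error mark. The sketch below assumes the usual Apertium convention as I understand it: `*` marks a word unknown to the analyser, `#` marks a generation error, and `@` (not mentioned in this plan) marks a word missing from the bilingual dictionary. The function name `tally_marks` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Error marks conventionally prefixed to problem words in Apertium output.
# Assumption: '@' is included for completeness; this plan only mentions '*' and '#'.
MARKS = {
    "*": "unknown word",
    "@": "missing in bidix",
    "#": "generation error",
}

# A mark counts only at the start of a whitespace-delimited token.
MARKED_WORD = re.compile(r"(?<!\S)([*@#])\S+")


def tally_marks(lines):
    """Count marked words per mark across all output lines."""
    counts = Counter()
    for line in lines:
        for match in MARKED_WORD.finditer(line):
            counts[match.group(1)] += 1
    return counts


# Illustrative output lines only (not real translations).
sample_output = [
    "the *үйшік is big",
    "he #come yesterday",
    "@кітап and *қалам",
]

counts = tally_marks(sample_output)
for mark, label in MARKS.items():
    print(f"{mark} ({label}): {counts[mark]}")
```

Running such a tally after each round of dictionary and rule work gives a simple number to drive towards zero for each mark.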

11 July – 17 July

  1. total 22750 stems in dix and 31800 stems in lexc
  2. correcting tags and clean testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

18 July – 24 July

  1. total 23250 stems in dix and 31900 stems in lexc
  2. correcting tags and clean testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

25 July – 31 July

  1. total 23750 stems in dix and 32000 stems in lexc
  2. correcting tags and clean testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

1 August – 7 August

  1. total 24250 stems in dix and 32100 stems in lexc
  2. correcting tags and clean testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

8 August – 14 August

  1. total 24750 stems in dix and 32200 stems in lexc
  2. correcting tags and clean testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

15 – 24 August

  1. total 25250 stems in dix and 32300 stems in lexc
  2. clean up the code and upload the final version
  3. finished language pair
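
The stem targets in the schedule above can be checked with a short script. This is a minimal sketch: the milestone figures are copied from the plan, while the variable and function names are illustrative.

```python
# Cumulative stem targets copied from the schedule: (period end, dix stems, lexc stems).
milestones = [
    ("23 June", 21250, 31500),
    ("1 July", 21750, 31600),
    ("10 July", 22250, 31700),
    ("17 July", 22750, 31800),
    ("24 July", 23250, 31900),
    ("31 July", 23750, 32000),
    ("7 August", 24250, 32100),
    ("14 August", 24750, 32200),
    ("24 August", 25250, 32300),
]


def increments(rows):
    """Return (period end, dix stems added, lexc stems added) for each period after the first."""
    return [
        (end, dix - prev_dix, lexc - prev_lexc)
        for (_, prev_dix, prev_lexc), (end, dix, lexc) in zip(rows, rows[1:])
    ]


steps = increments(milestones)
for end, added_dix, added_lexc in steps:
    print(f"by {end}: +{added_dix} dix stems, +{added_lexc} lexc stems")

total_dix = sum(step[1] for step in steps)
total_lexc = sum(step[2] for step in steps)
print(f"total after 23 June: +{total_dix} dix stems, +{total_lexc} lexc stems")
```

With these figures every period after 23 June adds 500 stems to the dix and 100 stems to lexc, which is somewhat below the ~600 dix stems per week given in the major goals and matches the ~100 lexc stems per week stated there.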