User:Aidana/Proposal


Contact information


Name: Aidana Karibayeva
E-mail address: a.s.karibayeva@gmail.com
Nick on the IRC #apertium channel: Aidana
Username at SourceForge: aidana1

Why is it you are interested in machine translation?

These days, automatic translation of text using machine translation is very important, because it helps people all over the world to understand and access information in foreign languages quickly and easily. Building a machine translation system is an interesting process that draws on knowledge of programming, linguistics and languages; by combining all of these, you can create a machine translation system that is useful to many people.

Why is it that you are interested in the Apertium project?

I have worked on the Apertium platform since 2014, when I started to build a Kazakh-English machine translation system. I chose this platform because it is free and open source, which allowed me to reuse previously developed resources, such as the Kazakh transducer and the English-Kazakh bilingual dictionary.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task “Adopt an unreleased language pair”. The pair I have chosen is Kazakh-English, in the Kazakh→English translation direction. I plan to expand the dictionaries and to create more transfer and lexical selection rules.

Title

Adopt an unreleased Kazakh-English language pair

Reasons why Google and Apertium should sponsor it

I want to improve the current state of the Kazakh-English pair on the Apertium platform, so that it produces higher-quality translations with a bigger vocabulary and more transfer rules. The resources could also be reused in other language pairs where Kazakh is the source language.

A description of how and who it will benefit in society

As I said above, the resources of the developed Kazakh-English machine translation system could be reused in other language pairs; for instance, some of the transfer rules are already used in Kazakh-Russian MT systems. Also, English speakers who are interested in Kazakh culture could use it to understand Kazakh texts, songs and news.

List your skills and give evidence of your qualifications

I am currently a second-year master's student in Information Systems at Al-Farabi Kazakh National University in Kazakhstan. My programming skills include C, C++, C# and shell scripting. I speak Kazakh, Russian and English fluently.

List any non-Summer-of-Code plans you have for the Summer

I have no non-GSoC plans for the summer, so I can spend about 40-50 hours per week on the project.

Initial data: Coding challenge

Evaluation of kaz-to-eng translation using:

   src set "apertium" (1 docs, 43 segs)
   ref set "apertium" (1 refs)
   tst set "apertium" (1 systems)

NIST score = 2.2109 BLEU score = 0.0386 for system "apertium"

Individual N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  2.0950   0.1140   0.0019   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000  "apertium"
BLEU:  0.3333   0.0607   0.0174   0.0063   0.0023   0.0013   0.0007   0.0004   0.0002  "apertium"

Cumulative N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  2.0950   2.2090   2.2109   2.2109   2.2109   2.2109   2.2109   2.2109   2.2109  "apertium"
BLEU:  0.3333   0.1423   0.0706   0.0386   0.0220   0.0137   0.0090   0.0061   0.0043  "apertium"

MT evaluation scorer ended on 2016 Mar 22 at 00:33:31

Statistics about input files


  • Number of words in reference: 520
  • Number of words in test: 439
  • Number of unknown words (marked with a star) in test: 7
  • Percentage of unknown words: 1.59 %

Results when removing unknown-word marks (stars)


  • Edit distance: 507
  • Word error rate (WER): 97.50 %
  • Number of position-independent correct words: 97
  • Position-independent word error rate (PER): 81.35 %

MT evaluation ended on 2016 Mar 24 at 01:05:00
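
For reference, a sketch of how an evaluation like the one above is typically run on an Apertium pair: mteval-v13a.pl is the NIST scoring script behind the BLEU/NIST tables, and apertium-eval-translator computes WER/PER. All file names below are placeholders.

   # translate the source text with the pair in development
   apertium -d . kaz-eng < source.kaz.txt > output.eng.txt

   # BLEU/NIST with the NIST mteval script (expects SGML-wrapped files)
   perl mteval-v13a.pl -s src.sgm -r ref.sgm -t tst.sgm

   # WER/PER with Apertium's evaluation script
   apertium-eval-translator -test output.eng.txt -ref reference.eng.txt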

Coding challenge finished

Results:

  • removed all # (generation errors)
  • added unknown words to the dictionaries
  • added t1x and t2x rules
  • added a constraint grammar (rlx) rule in apertium-kaz (see the sketch below)
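
For illustration, a minimal sketch of the kind of constraint grammar (rlx) rule involved; the tags and context here are assumptions chosen for the example, not the actual rule added:

   # discard the verb reading of a noun/verb-ambiguous word
   # when the following word is unambiguously a verb
   REMOVE (v) IF (0 (n)) (1C (v)) ;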

BLEU and NIST scores:

 Evaluation of kaz-to-eng translation using:
   src set "apertium" (1 docs, 43 segs)
   ref set "apertium" (1 refs)
   tst set "apertium" (1 systems)

NIST score = 4.2546 BLEU score = 0.2206 for system "zzz"


Individual N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  3.6697   0.4962   0.0723   0.0163   0.0000   0.0000   0.0000   0.0000   0.0000  "zzz"
BLEU:  0.6069   0.2476   0.1671   0.1287   0.1065   0.0800   0.0571   0.0345   0.0143  "zzz"

Cumulative N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  3.6697   4.1659   4.2382   4.2546   4.2546   4.2546   4.2546   4.2546   4.2546  "zzz"
BLEU:  0.5615   0.3587   0.2709   0.2206   0.1878   0.1608   0.1372   0.1143   0.0899  "zzz"

WER and PER: Statistics about input files

  • Number of words in reference: 480
  • Number of words in test: 388
  • Number of unknown words (marked with a star) in test: 0
  • Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)


  • Edit distance: 366
  • Word error rate (WER): 76.25 %
  • Number of position-independent correct words: 176
  • Position-independent word error rate (PER): 63.33 %

Work plan

Detecting problems

While doing the coding challenge, I detected some problems in most of the rules:

  1. Wrong prepositions: more disambiguation rules are needed.
  2. Extra prepositions: after running macros, some variables are not cleared and are assigned incorrectly (see the sketch after this list). Partly solved during the coding challenge.
  3. Generation of invalid forms, with # and extra tags such as "<#pres". I plan to detect more of them by using different corpora; these errors appear because of wrong t1x rules.
  4. Some interchunk rules do not handle subject-verb agreement correctly.
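
For problem 2, the usual fix is to reset a transfer variable at the top of the macro that sets it. A minimal t1x sketch, assuming a variable prep and an attribute category cas declared in the def-vars and def-attrs sections (all three names are assumptions for the example):

   <def-macro n="set_prep" npar="1">
     <!-- clear the variable first, so a value left over from a
          previous call cannot leak into the current sentence -->
     <let><var n="prep"/><lit v=""/></let>
     <choose><when>
       <test><equal>
         <clip pos="1" side="tl" part="cas"/><lit-tag v="loc"/>
       </equal></test>
       <!-- locative case: translate with the preposition "in" -->
       <let><var n="prep"/><lit v="in"/></let>
     </when></choose>
   </def-macro>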

To begin with, I plan to use a Kazakh-English bilingual corpus consisting of 5,625 sentences.

Major goals

I plan to work on transfer rules and on the vocabulary. To be more precise, my goals are:

  • creating a testvoc
  • cleaning the testvoc (see the sketch after this list)
  • reaching ~25,000 stems in the bidix (~600 stems per week, or ~100 per day) and ~32,000 stems in lexc (~100 stems per week, or ~20 per day)
  • writing additional rules: transfer, lexical selection and constraint grammar (~10 per week, or ~2 per day)
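
The idea behind a testvoc ("test vocabulary") is that everything the analyser can produce should pass through the bilingual dictionary and the generator without errors. A rough corpus-based sketch, assuming the pair is compiled in the current directory; corpus.kaz.txt is a placeholder for any Kazakh plain-text corpus:

   # translate the corpus, then count each kind of debug symbol:
   #   *word  unknown to the analyser
   #   @word  analysed, but missing from the bilingual dictionary
   #   #word  translated, but the generator cannot produce a form
   apertium -d . kaz-eng < corpus.kaz.txt > corpus.eng.txt
   grep -o '[*@#][^ ]*' corpus.eng.txt | sort | uniq -c | sort -rn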

Detailed work plan

April – 1 June

  1. creating files for each POS with morphological analyses for the testvoc
  2. running a first testvoc with these files
  3. writing ≥3 lexical selection rules (see the example after this list)
  4. writing ≥3 transfer rules
  5. writing ≥1 disambiguation rule
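
For illustration, a sketch of what one lexical selection rule could look like in the .lrx format of apertium-lex-tools; the lemma and translations are assumptions chosen for the example:

   <rules>
     <!-- Kazakh "тіл" can translate as "language" or "tongue";
          prefer "language" by default -->
     <rule weight="1.0">
       <match lemma="тіл" tags="n.*"><select lemma="language"/></match>
     </rule>
   </rules>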

1 June – 23 June

  1. total 21,250 stems in the bidix and 31,500 stems in lexc
  2. adding unknown words from Kazakh to English (see the entry sketches after this list)
  3. adding transfer rules
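
For illustration, sketches of the two kinds of entries being counted: a bidix entry pairing a Kazakh lemma with an English one, and a lexc entry adding the stem to the Kazakh analyser (the continuation class N1 follows apertium-kaz conventions but is an assumption here):

   <!-- bidix (.dix): Kazakh "кітап" <-> English "book", both nouns -->
   <e><p><l>кітап<s n="n"/></l><r>book<s n="n"/></r></p></e>

   ! lexc: the corresponding stem in the Kazakh transducer
   кітап:кітап N1 ; ! "book"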

24 June – 1 July

  1. total 21,750 stems in the bidix and 31,600 stems in lexc
  2. adding unknown words (those marked with *) to the dictionaries
  3. adding transfer rules
  4. adding lexical selection rules

2 July – 10 July

  1. total 22,250 stems in the bidix and 31,700 stems in lexc
  2. cleaning the testvoc for #
  3. adding transfer rules
  4. correcting existing transfer rules

11 July – 17 July

  1. total 22,750 stems in the bidix and 31,800 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

18 July – 24 July

  1. total 23,250 stems in the bidix and 31,900 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

25 July – 31 July

  1. total 23,750 stems in the bidix and 32,000 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

1 August – 7 August

  1. total 24,250 stems in the bidix and 32,100 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

8 August – 14 August

  1. total 24,750 stems in the bidix and 32,200 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

15 – 24 August

  1. total 25,250 stems in the bidix and 32,300 stems in lexc
  2. cleaning up the code and uploading the final version
  3. delivering the finished language pair