User:Aidana/Proposal


Contact information


Name: Aidana Karibayeva
E-mail address: a.s.karibayeva@gmail.com
Nick on the IRC #apertium channel: Aidana
Username at SourceForge: aidana1

Why is it you are interested in machine translation?

These days, automatic translation of text using machine translation is very important, because it helps people all over the world to understand and access information in foreign languages quickly and easily. Building a machine translation system is an interesting process that draws on knowledge of programming, linguistics and languages; by combining all of these, you can create a machine translation system that is useful to many people.

Why is it that you are interested in the Apertium project?

I have worked on the Apertium platform since 2014, when I started to build a Kazakh-English machine translation system. I chose this platform because it is free and open source, which allowed me to reuse previously developed resources, such as the Kazakh transducer and the English-Kazakh bilingual dictionary.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task “Adopt an unreleased language pair”. The pair I have chosen is Kazakh-English, in the Kazakh→English translation direction. I plan to expand the dictionaries and to create more transfer and lexical selection rules.

Title

Adopt an unreleased Kazakh-English language pair

Reasons why Google and Apertium should sponsor it

I want to improve the current state of the Kazakh-English pair on the Apertium platform, so that it produces higher-quality translations with a bigger vocabulary and more transfer rules. The resources could also be reused in other language pairs where Kazakh is the source language.

A description of how and who it will benefit in society

As I said above, the resources of the developed Kazakh-English machine translation system could be reused in other language pairs; for instance, some of the transfer rules are already used in Kazakh-Russian MT systems. Also, English speakers who are interested in Kazakh culture could use it to understand Kazakh texts, songs and news.

List your skills and give evidence of your qualifications

I am currently a second-year master's student in Information Systems at Al-Farabi Kazakh National University in Kazakhstan. My programming skills include C, C++, C# and shell scripting. I speak Kazakh, Russian and English fluently.

List any non-Summer-of-Code plans you have for the Summer

I have no non-GSoC plans for the summer, so I can spend about 40-50 hours per week on the project.

Initial data: Coding challenge

Evaluation of kaz-to-eng translation using:

   src set "apertium" (1 docs, 43 segs)
   ref set "apertium" (1 refs)
   tst set "apertium" (1 systems)

NIST score = 2.2109 BLEU score = 0.0386 for system "apertium"

Individual N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  2.0950   0.1140   0.0019   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000  "apertium"
BLEU:  0.3333   0.0607   0.0174   0.0063   0.0023   0.0013   0.0007   0.0004   0.0002  "apertium"

Cumulative N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  2.0950   2.2090   2.2109   2.2109   2.2109   2.2109   2.2109   2.2109   2.2109  "apertium"
BLEU:  0.3333   0.1423   0.0706   0.0386   0.0220   0.0137   0.0090   0.0061   0.0043  "apertium"

MT evaluation scorer ended on 2016 Mar 22 at 00:33:31

Statistics about input files


  • Number of words in reference: 520
  • Number of words in test: 439
  • Number of unknown words (marked with a star) in test: 7
  • Percentage of unknown words: 1.59 %

Results when removing unknown-word marks (stars)


  • Edit distance: 507
  • Word error rate (WER): 97.50 %
  • Number of position-independent correct words: 97
  • Position-independent word error rate (PER): 81.35 %

MT evaluation ended on 2016 Mar 24 at 01:05:00
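
For reference, a sketch of how an evaluation like the one above is typically run on an Apertium pair: mteval-v13a.pl is the NIST scoring script behind the BLEU/NIST tables, and apertium-eval-translator computes WER/PER. All file names below are placeholders.

   # translate the source text with the pair in development
   apertium -d . kaz-eng < source.kaz.txt > output.eng.txt

   # BLEU/NIST with the NIST mteval script (expects SGML-wrapped files)
   perl mteval-v13a.pl -s src.sgm -r ref.sgm -t tst.sgm

   # WER/PER with Apertium's evaluation script
   apertium-eval-translator -test output.eng.txt -ref reference.eng.txt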

Coding challenge finished

Results:

  • removed all # (generation errors)
  • added unknown words to the dictionaries
  • added t1x and t2x rules
  • added a constraint grammar (rlx) rule in apertium-kaz (see the sketch below)
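
For illustration, a minimal sketch of the kind of constraint grammar (rlx) rule involved; the tags and context here are assumptions chosen for the example, not the actual rule added:

   # discard the verb reading of a noun/verb-ambiguous word
   # when the following word is unambiguously a verb
   REMOVE (v) IF (0 (n)) (1C (v)) ;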

BLEU and NIST scores:

 Evaluation of kaz-to-eng translation using:
   src set "apertium" (1 docs, 43 segs)
   ref set "apertium" (1 refs)
   tst set "apertium" (1 systems)

NIST score = 4.2546 BLEU score = 0.2206 for system "zzz"


Individual N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  3.6697   0.4962   0.0723   0.0163   0.0000   0.0000   0.0000   0.0000   0.0000  "zzz"
BLEU:  0.6069   0.2476   0.1671   0.1287   0.1065   0.0800   0.0571   0.0345   0.0143  "zzz"

Cumulative N-gram scoring

       1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
       ------   ------   ------   ------   ------   ------   ------   ------   ------
NIST:  3.6697   4.1659   4.2382   4.2546   4.2546   4.2546   4.2546   4.2546   4.2546  "zzz"
BLEU:  0.5615   0.3587   0.2709   0.2206   0.1878   0.1608   0.1372   0.1143   0.0899  "zzz"

WER and PER: Statistics about input files

  • Number of words in reference: 480
  • Number of words in test: 388
  • Number of unknown words (marked with a star) in test: 0
  • Percentage of unknown words: 0.00 %

Results when removing unknown-word marks (stars)


  • Edit distance: 366
  • Word error rate (WER): 76.25 %
  • Number of position-independent correct words: 176
  • Position-independent word error rate (PER): 63.33 %

Work plan

Detecting problems

While doing the coding challenge, I detected some problems in most of the rules:

  1. Wrong prepositions: more disambiguation rules are needed.
  2. Extra prepositions: after running macros, some variables are not cleared and are assigned incorrectly (see the sketch after this list). Partly solved during the coding challenge.
  3. Generation of invalid forms, with # and extra tags such as "<#pres". I plan to detect more of them by using different corpora; these errors appear because of wrong t1x rules.
  4. Some interchunk rules do not handle subject-verb agreement correctly.
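
For problem 2, the usual fix is to reset a transfer variable at the top of the macro that sets it. A minimal t1x sketch, assuming a variable prep and an attribute category cas declared in the def-vars and def-attrs sections (all three names are assumptions for the example):

   <def-macro n="set_prep" npar="1">
     <!-- clear the variable first, so a value left over from a
          previous call cannot leak into the current sentence -->
     <let><var n="prep"/><lit v=""/></let>
     <choose><when>
       <test><equal>
         <clip pos="1" side="tl" part="cas"/><lit-tag v="loc"/>
       </equal></test>
       <!-- locative case: translate with the preposition "in" -->
       <let><var n="prep"/><lit v="in"/></let>
     </when></choose>
   </def-macro>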

To begin with, I plan to use a Kazakh-English bilingual corpus consisting of 5,625 sentences.

Major goals

I plan to work on transfer rules and on the vocabulary. To be more precise, my goals are:

  • creating a testvoc
  • cleaning the testvoc (see the sketch after this list)
  • reaching ~25,000 stems in the bidix (~600 stems per week, or ~100 per day) and ~32,000 stems in lexc (~100 stems per week, or ~20 per day)
  • writing additional rules: transfer, lexical selection and constraint grammar (~10 per week, or ~2 per day)
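
The idea behind a testvoc ("test vocabulary") is that everything the analyser can produce should pass through the bilingual dictionary and the generator without errors. A rough corpus-based sketch, assuming the pair is compiled in the current directory; corpus.kaz.txt is a placeholder for any Kazakh plain-text corpus:

   # translate the corpus, then count each kind of debug symbol:
   #   *word  unknown to the analyser
   #   @word  analysed, but missing from the bilingual dictionary
   #   #word  translated, but the generator cannot produce a form
   apertium -d . kaz-eng < corpus.kaz.txt > corpus.eng.txt
   grep -o '[*@#][^ ]*' corpus.eng.txt | sort | uniq -c | sort -rn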

Detailed work plan

April – 1 June

  1. creating files for each POS with morphological analyses for the testvoc
  2. running a first testvoc with these files
  3. writing ≥3 lexical selection rules (see the example after this list)
  4. writing ≥3 transfer rules
  5. writing ≥1 disambiguation rule
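
For illustration, a sketch of what one lexical selection rule could look like in the .lrx format of apertium-lex-tools; the lemma and translations are assumptions chosen for the example:

   <rules>
     <!-- Kazakh "тіл" can translate as "language" or "tongue";
          prefer "language" by default -->
     <rule weight="1.0">
       <match lemma="тіл" tags="n.*"><select lemma="language"/></match>
     </rule>
   </rules>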

1 June – 23 June

  1. total 21,250 stems in the bidix and 31,500 stems in lexc
  2. adding unknown words from Kazakh to English (see the entry sketches after this list)
  3. adding transfer rules
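
For illustration, sketches of the two kinds of entries being counted: a bidix entry pairing a Kazakh lemma with an English one, and a lexc entry adding the stem to the Kazakh analyser (the continuation class N1 follows apertium-kaz conventions but is an assumption here):

   <!-- bidix (.dix): Kazakh "кітап" <-> English "book", both nouns -->
   <e><p><l>кітап<s n="n"/></l><r>book<s n="n"/></r></p></e>

   ! lexc: the corresponding stem in the Kazakh transducer
   кітап:кітап N1 ; ! "book"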

24 June – 1 July

  1. total 21,750 stems in the bidix and 31,600 stems in lexc
  2. adding unknown words (those marked with *) to the dictionaries
  3. adding transfer rules
  4. adding lexical selection rules

2 July – 10 July

  1. total 22,250 stems in the bidix and 31,700 stems in lexc
  2. cleaning the testvoc for #
  3. adding transfer rules
  4. correcting existing transfer rules

11 July – 17 July

  1. total 22,750 stems in the bidix and 31,800 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

18 July – 24 July

  1. total 23,250 stems in the bidix and 31,900 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

25 July – 31 July

  1. total 23,750 stems in the bidix and 32,000 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

1 August – 7 August

  1. total 24,250 stems in the bidix and 32,100 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

8 August – 14 August

  1. total 24,750 stems in the bidix and 32,200 stems in lexc
  2. correcting tags and cleaning the testvoc for #
  3. adding transfer rules
  4. adding lexical selection rules

15 – 24 August

  1. total 25,250 stems in the bidix and 32,300 stems in lexc
  2. cleaning up the code and uploading the final version
  3. delivering the finished language pair