User:Aidana/Proposal
Revision as of 19:16, 24 March 2016
Contents
- 1 Contact information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in? What do you plan to do?
- 5 Initial data: Coding challenge
- 6 MT evaluation ended on 2016 Mar 24 at 01:05:00
- 7 Coding challenge finished
- 8 Work plan
Contact information
Name: Aidana Karibayeva
E-mail address: a.s.karibayeva@gmail.com
Nick on the IRC #apertium channel: Aidana
Username at sourceforge: aidana1
Why is it you are interested in machine translation?
These days, translating text automatically with machine translation is very important: it helps people all over the world understand information in a foreign language quickly and easily. Building a machine translation system is a very interesting process that requires knowledge of programming, linguistics and languages; by combining all of these, you can create a machine translation system that is useful to many people.
Why is it that you are interested in the Apertium project?
I have worked on the Apertium platform since 2014, when I started to build a Kazakh-English machine translation system. I chose this platform because it is free and open-source, so I was able to reuse previously developed resources, such as the Kazakh transducer and the English-Kazakh bilingual dictionary.
Which of the published tasks are you interested in? What do you plan to do?
I am interested in the task “Adopt an unreleased language pair”. The pair I have chosen is Kazakh-English, in the Kazakh→English translation direction. I plan to expand the dictionaries and to create more transfer and lexical selection rules.
Title
Adopt an unreleased Kazakh-English language pair
Reasons why Google and Apertium should sponsor it
I want to improve the current state of the Kazakh-English pair on the Apertium platform so that it produces higher-quality translations, with a larger vocabulary and more transfer rules. The resources can also be reused in other language pairs where Kazakh is the source language.
A description of how and who it will benefit in society
As noted above, the resources of the developed Kazakh-English machine translation system can be reused in other language pairs; for instance, its transfer rules are already used in the Kazakh-Russian MT system. English speakers who are interested in Kazakh culture could also use it to understand Kazakh texts, songs and news.
List your skills and give evidence of your qualifications
I am currently a second-year master's student in Information Systems at Al-Farabi Kazakh National University in Kazakhstan. My programming skills include C, C++, C# and shell scripting. I speak Kazakh, Russian and English fluently.
List any non-Summer-of-Code plans you have for the Summer
I have no non-GSoC plans for the summer and I can spend about 40-50 hours for a week
Initial data: Coding challenge
Evaluation of kaz-to-eng translation using:
src set "apertium" (1 docs, 43 segs) ref set "apertium" (1 refs) tst set "apertium" (1 systems)
NIST score = 2.2109 BLEU score = 0.0386 for system "apertium"
Individual N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
 NIST:  2.0950  0.1140  0.0019  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  "apertium"
 BLEU:  0.3333  0.0607  0.0174  0.0063  0.0023  0.0013  0.0007  0.0004  0.0002  "apertium"
Cumulative N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
 NIST:  2.0950  2.2090  2.2109  2.2109  2.2109  2.2109  2.2109  2.2109  2.2109  "apertium"
 BLEU:  0.3333  0.1423  0.0706  0.0386  0.0220  0.0137  0.0090  0.0061  0.0043  "apertium"
MT evaluation scorer ended on 2016 Mar 22 at 00:33:31
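The cumulative scores above are, apart from the brevity penalty, the geometric mean of the individual modified n-gram precisions; for example, the geometric mean of 0.3333, 0.0607, 0.0174 and 0.0063 is 0.0386, matching the cumulative 4-gram BLEU. A minimal single-segment, single-reference sketch of this computation (not the actual mteval scorer, which aggregates counts over all segments):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(ref, hyp, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited
    at most as many times as it occurs in the reference."""
    ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
    overlap = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def cumulative_bleu(ref, hyp, max_n=4):
    """BLEU = brevity penalty * geometric mean of 1..max_n precisions."""
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    precisions = [modified_precision(ref, hyp, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```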
Statistics about input files
- Number of words in reference: 520
- Number of words in test: 439
- Number of unknown words (marked with a star) in test: 7
- Percentage of unknown words: 1.59 %
Results when removing unknown-word marks (stars)
- Edit distance: 507
- Word error rate (WER): 97.50 %
- Number of position-independent correct words: 97
- Position-independent word error rate (PER): 81.35 %
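The WER and PER figures above follow directly from their definitions: WER is the word-level edit distance divided by the reference length (507/520 = 97.50 %), and PER is one minus the fraction of position-independent correct words (1 - 97/520 = 81.35 %). A rough sketch of both metrics (not the exact evaluation script used above):

```python
from collections import Counter

def edit_distance(ref, hyp):
    """Levenshtein distance over word lists, single-row dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # delete reference word
                                   d[j - 1] + 1,      # insert hypothesis word
                                   prev + (r != h))   # substitute (or match)
    return d[len(hyp)]

def wer(ref, hyp):
    """Word error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

def per(ref, hyp):
    """Position-independent word error rate: bag-of-words overlap."""
    correct = sum((Counter(ref) & Counter(hyp)).values())
    return 1 - correct / len(ref)
```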
MT evaluation ended on 2016 Mar 24 at 01:05:00
Coding challenge finished
Results:
- removed all #
- added unknown words
- added t1x and t2x rules
- added an rlx rule in apertium-kaz
BLEU:
Evaluation of kaz-to-eng translation using: src set "apertium" (1 docs, 43 segs) ref set "apertium" (1 refs) tst set "apertium" (1 systems)
NIST score = 4.2546 BLEU score = 0.2206 for system "zzz"
------------------------------------------------------------------------
Individual N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
 NIST:  3.6697  0.4962  0.0723  0.0163  0.0000  0.0000  0.0000  0.0000  0.0000  "zzz"
 BLEU:  0.6069  0.2476  0.1671  0.1287  0.1065  0.0800  0.0571  0.0345  0.0143  "zzz"
------------------------------------------------------------------------
Cumulative N-gram scoring
        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
        ------  ------  ------  ------  ------  ------  ------  ------  ------
 NIST:  3.6697  4.1659  4.2382  4.2546  4.2546  4.2546  4.2546  4.2546  4.2546  "zzz"
 BLEU:  0.5615  0.3587  0.2709  0.2206  0.1878  0.1608  0.1372  0.1143  0.0899  "zzz"
WER:
Statistics about input files
- Number of words in reference: 480
- Number of words in test: 388
- Number of unknown words (marked with a star) in test: 0
- Percentage of unknown words: 0.00 %
Results when removing unknown-word marks (stars)
- Edit distance: 366
- Word error rate (WER): 76.25 %
- Number of position-independent correct words: 176
- Position-independent word error rate (PER): 63.33 %
Work plan
Major goals
I plan to work on transfer rules and on vocabulary. More precisely, I plan to:
- Create a testvoc
- Clean the testvoc
- Reach ~25000 stems in the bidix (~600 stems per week, or ~100 per day) and ~32000 stems in lexc (~100 stems per week, or ~20 per day)
- Add rules: transfer, lexical selection, constraint grammar (~10 per week, or ~2 per day)
Detailed work plan
April – 1 June
- Create files for each POS with morph.analysis to testvoc
- run first testvoc with these files
- write ≥3 lexical selection rules
- write ≥3 transfer rules
- write ≥1 disambiguation rule
1 June – 23 June
- total 21250 stems in dix and 31500 stems in lexc
- adding unknown words from Kazakh to English
- adding transfer rules
24 June – 1 July
- total 21750 stems in dix and 31600 stems in lexc
- adding unknown words with *
- adding transfer rules
- adding lexical selection rules
2 July – 10 July
- total 22250 stems in dix and 31700 stems in lexc
- clean testvoc for #
- adding transfer rules
- correcting existing transfer rules
11 July – 17 July
- total 22750 stems in dix and 31800 stems in lexc
- correcting tags and clean testvoc for #
- adding transfer rules
- adding lexical selection rules
18 July – 24 July
- total 23250 stems in dix and 31900 stems in lexc
- correcting tags and clean testvoc for #
- adding transfer rules
- adding lexical selection rules
25 July – 31 July
- total 23750 stems in dix and 32000 stems in lexc
- correcting tags and clean testvoc for #
- adding transfer rules
- adding lexical selection rules
1 August – 7 August
- total 24250 stems in dix and 32100 stems in lexc
- correcting tags and clean testvoc for #
- adding transfer rules
- adding lexical selection rules
8 August – 14 August
- total 24750 stems in dix and 32200 stems in lexc
- correcting tags and clean testvoc for #
- adding transfer rules
- adding lexical selection rules
15 August – 24 August
- total 25250 stems in dix and 32300 stems in lexc
- clean up the code and upload the final version
- Finished language pair