User:Aida/Application
Latest revision as of 05:24, 13 May 2014
Contact information
Name: Aida Sundetova
E-mail address: sun27aida@gmail.com
Nick on the IRC #apertium channel: Aida
Username at sourceforge (http://www.sourceforge.net): aida27
Why is it you are interested in machine translation?
I started learning machine translation in 2012, when I joined a project to develop English-to-Kazakh machine translation as part of my work in the Intelligence Information Systems laboratory, led by Professor U. Tukeyev, at the Research Institute of Mechanics and Mathematics. Before that, I was already very interested in artificial intelligence and process automation. I continued developing English-Kazakh machine translation on Apertium and set out to learn more about Apertium as a free/open-source machine translation platform. My knowledge of the languages, my programming skills and my aim of improving translation quality helped me to learn XML and improve my working skills.
Why is it that they are interested in the Apertium project?
First of all, Apertium is a free/open-source machine translation platform, which means that developers from other countries, like me, can join and start building translators for new language pairs. Apertium uses Unix “pipelines”, which are very useful for fast diagnosis and debugging and which let me insert additional modules between existing ones, for example HFST (Helsinki Finite-State Technology) for morphological analysis and generation of Kazakh.
Which of the published tasks are you interested in? What do you plan to do?
I plan to improve the English-Kazakh language pair until it reaches good translation quality. I have already developed this pair, but it lacks corpora and has too little vocabulary to produce adequate translations. My aim is to increase vocabulary coverage using corpora from news and Wikipedia, and to bring the pair close to a working translator.
Title
Adopting unreleased English-Kazakh language pair
Reasons why Google and Apertium should sponsor it
English-to-Kazakh machine translation is important because Kazakh is a Turkic language, so the transfer and constraint grammar rules that I write for this pair can be reused when developing other English-Turkic language pairs.
How and who it will benefit in society
Kazakh speakers, about 16 million people, can use English-Kazakh translation to understand English texts such as news and papers. As part of this work, the Kazakh-English language pair will also be improved by enlarging the bilingual dictionary, so English speakers will be able to use the translator to understand texts in Kazakh.
List your skills and give evidence of your qualifications
I have been developing English-to-Kazakh machine translation on Apertium since 2012 and Kazakh-to-English since 2013. I have extensive experience in writing and correcting transfer rules, lexical selection rules and constraint grammar rules, and in adding vocabulary to monolingual and bilingual dictionaries. I am a fourth-year bachelor's student at Kazakh National University (Almaty, Kazakhstan); my major is Information Systems and I will graduate in summer 2014. I know C, C++, C#, HTML and XML, and I have basic knowledge of PHP. In addition, I can work with databases and know SQL. My mother tongue is Kazakh, and I also speak English and Russian.
List any non-Summer-of-Code plans you have for the Summer
Before 20 June I have final exams, but I will have about 3 hours a day to work on the project. After graduating from the university I will be free and can spend 30 hours a week on Apertium.
My plan
Major goals
I plan to work more on vocabulary and to add transfer rules where needed. More precisely, I plan to reach:
- Good WER
- Clean testvoc
- ~10000 stems in bidix (~600 stems per week, or ~100 per day)
- Additional rules: transfer, lexical selection, constraint grammar (~10 per week, or ~2 per day)
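As a rough check of the pace stated in the goals above, the following minimal sketch computes how many weeks a given weekly rate needs to reach a target dictionary size, and how many working days per week the daily rate implies. The function names are hypothetical, and the starting size of 3660 stems used in the usage note is an assumed input that should be replaced with the current bidix count.

```python
import math


def weeks_to_target(start_stems: int, target_stems: int, stems_per_week: int) -> int:
    """Whole weeks needed to grow a dictionary from start_stems to target_stems."""
    if stems_per_week <= 0:
        raise ValueError("stems_per_week must be positive")
    remaining = max(0, target_stems - start_stems)
    return math.ceil(remaining / stems_per_week)


def working_days_per_week(stems_per_week: int, stems_per_day: int) -> float:
    """Working days per week implied by a weekly and a daily rate."""
    if stems_per_day <= 0:
        raise ValueError("stems_per_day must be positive")
    return stems_per_week / stems_per_day


if __name__ == "__main__":
    # 3660 is an assumed starting size; 10000, 600 and 100 come from the goals list.
    print(weeks_to_target(3660, 10000, 600))
    print(working_days_per_week(600, 100))
```

With these inputs, roughly 600 stems per week corresponds to six working days at 100 stems per day, and the ~10000 target is approached rather than exactly reached within a ten-week coding period.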
Schedule
Timeline
See the GSoC 2014 Timeline (http://www.google-melange.com/gsoc/events/google/gsoc2014) for the complete timeline. Important coding dates follow:
- April 22nd: begin working on project
- June 27th - August 17th: midterm evaluations
- August 18th: 'pencils down' date
- August 22nd: final evaluation
Workplan
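| week | dates | goals | notes |
|---|---|---|---|
| | post-application period, 23 March - 17 April | finish coding challenge with WER ~55%; total 3660 stems in dix | |
| | 17 April - 1 June | write ≥4 lexical selection rules; write ≥3 transfer rules; write ≥4 disambiguation rules; note: should be in IRC every day | |
| 1 | 1 - 22 June | total 4500 stems in dix; adding unknown words <postadv> <ij> <adv>; adding transfer rules; 500-word evaluation, WER ~52% | |
| 2 | 23 - 29 June | total 5100 stems in dix; adding unknown words <num> <post> <prn> <det>; adding transfer rules | |
| 3 | 30 June - 6 July | total 5700 stems in dix; adding unknown words <cnjcoo> <cnjadv> <cnjsub>; adding transfer rules | |
| 4 | 7 - 13 July | | |
| 5 | 14 - 20 July | total 6900 stems in dix; clean testvoc for #; adding transfer rules | |
| 6 | 21 - 27 July | total 7300 stems in dix; correcting tags and clean testvoc for #; adding transfer rules | |
| 7 | 28 July - 3 August | total 7900 stems in dix; adding transfer rules | |
| 8 | 4 - 10 August | total 8500 stems in dix; correcting tags and clean testvoc for #; adding transfer rules | |
| 9 | 11 - 17 August | total 9100 stems in dix; adding transfer rules | |
| 10 | 18 - 22 August | total 9700 stems in dix; finish with WER ~50%; adding transfer rules | |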
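The WER (word error rate) targets in this plan can be read as the word-level edit distance between the system output and a reference translation, divided by the number of reference words. The sketch below is a minimal illustration under that assumption; it is not the evaluation tool used by Apertium, the function name is hypothetical, and the English sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one missing word against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of ~50% means that about half of the reference words would have to be inserted, deleted or substituted to turn the system output into the reference.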