Difference between revisions of "User:Aida/Application"

From Apertium
 
(40 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==Contact information==
*Name: Sundetova Aida
<br>'''Name''':Aida Sundetova
<br>'''E-mail address''': sun27aida@gmail.com
<br>'''Nick on the IRC #apertium channel''': Aida
<br>'''Username at [http://www.sourceforge.net sourceforge]''': aida27


==Why is it you are interested in machine translation?==
*E-mail address: sun27aida@gmail.com
I started learning machine translation in 2012, when I joined a project to develop English-to-Kazakh machine translation as part of my work in the Intelligence Information Systems laboratory, led by professor Tukeyev U., at the Research Institute of Mechanics and Mathematics. Before that I was already interested in artificial intelligence and process automation. I continued developing English-Kazakh machine translation on Apertium and set out to learn more about the Apertium free/open-source machine translation platform. My knowledge of the languages, my programming experience and my goal of improving translation quality helped me learn XML and improve my working skills.


==Why is it that they are interested in the Apertium project?==
*Other information that may be useful to contact you: nick on the #apertium channel: Aida
First, Apertium is a free/open-source machine translation platform, which means that developers from other countries, like me, can join and start building translators for new language pairs. Apertium also uses Unix “pipelines”, which are very useful for fast diagnosis and debugging and let me insert additional modules between existing ones, such as HFST (Helsinki Finite-State Transducer Technology) for morphological analysis and generation of Kazakh.
==Which of the published tasks are you interested in? What do you plan to do?==
I plan to improve the English-Kazakh language pair until it reaches good translation quality. I have already developed this pair, but it lacks corpora and does not have enough vocabulary to produce adequate translations. My goal is to increase vocabulary coverage by using news and Wikipedia corpora and to come close to a working translator.
===Title===
'''Adopting unreleased English-Kazakh language pair'''


===Reasons why Google and Apertium should sponsor it===
*Why is it you are interested in machine translation?
English-to-Kazakh machine translation is very important. Because Kazakh is a Turkic language, the transfer and constraint grammar rules that I write for this pair can be useful for developing other English-Turkic language pairs.
**I have started to learn machine translation in 2012 when I joined to a project, which included developing machine translation from English to Kazakh. Before it I was really interested in artificial intelligence and automation of processes. I continued developing english-kazakh machine translation on Apertium and tried to know more about Apertium free/open-source machine translation platform. Knowledge of the languages, programming and my target to do translation better helped me to learn new programming language as XML and improve my working skills.
===How and who it will benefit in society===
*Why is it that they are interested in the Apertium project?
Kazakh speakers, about 16 million people, can use English-Kazakh translation to understand English texts such as news and papers. In addition, as part of this work the Kazakh-English language pair will be improved by enlarging the bilingual dictionary, so English speakers could use the translator to understand texts in Kazakh.
**At the first, Apertium is free/open-source machine translation platform, which means that developers from other countries like me can join and start to do translations for new language pair. Apertium uses Unix “pipelines” which are very useful for fast diagnosis and debugging and I can use additional modules between existing modules, like using HFST(Helsinki finite-state transducer) for morphological analysis and generation for Kazakh language.
*Which of the published tasks are you interested in? What do you plan to do?
**I plan to improve “Apertium English-kazakh” to reach a good translation quality. I already develop this pair, but it doesn't have corpora and enough vocabulary to show adequate translation. My target is to make vocabulary coverage bigger than now by using corpora from news and wikipedia and come close to working translator.
*Include a proposal, including
** a title, --what title??
** reasons why Google and Apertium should sponsor it,
* a description of how and who it will benefit in society,
**English to Kazakh machine translation are very important, because Kazakh is Turkic language, so transfer rules, which I write for this pair can be useful for another English – Turkic Languages pairs.
* and a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.
**I plan to work more on vocabulary, and add transfer rules is it needed.
*List your skills and give evidence of your qualifications. Tell us what is your current field of study,
major, etc. Convince us that you can do the work. In particular we would like to know whether you
have programmed before in open-source projects.
**I have developed a English to Kazakh machine translation on Apertium since 2012 and Kazakh to English since 2013. I have great experience in writing and correcting transfer rules, lexical selection and constraint grammar rules, also adding vocabulary in monolingual and bilingual dictionaries. I study on 4th grade of bachelor's degree in Information Systems and I will graduate in the summer 2014. I know programming languages: C, C++, C#, HTML, XML, and I have basic knowledge of PHP. In addition, I can work with databases and know SQL. My mother language is Kazakh and I also speak English and Russian.


===List your skills and give evidence of your qualifications===
*List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have at least 30 free hours a week to develop for our project.
I have been developing English-to-Kazakh machine translation on Apertium since 2012 and Kazakh-to-English since 2013. I have extensive experience in writing and correcting transfer rules, lexical selection rules and constraint grammar rules, as well as adding vocabulary to monolingual and bilingual dictionaries. I am a fourth-year bachelor's student at Kazakh National University (Almaty, Kazakhstan); my major is Information Systems and I will graduate in summer 2014. I know the following languages: C, C++, C#, HTML and XML, and I have basic knowledge of PHP. In addition, I can work with databases and know SQL. My native language is Kazakh, and I also speak English and Russian.
**Before 20 of June I have final exams, but I will have about 3 hours a day to work on project. After graduating from the university I will be free and can spend 30 hours a week on Apertium.


===List any non-Summer-of-Code plans you have for the Summer===
== Major goals ==
Before 20 June I have final exams, but I will have about 3 hours a day to work on the project. After graduating from the university I will be free and can spend 30 hours a week on Apertium.

==My plan==
===Major goals===
I plan to work more on vocabulary and add transfer rules where needed. More precisely, I plan to reach:
* Good WER
* Good WER
* Clean testvoc
* Clean testvoc
* 8'000 stems in bidix (~700 stems per week, or ~100 per day)
* ~10000 stems in bidix (~600 stems per week, or ~100 per day)
* Additional rules: transfer, lexical, constraint grammar (~10 per week, or ~2 per day)
* Additional rules: transfer, lexical, constraint grammar (~10 per week, or ~2 per day)


== Schedule ==
=== Schedule ===
=== Timeline ===
==== Timeline ====
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline. Important coding dates follow:
See [http://www.google-melange.com/gsoc/events/google/gsoc2014 GSoC 2014 Timeline] for complete timeline. Important coding dates follow:
* April 22nd: begin working on project
* April 22nd: begin working on project
Line 40: Line 42:
* August 22nd: final evaluation
* August 22nd: final evaluation


=== Workplan ===
==== Workplan ====
{|class="wikitable"
{|class="wikitable"
! week
! week
! dates
! dates
!style="width: 25%"| goals
!style="width: 25%"| goals

! eval

!style="width: 25%"| accomplishments
!style="width: 35%"| notes
!style="width: 35%"| notes
|-
|-
!colspan="2" style="text-align: right"|post-application period<br />23 March - 17 April
!colspan="2" style="text-align: right"|post-application period<br />23 March - 17 April
|
|
# finish coding challenge with WER ~30%
# finish coding challenge with WER ~55%
# total 300 stems in dix
# total 3660 stems in dix

| {{Workeval5|4}}
|
# ---
# ---
|
|
* Demonstrated ability to add stems to dix and lexc.
* Demonstrated ability to add stems to dix and lexc.
* A couple easy lexical selection rules are still not written.
* A couple easy lexical selection rules are still not written.
* Needs for transfer rules in eng-kaz.t2x
* Needs for transfer rules in eng-kaz.t1x
|-
|-
!colspan="2" style="text-align: right"|community bonding period<br />17 April - 1 June
!colspan="2" style="text-align: right"| 17 April - 1 June
|
|
# run first testvoc
# run first testvoc
# run coverage scripts
# get first frequency lists
# get first frequency lists
# write ≥4 lexical selection rules
# write ≥4 lexical selection rules
# write ≥3 transfer rules
# write ≥3 transfer rules
# write ≥4 disambig rules
# write ≥4 disambiguation rules
note: should be in IRC every day
note: should be in IRC every day

| {{Workeval5|3}}
|
# —
# --
# --
# --
# --
# —
|
|
* demonstrated ability to work with lexical selection rules
* demonstrated ability to work with lexical selection rules
* Needs for lexical rules in eng-kaz.lrx
* demonstrated ability to work with transfer rules
* demonstrated ability to work with constraint grammar rules
* got only some experience with coverage scripts
* got only some experience with coverage scripts


Line 90: Line 80:
!style="text-align: right"| 1 - 22 June
!style="text-align: right"| 1 - 22 June
|
|
# total 1500 stems in dix
# total 4500 stems in dix
# clean testvoc for {{tag|postadv}} {{tag|ij}}
# adding unknown words {{tag|postadv}} {{tag|ij}} {{tag|adv}}
# adding transfer rules
# adding transfer rules
# 500-word evaluation, WER ~30%
# 500-word evaluation, WER ~52%



| {{Workeval5|0}}
|
|
|
* demonstrated ability to work with constraint grammar rules and disambiguating
* Needs for CG rules in eng-kaz.rlx


|-
|-
Line 103: Line 94:
!style="text-align: right"| 23 - 29 June
!style="text-align: right"| 23 - 29 June
|
|
# total 2200 stems in dix
# total 5100 stems in dix
# clean testvoc for {{tag|num}} {{tag|post}}
# adding unknown words {{tag|num}} {{tag|post}} {{tag|prn}} {{tag|det}}
# adding transfer rules
# adding transfer rules


| {{Workeval5|0}}
|
|
* demonstrated ability to work with transfer rules and chunking
# -
* Needs for transfer rules in eng-kaz.t1x
|

|-
|-
! 3
! 3
!style="text-align: right"| 30 - 6 July
!style="text-align: right"| 30 - 6 July
|
|
# total 2900 stems in dix
# total 5700 stems in dix
# clean testvoc for {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
# adding unknown words {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
# adding transfer rules
# adding transfer rules



| {{Workeval5|2}}
|
# --
# --
|
|

* demonstrated ability to work with transfer rules and chunking
* Needs for transfer rules in eng-kaz.t1x


|-
|-
Line 130: Line 119:
!style="text-align: right"| 7 - 13 July
!style="text-align: right"| 7 - 13 July
|
|
# total 3600 stems in dix
# total 6300 stems in dix
# clean testvoc for {{tag|adv}}
# adding unknown words {{tag|adj}} {{tag|n}} {{tag|np}}
# adding transfer rules
# adding transfer rules
|


* demonstrated ability to work with transfer rules
|{{Workeval5|2}}
* Needs for transfer rules in eng-kaz.t2x
|rowspan="2"|
# --
# --

|rowspan="2"|


|-
|-
Line 145: Line 131:
!style="text-align: right"| 14 - 20 July
!style="text-align: right"| 14 - 20 July
|
|
# total 4200 stems in dix
# total 6900 stems in dix
# clean testvoc for {{tag|prn}} {{tag|det}}
# clean testvoc for #
# adding transfer rules
# adding transfer rules

|{{Workeval5|3}}
|

* demonstrated ability to work with transfer rules
* Needs for transfer rules in eng-kaz.t2x
|-
|-
! 6
! 6
!style="text-align: right"| 21 - 27 July
!style="text-align: right"| 21 - 27 July
|
|
# total 4900 stems in dix
# total 7300 stems in dix
# clean testvoc for {{tag|adj}} {{tag|adj}}{{tag|advl}}
# correcting tags and clean testvoc for #
# adding transfer rules
# adding transfer rules
|{{Workeval5|3}}
|rowspan="3"|
# --

|rowspan="2"|


|
* Needs for "cleaning" tags rules in eng-kaz.t4x
|-
|-
! 7
! 7
!style="text-align: right"| 28 - 3 August
!style="text-align: right"| 28 - 3 August
|
|
# total 5600 stems in dix
# total 7900 stems in dix
# adding transfer rules
# adding transfer rules
|{{Workeval5|2}}
|-
!colspan="2" style="text-align: right"| 4 - 11 August
|
# total 6300 stems in dix
# 500-word evaluation, WER ~30%


|{{Workeval5|2}}
|
|
* Needs for transfer rules in eng-kaz.t1x

* Needs for transfer rules in eng-kaz.t4x
|-
|-
! 8
! 8
!style="text-align: right"| 4 - 10 August
!style="text-align: right"| 4 - 10 August
|
|
# total 7200 stems in dix
# total 8500 stems in dix
# correcting tags and clean testvoc for #
# clean testvoc for {{tag|n}} {{tag|num}}{{tag|subst}} {{tag|np}} {{tag|adj}}{{tag|subst}}
# adding transfer rules
# trimmed coverage 75%

|{{Workeval5|2}}
|rowspan="3"|
# stems in dix: 6493
# trimmed coverage: 79.6%,74.1%
|
|
* Needs for transfer rules in eng-kaz.t1x
* Needs for transfer rules in eng-kaz.t2x
|-
|-
! 9
! 9
!style="text-align: right"| 11 - 17 August
!style="text-align: right"| 11 - 17 August
|
|
# total 8000 stems in dix
# total 9100 stems in dix
# adding transfer rules
# trimmed coverage 78%

|{{Workeval5|2}}
|
|
* Needs for transfer rules in eng-kaz.t2x
|-
|-
! 10
! 10
!style="text-align: right"| 18 - 24 August
!style="text-align: right"| 18 - 22 August
|
|
# total 8800 stems in dix
# total 9700 stems in dix
# finish with WER ~50%
# trimmed coverage 81%
# adding transfer rules
|{{Workeval5|3}}

|
|
*Finishing documentation

|-
|-

! 11
!style="text-align: right"| 25 - 31 August
|
# total 9600 stems in dix
# clean testvoc for {{tag|v}}
# trimmed coverage 83%
|{{Workeval5|3}}
|
# stems in dix: 6730
# trimmed coverage: 82.5%,78.4%
# azattyq_24455849 WER: 6.62%
|
|-
! 12
!style="text-align: right"| 1 - 7 September
|
# total 10400 stems in dix
# trimmed coverage 85%
|{{Workeval5|3}}
|
# stems in dix: 7007
# trimmed coverage: 84.2%,79.8%
|
* Good [[Turkic_lexicon#Kyrgyz|adjective typology]]
* Decent progress on coverage
* Not around much later in the week
* Still no testvoc...
—[[User:Firespeaker|Firespeaker]] 07:29, 10 September 2013 (UTC)
|-
! 13
!style="text-align: right"| 8 - 15 September
|
# total 11200 stems in dix
# trimmed coverage 87%
|{{Workeval5|1}}
|
# stems in dix: 7454
# trimmed coverage: 85.2%,80.4%
|
* Decent increase in coverage
* Still no testvoc
* Still ~600 unsorted ADJ
* Not around much
—[[User:Firespeaker|Firespeaker]] 20:06, 22 September 2013 (UTC)
|-
!colspan="2" style="text-align: right"| pencils-down week<br />final evaluation<br />16 - 23 September
|
# total 12000 stems in dix
# 500-word evaluation, WER ~10%
# clean testvoc for all categories
# trimmed coverage 88%
# release 0.1.0 and move to trunk
|
|
# stems in dix: 7546
# trimmed coverage: 85.8%,81.6%
|
* Good coverage
* "Good" WER results
** But lots of # and * errors :(
* No work on testvoc
* Some ADJ sorted; still >500 unsorted
* only 2 sets of LRX rules since early in GSoC
* only 1 transfer rule since early in GSoC
|-
!colspan="2" style="text-align: right"| Final evaluation
|
|
|
|
* Has improved coverage a certain amount
* Has not done anything else
* Mentors have had to nag to get him to work
* Has not been around enough
* Among the lowest-performing students
* Has not improved since midterm
* Last-ditch efforts not at all impressive
|}
|}



== Tips and Tricks ==

=== Adding stems quickly ===
[[Category:GSoC 2014 Student proposals|Aida]]
* Add top stems from frequency lists of unknown forms
* Use spectie's dix-entries-to-be-checked script

Latest revision as of 05:24, 13 May 2014

Contact information


Name: Aida Sundetova
E-mail address: sun27aida@gmail.com
Nick on the IRC #apertium channel: Aida
Username at sourceforge: aida27

Why is it you are interested in machine translation?

I started learning machine translation in 2012, when I joined a project to develop English-to-Kazakh machine translation as part of my work in the Intelligence Information Systems laboratory, led by professor Tukeyev U., at the Research Institute of Mechanics and Mathematics. Before that I was already interested in artificial intelligence and process automation. I continued developing English-Kazakh machine translation on Apertium and set out to learn more about the Apertium free/open-source machine translation platform. My knowledge of the languages, my programming experience and my goal of improving translation quality helped me learn XML and improve my working skills.

Why is it that they are interested in the Apertium project?

First, Apertium is a free/open-source machine translation platform, which means that developers from other countries, like me, can join and start building translators for new language pairs. Apertium also uses Unix “pipelines”, which are very useful for fast diagnosis and debugging and let me insert additional modules between existing ones, such as HFST (Helsinki Finite-State Transducer Technology) for morphological analysis and generation of Kazakh.

Which of the published tasks are you interested in? What do you plan to do?

I plan to improve the English-Kazakh language pair until it reaches good translation quality. I have already developed this pair, but it lacks corpora and does not have enough vocabulary to produce adequate translations. My goal is to increase vocabulary coverage by using news and Wikipedia corpora and to come close to a working translator.

Title

Adopting unreleased English-Kazakh language pair

Reasons why Google and Apertium should sponsor it

English-to-Kazakh machine translation is very important. Because Kazakh is a Turkic language, the transfer and constraint grammar rules that I write for this pair can be useful for developing other English-Turkic language pairs.

How and who it will benefit in society

Kazakh speakers, about 16 million people, can use English-Kazakh translation to understand English texts such as news and papers. In addition, as part of this work the Kazakh-English language pair will be improved by enlarging the bilingual dictionary, so English speakers could use the translator to understand texts in Kazakh.

List your skills and give evidence of your qualifications

I have been developing English-to-Kazakh machine translation on Apertium since 2012 and Kazakh-to-English since 2013. I have extensive experience in writing and correcting transfer rules, lexical selection rules and constraint grammar rules, as well as adding vocabulary to monolingual and bilingual dictionaries. I am a fourth-year bachelor's student at Kazakh National University (Almaty, Kazakhstan); my major is Information Systems and I will graduate in summer 2014. I know the following languages: C, C++, C#, HTML and XML, and I have basic knowledge of PHP. In addition, I can work with databases and know SQL. My native language is Kazakh, and I also speak English and Russian.

List any non-Summer-of-Code plans you have for the Summer

Before 20 June I have final exams, but I will have about 3 hours a day to work on the project. After graduating from the university I will be free and can spend 30 hours a week on Apertium.

My plan

Major goals

I plan to work more on vocabulary and add transfer rules where needed. More precisely, I plan to reach:

  • Good WER
  • Clean testvoc
  • ~10000 stems in bidix (~600 stems per week, or ~100 per day)
  • Additional rules: transfer, lexical, constraint grammar (~10 per week, or ~2 per day)
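
The "Good WER" goal refers to word error rate. As a minimal sketch, assuming WER is taken as the word-level edit distance between a reference translation and the system output, divided by the number of reference words, it could be computed as follows. The function name `word_error_rate` is hypothetical and this is not Apertium's own evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a score of 0.0 means the output matches the reference word for word, and values can exceed 1.0 when the output is much longer than the reference.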

Schedule

Timeline

See the GSoC 2014 Timeline for the complete timeline. Important coding dates follow:

  • April 22nd: begin working on project
  • June 27th - August 17th: midterm evaluations
  • August 18th: 'pencils down' date
  • August 22nd: final evaluation

Workplan

{|class="wikitable"
! week
! dates
!style="width: 25%"| goals
!style="width: 35%"| notes
|-
!colspan="2" style="text-align: right"|post-application period<br />23 March - 17 April
|
# finish coding challenge with WER ~55%
# total 3660 stems in dix
|
* Demonstrated ability to add stems to dix and lexc.
* A couple of easy lexical selection rules are still not written.
* Needs transfer rules in eng-kaz.t1x
|-
!colspan="2" style="text-align: right"| 17 April - 1 June
|
# run first testvoc
# get first frequency lists
# write ≥4 lexical selection rules
# write ≥3 transfer rules
# write ≥4 disambiguation rules
note: should be in IRC every day
|
* demonstrated ability to work with lexical selection rules
* Needs lexical selection rules in eng-kaz.lrx
* got only some experience with coverage scripts
|-
! 1
!style="text-align: right"| 1 - 22 June
|
# total 4500 stems in dix
# adding unknown words {{tag|postadv}} {{tag|ij}} {{tag|adv}}
# adding transfer rules
# 500-word evaluation, WER ~52%
|
* demonstrated ability to work with constraint grammar rules and disambiguation
* Needs CG rules in eng-kaz.rlx
|-
! 2
!style="text-align: right"| 23 - 29 June
|
# total 5100 stems in dix
# adding unknown words {{tag|num}} {{tag|post}} {{tag|prn}} {{tag|det}}
# adding transfer rules
|
* demonstrated ability to work with transfer rules and chunking
* Needs transfer rules in eng-kaz.t1x
|-
! 3
!style="text-align: right"| 30 June - 6 July
|
# total 5700 stems in dix
# adding unknown words {{tag|cnjcoo}} {{tag|cnjadv}} {{tag|cnjsub}}
# adding transfer rules
|
* demonstrated ability to work with transfer rules and chunking
* Needs transfer rules in eng-kaz.t1x
|-
! 4
!style="text-align: right"| 7 - 13 July
|
# total 6300 stems in dix
# adding unknown words {{tag|adj}} {{tag|n}} {{tag|np}}
# adding transfer rules
|
* demonstrated ability to work with transfer rules
* Needs transfer rules in eng-kaz.t2x
|-
! 5
!style="text-align: right"| 14 - 20 July
|
# total 6900 stems in dix
# clean testvoc for #
# adding transfer rules
|
* demonstrated ability to work with transfer rules
* Needs transfer rules in eng-kaz.t2x
|-
! 6
!style="text-align: right"| 21 - 27 July
|
# total 7300 stems in dix
# correcting tags and clean testvoc for #
# adding transfer rules
|
* Needs tag-"cleaning" rules in eng-kaz.t4x
|-
! 7
!style="text-align: right"| 28 July - 3 August
|
# total 7900 stems in dix
# adding transfer rules
|
* Needs transfer rules in eng-kaz.t1x
* Needs transfer rules in eng-kaz.t4x
|-
! 8
!style="text-align: right"| 4 - 10 August
|
# total 8500 stems in dix
# correcting tags and clean testvoc for #
# adding transfer rules
|
* Needs transfer rules in eng-kaz.t1x
* Needs transfer rules in eng-kaz.t2x
|-
! 9
!style="text-align: right"| 11 - 17 August
|
# total 9100 stems in dix
# adding transfer rules
|
* Needs transfer rules in eng-kaz.t2x
|-
! 10
!style="text-align: right"| 18 - 22 August
|
# total 9700 stems in dix
# finish with WER ~50%
# adding transfer rules
|
* Finishing documentation
|}
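
As a rough consistency check of the dictionary targets in the workplan above against the "~600 stems per week" goal, the sketch below takes the cumulative "total stems in dix" figures for weeks 1 to 10 and the 3660 starting figure from the post-application period. The variable names are hypothetical, and averaging per workplan row is only approximate, since the row for week 1 spans 1 - 22 June.

```python
# Cumulative "total stems in dix" targets copied from the workplan rows.
start_total = 3660  # post-application period
weekly_targets = [4500, 5100, 5700, 6300, 6900, 7300, 7900, 8500, 9100, 9700]

# Stems to add in each row: difference between consecutive cumulative totals.
previous_totals = [start_total] + weekly_targets[:-1]
increments = [target - before for before, target in zip(previous_totals, weekly_targets)]

# Average number of new stems per workplan row over the whole coding period.
average_per_week = (weekly_targets[-1] - start_total) / len(weekly_targets)

print("per-row increments:", increments)
print("average per row:", average_per_week)
```

Most rows ask for 600 new stems, which is in line with the stated weekly goal; the first row is larger and the row for week 6 is smaller.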