User:Dshgna/GSoC 2014 Proposal
Latest revision as of 06:13, 18 March 2014
Name: Dulshani Gunawardhana
E-mail address: dulshani[dot]gunawardhana89@gmail[dot]com
IRC: dshgna
Contents
- 1 Why is it you are interested in machine translation?
- 2 Why is it that you are interested in the Apertium project?
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 Why Google and Apertium should sponsor it?
- 5 A description of how and who it will benefit in society
- 6 Work plan
- 7 List your skills and give evidence of your qualifications
- 8 List any non-Summer-of-Code plans you have for the Summer
Why is it you are interested in machine translation?
I am an Information Technology undergraduate with a love for linguistics, and machine translation is the perfect application combining both interests. Additionally, I live in a multilingual country, which has given me first-hand experience of the political, socio-economic and educational divides caused by the language barrier. This makes me appreciate the need for MT and the difference it could make in the world.
Why is it that you are interested in the Apertium project?
The concept of software freedom, especially when applied to a domain as complex as MT, is extremely appealing to me. Apertium's emphasis on less-resourced languages particularly interests me, as it opens the door to many MT projects that would otherwise never see the light of day due to lack of funding and interest.
Which of the published tasks are you interested in? What do you plan to do?
Adopt an unreleased language pair: Sinhala-Tamil (si-ta)
This task consists of implementing bi-directional translation for the Sinhala-Tamil language pair on the Apertium platform. It will involve developing the skeleton monodix and bidix dictionaries I have already created and implementing transfer rules for Sinhala and Tamil.
Why Google and Apertium should sponsor it?
Apertium currently has no language pair for Sinhala-Tamil. Both are low-resource languages that lack open source MT systems. The only related language pair in Apertium is Sinhala-English, which is in the incubator. (A quick literature review showed that Sinhala-Tamil translation has only been attempted using SMT, which yielded poor results due to the lack of language resources.)
Sponsoring my work on this language pair will enable me to develop resources for two less-resourced languages, which in turn will enable others to use them in future work.
A description of how and who it will benefit in society
The biggest benefit is that it would help to overcome the language barrier between the Sinhala and Tamil people of Sri Lanka (an issue that was one of the primary causes of a long and bloody civil war). In addition, it would create valuable open source resources that could be used in many future projects, such as language learning.
Work plan
Community bonding period
- Get acquainted with Apertium mentors and fellow students
- Study Sinhala and Tamil grammatical structures and other Indic/Dravidian language implementations
- Learn about testing mechanisms
Work Period
Week 1: Improve Sinhala monodix to a goal of 3000 total words.
Week 2: Improve Tamil monodix by creating pardefs and adding nouns, adverbs, adjectives and numerals to a goal of 3000 total words.
Week 3: Improve the Tamil monodix by adding verbs and adjectives to a goal of 4000 total words.
Week 4: Improve bidix for si-ta on existing words
Deliverable #1: improved si and ta monodix and si-ta bidix
Week 5: Implement transfer rules for Sinhala->Tamil
Week 6: Implement transfer rules for Tamil->Sinhala
Week 7 (Midterm): Improve transfer rules for Tamil->Sinhala
Deliverable #2: Basic transfer rules for Sinhala <-> Tamil
Week 8: Improve Sinhala monodix to a total of 8000 words
Week 9: Improve Tamil monodix to a total of 8000 words
Week 10: Improve transfer rules for Sinhala <-> Tamil to reflect newly added words
Deliverable #3: updated si-ta monodix, bidix and transfer rules
Week 11: Testing using testvoc
Week 12: Testing using testvoc; releasing
Week 13: Project completion; final evaluation
List your skills and give evidence of your qualifications
Programming Challenge
Current progress is as follows:
1. Created a frequency-based ordering of approx. 100 words from the manually translated Sinhala text
2. Created pardefs and added approximately 40 of these words to the Sinhala monodix
3. Created pardefs and added approximately 40 of the corresponding Tamil words to the monodix
4. Created skeleton bidix and transfer rules files.
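The frequency-based ordering in step 1 can be sketched roughly as follows. This is only an illustration in Python, not the script actually used for the challenge: the function name `frequency_order`, the whitespace-and-punctuation tokenization, and the sample text are all assumptions, and real Sinhala text would likely need a more careful tokenizer.

```python
from collections import Counter

# Punctuation stripped from token edges (an assumed, minimal set).
PUNCTUATION = ".,;:!?()\"'"


def frequency_order(text, top_n=100):
    """Return up to top_n (word, count) pairs, most frequent first."""
    # Naive tokenization: split on whitespace, trim surrounding punctuation.
    tokens = [token.strip(PUNCTUATION) for token in text.split()]
    tokens = [token for token in tokens if token]
    counts = Counter(tokens)
    # Sort by descending count, then by the word itself so ties are stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Placeholder text standing in for the manually translated corpus.
    sample = "w1 w2 w1 w3 w2 w1"
    for word, count in frequency_order(sample, top_n=2):
        print(word, count)
```

Ordering candidates this way means the most common words are the first to be added to the monodix, so early dictionary entries cover as much running text as possible.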
I am a fourth-year undergraduate studying Information Technology at the University of Moratuwa, Sri Lanka.
Linguistic Skills: Sinhala (native), Tamil (good), English (fluent), Hindi (fair). Hindi was very beneficial, as I studied the existing Hindi-English language pair when creating the skeleton si-ta pair.
Related Course Modules: Automata Theory (2013, top 1%), Theory of Computability and Complexity (current)
Programming Skills: Python, Java, C++, C, MATLAB. I am willing to learn Perl if required.
Other: Git, SVN
List any non-Summer-of-Code plans you have for the Summer
I have no other non-GSoC commitments and can put in the required amount of time per week. I have exams from May 26th to June 13th and wish to start work during the community bonding period itself to make up for any time lost during exams.