User:Dshgna/GSoC 2014 Proposal

From Apertium

Name: Dulshani Gunawardhana

E-mail address: dulshani[dot]gunawardhana89@gmail[dot]com

IRC: dshgna


Why is it you are interested in machine translation?

I am an Information Technology undergraduate with a love for linguistics. Machine translation is the perfect combined application of both these interests! Additionally, I live in a multilingual country, which has given me first-hand experience of the political, socio-economic and educational divides caused by the language barrier. This makes me appreciate the need for MT and the difference it could make in making the world a better place.

Why is it that you are interested in the Apertium project?

The concept of software freedom, especially when applied to a domain as complex as MT, is extremely appealing to me. Apertium's emphasis on less-resourced languages particularly interests me, as it opens the door to many MT projects that would otherwise never see the light of day due to lack of funding and interest.


Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair: Sinhala-Tamil (si-ta)

This task involves implementing bi-directional translation for the Sinhala-Tamil language pair on the Apertium platform. This includes developing the skeleton monodix and bidix dictionaries I have already created and implementing transfer rules for Sinhala and Tamil.


Why Google and Apertium should sponsor it?

Apertium currently has no Sinhala-Tamil language pair. Both are low-resource languages that lack open-source MT systems. The only related language pair in Apertium is Sinhala-English, which is in the incubator. A quick literature review showed that Sinhala-Tamil translation has only been attempted using SMT, which yielded poor results due to the lack of language resources.

Sponsoring my work on this language pair will enable me to develop resources for two less-resourced languages, which in turn will enable others to use them in future work.

A description of how and who it will benefit in society

The biggest benefit is that it would help overcome the language barrier between the Sinhala and Tamil people of Sri Lanka (an issue that was one of the primary causes of a long and bloody civil war). In addition, it would create valuable open-source resources that could be used in many future projects, such as language learning.


Work plan

Community bonding period

Get acquainted with Apertium mentors and fellow students

Study Sinhala and Tamil grammatical structures and other Indic/Dravidian language implementations

Learn about testing mechanisms

Work Period

Week 1: Improve Sinhala monodix to a goal of 3000 total words.

Week 2: Improve Tamil monodix by creating pardefs and adding nouns, adverbs, adjectives and numerals to a goal of 3000 total words.

Week 3: Improve the Tamil monodix by adding verbs and adjectives to a goal of 4000 total words.

Week 4: Improve bidix for si-ta on existing words

Deliverable #1: improved si and ta monodix and si-ta bidix
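
Progress against the word-count goals above could be checked with a small script. The following is only a sketch: it assumes that each "word" corresponds to one `<e>` entry inside a `<section>` of the monodix, and the embedded sample dictionary, its lemmas and the paradigm name `placeholder__n` are hypothetical placeholders rather than real Sinhala or Tamil data.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature monodix; lemmas and paradigm names are placeholders.
SAMPLE_DIX = """<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"/>
  </sdefs>
  <pardefs>
    <pardef n="placeholder__n">
      <e><p><l></l><r><s n="n"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="lemma1"><i>lemma1</i><par n="placeholder__n"/></e>
    <e lm="lemma2"><i>lemma2</i><par n="placeholder__n"/></e>
  </section>
</dictionary>"""


def count_entries(dix_xml: str) -> int:
    """Count <e> elements that are direct children of <section> elements.

    Entries inside <pardef> are not counted, on the assumption that only
    section entries correspond to dictionary words.
    """
    root = ET.fromstring(dix_xml)
    return sum(len(section.findall("e")) for section in root.iter("section"))


def progress(dix_xml: str, goal: int) -> str:
    """Format the entry count against a weekly goal."""
    count = count_entries(dix_xml)
    return f"{count}/{goal} entries ({100 * count / goal:.1f}%)"


if __name__ == "__main__":
    print(progress(SAMPLE_DIX, 3000))
```

For a real dictionary file, the XML string would be read from disk instead of the embedded sample.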

Week 5: Implement transfer rules for Sinhala->Tamil

Week 6: Implement transfer rules for Tamil->Sinhala

Week 7 (Midterm): Improve transfer rules for Tamil->Sinhala

Deliverable #2: Basic transfer rules for Sinhala <-> Tamil

Week 8: Improve Sinhala monodix to a total of 8000 words

Week 9: Improve Tamil monodix to a total of 8000 words

Week 10: Improve transfer rules for Sinhala <-> Tamil to reflect newly added words

Deliverable #3: updated si-ta monodix, bidix and transfer rules

Week 11: Testing using testvoc

Week 12: Testing using testvoc and releasing

Week 13: Project completion and final evaluation
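
The kind of check planned for the testing weeks can be illustrated with a simplified sketch. It does not run Apertium or the real testvoc scripts; it only scans translator output that has already been produced, looking for the markers Apertium conventionally prefixes to problem tokens (`*` for words unknown to the analyser, `@` for lemmas missing from the bidix, `#` for forms the generator could not produce). The function name `find_problems` and the sample lines are hypothetical.

```python
import re

# A marker counts only at the start of a token (line start or after whitespace).
MARKER = re.compile(r"(?<!\S)([*@#])(\S+)")


def find_problems(lines):
    """Return (line_number, marker, token) for every marked token found."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        for marker, token in MARKER.findall(line):
            problems.append((lineno, marker, token))
    return problems


# Hypothetical translator output, not real Sinhala-Tamil data.
sample_output = [
    "clean translated line",
    "a line with @untranslated token",
    "a line with #ungenerated and *unknown tokens",
]

if __name__ == "__main__":
    for lineno, marker, token in find_problems(sample_output):
        print(f"line {lineno}: {marker}{token}")
```

An empty result would suggest that the scanned output is clean with respect to these three markers; it says nothing about translation quality.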

List your skills and give evidence of your qualifications

Programming Challenge

Current progress is as follows:

1. Created a frequency-based ordering of approx. 100 words from the manually translated Sinhala text

2. Created pardefs and added approximately 40 of these words to the Sinhala monodix

3. Created pardefs and added approximately 40 of the corresponding Tamil words to the monodix

4. Created skeleton bidix and transfer rules files.
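
The frequency-based ordering in step 1 could be produced along the following lines. This is a minimal sketch, not the script actually used: it assumes whitespace tokenisation and ASCII punctuation stripping are sufficient, and the sample text is made of placeholder tokens rather than real Sinhala words.

```python
import string
from collections import Counter


def frequency_order(text, top_n=100):
    """Return up to top_n (token, count) pairs, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    tokens = (t.strip(string.punctuation) for t in text.split())
    counts = Counter(t for t in tokens if t)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    sample_text = "aa bb aa cc, bb aa."  # placeholder tokens
    for token, count in frequency_order(sample_text):
        print(token, count)
```

Adding the most frequent words to the monodix first gives the largest coverage gain per entry, which is why the ordering is a useful starting point.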


I am a fourth year undergraduate studying Information Technology at the University of Moratuwa, Sri Lanka.

Linguistic Skills : Sinhala (native), Tamil (good), English (fluent), Hindi (fair). Hindi was very beneficial, as I studied the existing Hindi-English language pair when creating the skeleton si-ta pair.

Related Course Modules : Automata Theory (2013, top 1%), Theory of Computability and Complexity (current)

Programming Skills : Python, Java, C++, C, MATLAB. I am willing to learn Perl if required.

Other : Git, SVN

List any non-Summer-of-Code plans you have for the Summer

I have no other non-GSoC commitments and can put in the required amount of time per week. I have exams from May 26th to June 13th and wish to start work during the community bonding period itself to minimize any time lost during exams.