User:Surajkawade/GSOC proposal: Marathi and English

From Apertium

Name

Suraj Kawade

Contact information

IRC nick : develover

E-mail : suraj.kawade@gmail.com / suraj.kawade@hotmail.com

Phone no : +918983005859 / +919404943130

Skype username : yesiamsuraj

Why are you interested in machine translation?

Since I am interested in linguistics and love programming, machine translation is a magnet for me! The world is culturally diverse, and languages are both barriers and gateways to its cultures. I read on Wikipedia that "There are between 6000 and 7000 languages currently spoken, and that between 50-90% of those will have become extinct by the year 2100". I was shocked, but I don't want to feel helpless. Though humans speak in different tongues, they express the same things! So why shouldn't I follow my curiosity to learn how these languages are related and how they differ? And why not help society keep the gift its ancestors gave it? Everything is going digital and moving fast, and so is the field of NLP. MT plays a large part in it, and I want to be a part of it, however small.

Why are you interested in the Apertium project?

The best things in the world are free (as in 'freedom')! Open source is free, and Apertium is open source. So, by the law of transitivity, Apertium is the best thing. If I say I do not want languages dying in front of my eyes, I should help prevent it, and that is how I found Apertium. I think Apertium is a community of knowledgeable, inspiring people who are really enthusiastic about a common cause and, most importantly, love what they do. (I figured this out while talking to them in the IRC channel.) Just as importantly, to "do" something to preserve a language with Apertium, you need few resources at the beginning, which is really helpful, less hectic and hence encouraging. Apertium uses rule-based translation rather than plain word-for-word dictionary lookup, so it works with the grammatical analysis of words and not just their surface forms, which brings it closer to how humans translate.

Why should Google and Apertium sponsor it?

On learning that nothing of release quality exists in Apertium for Marathi, I decided to work on it. Marathi is written in the Devanagari script, and Apertium has yet to release a pair containing a Devanagari-script language (most of them are in the incubator). Doing extensive work and bringing the Marathi-English pair to release quality will also encourage the adoption of the Devanagari languages in the incubator.

How will it benefit society, and whom?

Marathi is the 19th most spoken language in the world and an official language of the state of Maharashtra. Though Marathi has a rich literature and a glorious history, no reliable, high-quality translation solution is available as of now. Even Google Translate does not provide Marathi translation. It is observed that Marathi-speaking students struggle more in learning English than other Indian students, who, to some extent, have digital tools available to them. With the explosion of tablets and smartphones and the easy availability of the Internet in India, people are using social-networking sites a lot, have started blogging and are reading news online. In such times, not having a good Marathi-English translation tool is inconvenient. Creating such a tool with Apertium will not only serve this need but also benefit a large community.

Which of the published tasks are you interested in? What do you plan to do?

As there is no Marathi-English pair in Apertium so far, I am starting to work on it from scratch. There is a Marathi-Hindi bilingual dictionary in the incubator, but I don't know how complete it is. I will check whether it can help my project. My interest and enthusiasm tell me to try to bring the Marathi-English pair to release quality.

Work plan

Coding challenge

Installation

  • I installed a new Ubuntu machine in VirtualBox for the Apertium installation.
  • firespeaker helped me to install the system on my machine.
  • After installing Apertium and lttoolbox, I decided to install the en-es language pair.
  • Initially I got lots of errors and problems regarding permissions, but with the help of firespeaker I succeeded in installing the system.

Getting Started

  • Then I was introduced to spectie in the IRC channel. He gave me links to the documentation on "How to start a new language pair". He also gave me links to study the basic structural and functional elements of the Apertium system and how it works.
  • spectie gave me a document (a story) in English to translate into Marathi. I completed it and sent it to him.
  • spectie created a basic Marathi-English system for me, and since I was familiar with the symbols and terminology (through reading the documentation), I understood it quickly. He also added some words to the monolingual and bilingual dictionaries and gave me a list of words to add myself. Initially I felt I was doing it too slowly, but after adding more words I got the hang of the mechanism and I am comfortable with it now.
  • spectie showed me how to check out the code, make changes and commit them. I did it successfully.

https://apertium.svn.sourceforge.net/svnroot/apertium/incubator/apertium-mar-eng/
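
To give an idea of what adding words to the bilingual dictionary involves, here is a minimal sketch in Python that builds entries in the usual Apertium bidix shape (an `<e>` element holding a `<p>` pair with left and right sides). The helper name `bidix_entry` and the two word pairs are only illustrative assumptions; they are not taken from the actual apertium-mar-eng files, and real entries may carry more tags (gender, number and so on).

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, tags):
    """Build one <e> element pairing a Marathi lemma with an English one.

    Hypothetical helper for illustration; not part of any Apertium tool.
    """
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    for side, lemma in (("l", left), ("r", right)):
        node = ET.SubElement(p, side)
        node.text = lemma
        # Each grammatical tag becomes an <s n="..."/> symbol after the lemma.
        for tag in tags:
            ET.SubElement(node, "s", n=tag)
    return e


# Illustrative word pairs only (assumed, not copied from the repository).
pairs = [("घर", "house", ["n"]), ("पाणी", "water", ["n"])]

for left, right, tags in pairs:
    print(ET.tostring(bidix_entry(left, right, tags), encoding="unicode"))
```

In practice I edit the dictionary files by hand, but a script like this could help when adding many words from a list.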

Community Bonding Period

  • I first joined the #apertium IRC channel, where I got introduced to fellow members of the community.
  • Then I joined the mailing list and created a user account on the Apertium wiki.
  • I got proper directions about what I should read and how to install the Apertium system on my machine from community members (especially firespeaker) on IRC.
  • I got a real push and encouragement from spectie, who helped me to create a basic mar-eng pair and helped me through the coding challenge.
  • I have been learning a lot from the IRC channel and the available documentation.
  • I have become so positive about Apertium that even if my project gets rejected I will keep working on my Marathi-English pair, because I am working 'for' my people 'with' good people!
  • I am planning to learn as much as possible before the actual coding commences, so that I 'bleed' less in the 'war' and win it.

Week Plan

Week Task
1 Get to know the mentor and discuss the whole plan with him. Start implementing the Marathi monodix according to the word frequency list. Add closed-class words.
2 Continue working on the monodix. Add open-class words.
3 Continue working on the monodix and adding more open-class words. Start working on the Marathi-English bilingual dictionary.
4 Continue adding more words to the monolingual and bilingual dictionaries.
Deliverable #1 A monolingual dictionary containing at least 4000 words (4000 words × 3 min/word = 12000 min = 200 hours; at 6 hours per day that is about 33 days).
5 Continue and wrap up adding words to the monolingual dictionary. Continue adding words to the bilingual dictionary.
6 Continue adding more words to the bilingual dictionary.
7 Implement disambiguation rules for Marathi.
8 Implement transfer rules for Marathi->English.
Deliverable #2 Completion of the morphological dictionaries, with 8000 words in the monodix and about 10000 words in the bidix.
9 Complete the implementation of the disambiguation and transfer rules and the design of the constraint grammar.
10 testvoc
11 testvoc
12 Wrap up testvoc, clean up, evaluate the results and complete the documentation.
Deliverable #3 Completion of the project.
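
As a rough check of the time estimate given with Deliverable #1, the arithmetic can be sketched in Python. The function name `days_needed` and its parameter names are only illustrative; the figures (4000 words, 3 minutes per word, 6 hours per day) are the ones from the plan above.

```python
def days_needed(words, minutes_per_word, hours_per_day):
    """Return (total hours, working days) needed to add `words` entries."""
    total_hours = words * minutes_per_word / 60
    return total_hours, total_hours / hours_per_day


# Figures from Deliverable #1: 4000 words, 3 minutes each, 6 hours a day.
hours, days = days_needed(4000, 3, 6)
print(f"{hours:.0f} hours, about {days:.0f} working days")
# prints "200 hours, about 33 working days"
```

The same function can be reused with other word counts or a different per-word rate if the pace turns out faster or slower than assumed.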


  • I will do the documentation work alongside the project.
  • At the end of the project I will discuss the future development of the Marathi-English language pair and look for potential contributors.

List your skills and give evidence of your qualifications

Currently I am a final-year student of Computer Engineering at the University of Pune. Marathi is my mother tongue, and I can also speak and write good English.

Regarding my experience with open source, I have been working on the l10n of Firefox in Hindi and have started contributing to Marathi too. I have also attended their workshops and seminars. I have been a member of the Pune Linux User Group for the past 4 years. I am quite fascinated by the concept of the Raspberry Pi and have given a seminar on it at my college.

I am comfortable with XML and HTML.

I am self-taught in Python, have intermediate knowledge of it, and plan to continue learning after the university exams.

I know C/C++, but have not done any big projects in them.

I have studied the subjects Theory of Computation and Compilers.

Regarding MT, I am a newbie and see a GSoC project with Apertium as a great opportunity to learn everything possible.

My non-Summer-of-Code plans for the Summer

I will be a bit busy with university exams until they finish on the 10th of June, after which I have no other commitments. I will be available for about 40-45 hours a week for the entire project period. On Saturdays and Sundays I can add more hours of work.