User:Oldtrafford.kedar

From Apertium
Revision as of 12:10, 9 April 2010 by Oldtrafford.kedar

Name:

KEDAR KULKARNI


E-mail address:


oldtrafford.kedar@gmail.com


Other information that may be useful for contact:


PH-NO: +919160011165



Why is it you are interested in machine translation?

Machine translation is a tool through which you can access texts in other languages. I find the study of languages fascinating. My mother tongue is Marathi, but I was brought up in an environment where I was more exposed to Hindi than to Marathi, so I mostly communicated in Hindi. However, I belong to a family steeped in Marathi culture, and I am interested in reading Marathi literature, so MT is a tool that can help me understand it. I have also been keenly following the advances in MT for a few years, which created an interest in developing MT systems myself.



Why is it that you are interested in the Apertium project?


Apertium is a machine translation platform. It provides an engine and a toolbox that allow you to build your own MT systems, and it is open source and open content. Since I am interested in building an MT system, I was looking for available resources. Two that interested me were Anusaaraka and Apertium. Anusaaraka gives language access but does not give a translation. It is also not very user friendly, as using it requires proper training. Apertium, on the other hand, is very user friendly and can be used straight out of the box. So here is an opportunity to test the usability of Apertium on closely related languages such as Marathi and Hindi. Apertium is small and efficient, so closely related Indic languages should work well on it. Since no work has been done on Indic languages on the Apertium platform (except for Urdu-Hindi), I thought this was an opportunity to show the usefulness of Apertium for Indic languages.



Which of the published tasks are you interested in?


Apertium: Machine translation from Marathi to Hindi



What do you plan to do?

STATE OF THE ART:
1. Marathi morph analyzer with around 80% coverage on web text, in Anusaaraka format.
2. Hindi morph analyzer with around 90% coverage on web text, in Anusaaraka format.
3. Marathi-Hindi bilingual dictionary with around 15K headwords.
4. Working Marathi-Hindi Anusaaraka system producing core Anusaaraka output.

COMMUNITY BONDING PERIOD:
• Learning the Apertium framework in general.
• Use of Apertium viewer.
• An overview of what is available in Anusaaraka.

WEEK 1:
• Developing programs for converting Anusaaraka morph analyzers to Apertium format.
• Building an Apertium morphological dictionary for the 5000 most frequent words of Marathi and Hindi.
• Converting WX resources to Unicode data.
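The WX-to-Unicode step above can be sketched as follows. This is a minimal illustration, not the conversion program of the proposal: it assumes the commonly used WX values for Devanagari, covers only the core letters (Marathi-specific letters, nukta forms and digits are left out), and the function name `wx_to_devanagari` is made up for this example.

```python
# Partial WX -> Devanagari tables (assumption: standard WX values; rare letters omitted).
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
# Each vowel maps to (independent letter, dependent sign); the inherent "a" has no sign.
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}
MODIFIERS = {"M": "ं", "H": "ः", "z": "ँ"}
VIRAMA = "्"


def wx_to_devanagari(text):
    """Convert a WX-encoded string to Devanagari, one character at a time."""
    out = []
    pending = False  # True while the last consonant has not yet received a vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress the inherent vowel
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)
            out.append(MODIFIERS.get(ch, ch))  # unknown characters pass through unchanged
            pending = False
    if pending:
        out.append(VIRAMA)  # a trailing bare consonant keeps an explicit virama
    return "".join(out)
```

For example, `wx_to_devanagari("rAma")` gives `राम`. Because WX uses one ASCII character per letter, a single left-to-right pass with one flag is enough; a real converter would also need the letters omitted here.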

WEEK 2 & WEEK 3:
• Checking the completeness of paradigms in Unicode format and providing missing paradigms, if any.
• Testing morphological analyzers on various samples from Wikipedia to ensure that coverage is at least 80%.
• Adding enough entries from high-frequency words to reach 80% coverage for Marathi.
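The coverage figure in the plan above can be measured roughly as below. This sketch assumes the analysed text is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function name and the tags in the sample are illustrative only, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # Unknown words are echoed back with a leading asterisk.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^घर/घर<n><m><sg>$ ^xyzzy/*xyzzy$ ^है/हो<vblex><pres>$ ^में/में<post>$"
print(coverage(sample))
```

This counts running tokens rather than distinct word forms, so frequent words weigh more, which matches the idea of adding high-frequency entries first.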

WEEK 4:
• Marathi-Hindi transfer rules. Since Marathi and Hindi are very similar, most of the work will be in t1x, a little in t2x and almost none in t3x.

DELIVERABLE AT THE END OF THE 4TH WEEK: Marathi and Hindi morphological analyzers with standardized tagsets.

WEEK 5:
• Developing a program to convert the Marathi-Hindi bilingual Anusaaraka dictionary to Apertium format.
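A dictionary conversion of this kind might look like the sketch below. The output follows the Apertium bilingual dictionary entry shape (`<e><p><l>…</l><r>…</r></p></e>` with `<s n="…"/>` tags). The input layout, one tab-separated `marathi, hindi, tag` row per line, is purely an assumption for illustration, since the Anusaaraka dictionary format is not described here. The function names are likewise made up.

```python
import xml.etree.ElementTree as ET


def bidix_entry(source_lemma, target_lemma, pos_tag):
    """Build one bilingual dictionary entry and return it as an XML string."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", source_lemma), ("r", target_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        ET.SubElement(node, "s", {"n": pos_tag})  # part-of-speech symbol
    return ET.tostring(entry, encoding="unicode")


def convert_lines(lines):
    """Convert rows of the assumed 'marathi<TAB>hindi<TAB>tag' layout to entries."""
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows instead of guessing
        entries.append(bidix_entry(*fields))
    return entries


# Hypothetical input: the Marathi and Hindi words for "water", tagged as a noun.
for xml_entry in convert_lines(["पाणी\tपानी\tn\n"]):
    print(xml_entry)
```

Building the entries with an XML library rather than string concatenation keeps special characters escaped correctly. Multiword lemmas would additionally need the blank element `<b/>` in place of spaces, which this sketch does not handle.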

WEEK 6 & WEEK 7:
• Ensuring that the words in the Marathi-Hindi dictionary are covered by the morphological analyzers, and adding them if they are not.

WEEK 8:
• Testing the bilingual dictionary on random Wiki pages to check that it reaches 80% coverage.

DELIVERABLE AT THE END OF THE 8TH WEEK: Bilingual dictionary with 80% coverage for Marathi → Hindi.

WEEK 9 & WEEK 10:
• Training a POS tagger for both Marathi and Hindi.
• Developing a mapping from ILMT tags to Apertium tags and exploring the possibility of using the POS data of ILMT for training POS taggers for Marathi and Hindi.
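The tag mapping mentioned above could start from a simple lookup table like the one sketched here. The correspondences are an assumption made for illustration, not an agreed standard: the ILMT tagset is coarser in places (for instance `CC` covers more than coordinating conjunctions), so a real mapping would need manual review. Tags without a mapping are collected rather than silently dropped.

```python
# Assumed ILMT -> Apertium part-of-speech correspondences (illustrative, incomplete).
ILMT_TO_APERTIUM = {
    "NN": "n",       # common noun
    "NNP": "np",     # proper noun
    "PRP": "prn",    # pronoun
    "VM": "vblex",   # main verb
    "VAUX": "vaux",  # auxiliary verb
    "JJ": "adj",     # adjective
    "RB": "adv",     # adverb
    "PSP": "post",   # postposition
    "CC": "cnjcoo",  # conjunction (lossy: ILMT does not split coordinating/subordinating)
}


def map_tagged_tokens(tokens):
    """Map (word, ILMT tag) pairs to (word, Apertium tag); report unmapped tags."""
    mapped = []
    unmapped = set()
    for word, tag in tokens:
        if tag in ILMT_TO_APERTIUM:
            mapped.append((word, ILMT_TO_APERTIUM[tag]))
        else:
            mapped.append((word, None))
            unmapped.add(tag)
    return mapped, unmapped


mapped, unmapped = map_tagged_tokens([("घर", "NN"), ("है", "VM"), ("!", "SYM")])
print(mapped)
print(unmapped)
```

Returning the set of unmapped tags makes it easy to see which parts of the ILMT data cannot yet be reused for tagger training.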

WEEK 11:
• Testing and improving the quality and coverage of the translation.

WEEK 12:
• Testing the complete machine translation system on Wikipedia texts and evaluating it.




Applicants should also include a two- to eight-page proposal, including a title, reasons why Google and Apertium should sponsor it, a description of how and who it will benefit, and a detailed work plan including, if possible, a schedule with milestones and deliverables. Include time needed to think, to program, to document and to disseminate.


There are no open-source resources for Indic languages except for Anusaaraka. However, Anusaaraka does not have an MT component. Apertium currently has no language pairs for Indic languages either. Hence this group of languages would be a good addition to Apertium and would expand its horizon. It would also act as a building block for other language pairs in the Indic language group.

Google is an organization which works for the benefit of society. Google doesn't have a Hindi-Marathi language pair, so it may find this project interesting and adopt this work in its Translator Toolkit. This project is also beneficial to society, as explained below.

Marathi is the 4th most spoken language in India. Mahabhasya by Patanjali is available in Marathi but not in Hindi, so it is not accessible to the Hindi-speaking population of the country. Mahabhasya is only one example; there are many such cases in Marathi literature, and MT would help in all of them. The Hindi-Marathi MT system can also serve as a case study for building Telugu-Hindi, Kannada-Hindi and Punjabi-Hindi systems, because Anusaaraka systems are already available for these languages under the GPL.



In the proposal, list your skills and give evidence of your qualifications. Tell us what your current field of study, major, etc. is.


I am currently in my first year of the Integrated Masters Program in Economics at the University of Hyderabad, Hyderabad. I am good at shell scripting and Perl programming. I think I am a good manager and leader, so I can build a team to work on the project.




Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.


I am an 18-year-old student who has just finished school. I am fascinated by the open-source tools available on the web. With my limited knowledge of shell and Perl programming, I was able to convert the Marathi-Hindi bilingual dictionary from the Anusaaraka format to the Apertium format in a couple of days. With this experience, I am confident that during this summer I can contribute substantially by developing Marathi-Hindi Apertium using the resources from Marathi-Hindi Anusaaraka, both of which are available under the GPL.




Please list any non-Summer-of-Code plans you have for the Summer, especially employment and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.

No. I have no other plans if my application is selected.




REFERENCES:


Anusaaraka: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/anu/index.htm

Kulkarni, Amba. Anusaaraka: An Accessor cum Machine Translator. Speech at the "First Workshop on Free Rule Based MT", Alicante, Spain, 2 November 2009.

Bharati, Akshar, Amba P. Kulkarni, and Dipti Misra Sharma. Anusaaraka: A better approach to Machine Translation (A case study for English-Hindi/Telugu). Presented at Language Technology Tools: Implementation of Telugu, a 3-day national conference, 8-10 October 2003, University of Hyderabad, Hyderabad.

Kulkarni, Amba P. Design and Architecture of anusAraka: An Approach to Machine Translation. Satyam Technical Review, vol. 3, October 2003.