User:Oldtrafford.kedar


Name

KEDAR KULKARNI

Email Address

oldtrafford.kedar@gmail.com

Contact Information

PH-NO: +919160011165

Why is it you are interested in machine translation?

Machine translation is a tool that gives access to texts in other languages, and I find the study of languages fascinating. My mother tongue is Marathi, but I was brought up in an environment where I was more exposed to Hindi than to Marathi, so I mostly communicated in Hindi. However, I belong to a family steeped in Marathi culture, and I am interested in reading Marathi literature. MT is a tool that can help me understand that literature. I have also been closely following the advances in MT for a few years, which has made me interested in developing MT systems myself.

Why is it that you are interested in the Apertium project?

Apertium is a machine translation platform: it provides an engine and a toolbox that allow you to build your own MT systems. It is also open source and open content. Since I am interested in building an MT system, I looked for available resources. Two that interested me were Anusaaraka and Apertium. Anusaaraka only gives language access and does not give a translation. It is also not very user friendly, since using it requires proper training. Apertium, on the other hand, is very user friendly and can be used straight out of the box. This is an opportunity to test the usability of Apertium on closely related languages such as Marathi and Hindi. Apertium is small and efficient, so closely related Indic languages should work well on it. Since no work has been done on Indic languages on the Apertium platform (except for Urdu-Hindi), I see this as an opportunity to show the usefulness of Apertium for Indic languages.

Which of the published tasks are you interested in?

Apertium: Machine Translation from Marathi to Hindi

What do you plan to do?

STATE OF THE ART:
1. Marathi morph analyzer with around 80% coverage on web text, in Anusaaraka format.
2. Hindi morph analyzer with around 90% coverage on web text, in Anusaaraka format.
3. Marathi-Hindi bilingual dictionary with around 15K headwords.
4. Working Marathi-Hindi Anusaaraka system producing core Anusaaraka output.

COMMUNITY BONDING PERIOD:
• Learning the Apertium framework in general.
• Use of Apertium viewer.
• An overview of what is available in Anusaaraka.

WEEK 1:
• Developing programs for converting Anusaaraka morph analyzers to Apertium format.
• Building an Apertium morphological dictionary for the 5000 most frequent words of Marathi and Hindi.
• Converting WX resources to Unicode data.

WEEK 2 & WEEK 3:
• Checking the completeness of paradigms in Unicode format and providing missing paradigms, if any.
• Testing the morphological analyzers on various samples from Wikipedia to ensure that coverage is at least 80%.
• Adding enough entries from high-frequency words to reach 80% coverage for Marathi.
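The WX-to-Unicode conversion planned for week 1 could be sketched roughly as follows. This is a minimal illustration, not the Anusaaraka converter: it assumes the commonly used WX letter assignments for Devanagari, covers only a basic subset (no nukta forms, no Marathi ळ, no special conjunct handling), and the function name `wx_to_devanagari` is hypothetical.

```python
# Minimal WX -> Devanagari sketch (assumed subset of the WX scheme).
VIRAMA = "\u094d"  # halant, suppresses the inherent vowel

CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}

# WX vowel -> (independent letter, dependent sign); "a" has no sign
VOWELS = {
    "a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "U": ("ऊ", "ू"), "q": ("ऋ", "ृ"),
    "e": ("ए", "े"), "E": ("ऐ", "ै"), "o": ("ओ", "ो"), "O": ("औ", "ौ"),
}

SIGNS = {"M": "ं", "H": "ः", "z": "ँ"}  # anusvara, visarga, chandrabindu


def wx_to_devanagari(text):
    """Convert a WX-encoded string to Devanagari (basic letters only)."""
    out = []
    pending = False  # True while the last consonant still awaits a vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: join with halant
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if pending else independent)
            pending = False
        elif ch in SIGNS:
            out.append(SIGNS[ch])
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # bare consonant before a non-letter
            out.append(ch)
            pending = False
    if pending:
        out.append(VIRAMA)  # bare consonant at the end of the input
    return "".join(out)


if __name__ == "__main__":
    print(wx_to_devanagari("marATI"))
    print(wx_to_devanagari("hinxI"))
```

A table-driven design like this keeps the mapping easy to audit against the WX documentation and easy to extend with the Marathi-specific letters that the sketch leaves out.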

WEEK 4: • Marathi-Hindi transfer rules. Since Marathi and Hindi are very similar, most of the work will be in t1x, a little in t2x, and almost none in t3x.

DELIVERABLE AT THE END OF 4TH WEEK: Marathi and Hindi morphological analyzers with standardized tagsets.

WEEK 5: • Developing a program to convert the Marathi-Hindi bilingual Anusaaraka dictionary to Apertium format.
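As an illustration of the week 5 conversion, the sketch below turns dictionary lines into Apertium bilingual-dictionary entries (`<e><p><l>…</l><r>…</r></p></e>` with `<s n="…"/>` tags). The input layout is an assumption made only for this example: one entry per line, tab-separated as Marathi lemma, Hindi lemma, part-of-speech tag. The real Anusaaraka dictionary format may differ, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(src_lemma, tgt_lemma, pos):
    """Build one <e> element of an Apertium bilingual dictionary."""
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    left = ET.SubElement(p, "l")
    left.text = src_lemma
    ET.SubElement(left, "s", n=pos)
    right = ET.SubElement(p, "r")
    right.text = tgt_lemma
    ET.SubElement(right, "s", n=pos)
    return e


def convert_dictionary(lines):
    """Convert assumed 'marathi<TAB>hindi<TAB>pos' lines to <e> elements.

    Blank lines, '#' comments and malformed lines are skipped.
    Multi-word lemmas (which need <b/> for spaces) are not handled.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        entries.append(make_bidix_entry(*parts))
    return entries


if __name__ == "__main__":
    sample = ["# marathi\thindi\tpos", "पाणी\tपानी\tn"]
    for entry in convert_dictionary(sample):
        print(ET.tostring(entry, encoding="unicode"))
```

Building the entries with an XML library rather than string concatenation means special characters in lemmas are escaped automatically.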

WEEK 6 & WEEK 7: • Ensuring that the words in the Marathi-Hindi dictionary are covered by the morphological analyzers, and adding any that are missing.

WEEK 8: • Testing the bilingual dictionary on random Wiki pages to check that it reaches 80% coverage.
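The 80% coverage targets could be measured with a small script over text in the Apertium stream format, where each lexical unit is written as `^surface/analysis$`. The sketch assumes that unknown items are flagged by a leading marker on the first reading: `*` for words the morphological analyser does not know, and (taken here as an assumption) `@` for lemmas missing from the bilingual dictionary. Escaped `^`, `$` and `/` characters inside tokens are not handled, and the function name `coverage` is hypothetical.

```python
import re

# One lexical unit in the Apertium stream format: ^ ... $
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(stream_text, unknown_marker="*"):
    """Return the fraction of lexical units whose first reading is known.

    A unit counts as unknown when the text after the first '/' starts
    with `unknown_marker`. Returns 0.0 for input with no lexical units.
    """
    total = 0
    known = 0
    for unit in LEXICAL_UNIT.findall(stream_text):
        parts = unit.split("/")
        if len(parts) < 2:
            continue  # not an analysed unit, ignore it
        total += 1
        if not parts[1].startswith(unknown_marker):
            known += 1
    return known / total if total else 0.0


if __name__ == "__main__":
    analysed = "^घर/घर<n><m><sg>$ ^xyz/*xyz$"
    print(f"analyser coverage: {coverage(analysed):.0%}")
```

Running the same function with `unknown_marker="@"` on the output of the bilingual lookup step would give a comparable figure for the dictionary, under the marker assumption stated above.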

DELIVERABLE AT THE END OF 8TH WEEK: Bilingual dictionary with 80% coverage in the Marathi-to-Hindi direction.

WEEK 9 & WEEK 10:
• Training a POS tagger for both Marathi and Hindi.
• Developing a mapping from ILMT tags to Apertium tags and exploring the possibility of using the POS data of ILMT for training POS taggers for Marathi and Hindi.
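The tag mapping could start as a simple lookup table. The pairs below are illustrative guesses only: the ILMT tag names and their Apertium counterparts are assumptions for this sketch, and the actual mapping would have to be worked out against both tagsets. Unmapped tags are collected so they can be reviewed by hand.

```python
# Hypothetical ILMT -> Apertium tag pairs; to be verified against both tagsets.
ILMT_TO_APERTIUM = {
    "NN": "n",       # common noun
    "NNP": "np",     # proper noun
    "PRP": "prn",    # pronoun
    "VM": "vblex",   # main verb
    "VAUX": "vaux",  # auxiliary verb
    "JJ": "adj",     # adjective
    "RB": "adv",     # adverb
    "PSP": "post",   # postposition
    "QC": "num",     # cardinal
    "INJ": "ij",     # interjection
}


def map_tags(tagged_tokens):
    """Map (word, ILMT tag) pairs to (word, Apertium tag) pairs.

    Tokens whose tag has no mapping get None, and the set of such
    tags is returned alongside the mapped list for manual review.
    """
    mapped = []
    unmapped = set()
    for word, tag in tagged_tokens:
        target = ILMT_TO_APERTIUM.get(tag)
        if target is None:
            unmapped.add(tag)
        mapped.append((word, target))
    return mapped, unmapped


if __name__ == "__main__":
    tokens = [("घर", "NN"), ("है", "VAUX"), ("वाह", "XYZ")]
    result, missing = map_tags(tokens)
    print(result)
    print("tags without a mapping:", sorted(missing))
```

Where one ILMT tag corresponds to several Apertium tags, a plain dictionary is not enough and the mapping would need context or lexical information.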

WEEK 11: • Testing and improving the quality and coverage of the translation.

WEEK 12: • Testing the complete Machine Translation system on Wikipedia texts and evaluating.

Why should Google and Apertium Sponsor it?

Apart from Anusaaraka, there are no open-source resources for Indic languages, and Anusaaraka does not have an MT component. Apertium currently has no Indic language pairs other than Urdu-Hindi. This group of languages would therefore make a good addition to Apertium and expand its horizons, and the pair would act as a building block for other language pairs in the Indic group. Google is an organization that works for the benefit of society, and it does not have a Hindi-Marathi language pair, so Google may find this project interesting and adopt this work in its translator toolkit. The project is also beneficial to society, as explained in the next section.

How and who will it benefit in Society?

Marathi is the 4th most spoken language in India. The Mahabhasya by Patanjali is available in Marathi but not in Hindi, so it is not accessible to the Hindi-speaking population of the country. The Mahabhasya is only one example; there are many such cases in Marathi literature, and MT would help in all of them. The Hindi-Marathi MT system can also serve as a case study for building Telugu-Hindi, Kannada-Hindi and Punjabi-Hindi systems, because Anusaaraka systems are already available for these languages under the GPL.

List your skills and give evidence of your qualifications.

I am currently in my first year of the Integrated Masters Program in Economics at the University of Hyderabad, Hyderabad. I am good at shell scripting and Perl programming. I think I am a good manager and leader, so I can build a team to work on the project.

List any non-Summer-of-Code plans you have for the Summer

No, I have no other plans if my application is selected.

References:

Anusaaraka: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/anu/index.htm

Kulkarni, Amba. Anusaaraka: An Accessor cum Machine Translator. Speech at the "First Workshop on Free Rule Based MT", Alicante, Spain, 2 November 2009.

Bharati, Akshar, Amba P. Kulkarni, Dipti Misra Sharma. Anusaaraka: A better approach to Machine Translation (A case study for English-Hindi/Telugu). Presented at Language Technology Tools: Implementation of Telugu, a 3-day national conference, 8-10 October 2003, University of Hyderabad, Hyderabad.

Kulkarni, Amba P. Design and Architecture of anusAraka: An Approach to Machine Translation. Satyam Technical Review, vol. 3, October 2003.