Difference between revisions of "User:Oldtrafford.kedar"


Revision as of 12:17, 9 April 2010

Name:

KEDAR KULKARNI


E-mail address:


oldtrafford.kedar@gmail.com


Other information that may be useful for contact:


PH-NO: +919160011165



Why is it you are interested in machine translation?

Machine translation is a tool that gives access to texts in other languages, and I find the study of languages fascinating. My mother tongue is Marathi, but I was brought up in an environment where I was more exposed to Hindi than Marathi, so I mostly communicated in Hindi. However, I belong to a family dominated by Marathi culture, and I am interested in reading Marathi literature. MT is a tool that can help me understand Marathi literature. I have also been keenly following the advances in MT for a few years, which created an interest in developing MT systems.



Why is it that you are interested in the Apertium project?


Apertium is a machine translation platform. It provides an engine and a toolbox that allow you to build your own MT systems, and it is open source and open content. Since I am interested in building an MT system, I looked for available resources. Two that interested me were Anusaaraka and Apertium. Anusaaraka only gives language access and does not produce a translation. It is also not very user friendly, as using it requires proper training. Apertium, on the other hand, is very user friendly and can be used straight out of the box. So here is an opportunity to test the usability of Apertium on a closely related language pair such as Marathi-Hindi. Apertium is small and efficient, so closely related Indic languages should work well on it. Since no work has been done on Indic languages on the Apertium platform (except for Urdu-Hindi), I think this is an opportunity to show the usefulness of Apertium for Indic languages.



Which of the published tasks are you interested in?


Apertium: Machine Translation from Marathi to Hindi



What do you plan to do?

STATE OF THE ART:
1. Marathi morph analyzer with around 80% coverage on web text in Anusaaraka format.
2. Hindi morph analyzer with around 90% coverage on web text in Anusaaraka format.
3. Marathi-Hindi bilingual dictionary with around 15K headwords.
4. Working system of Marathi-Hindi Anusaaraka producing core Anusaaraka output.

COMMUNITY BONDING PERIOD:
• Learning the Apertium framework in general.
• Use of Apertium viewer.
• An overview of what is available in Anusaaraka.

WEEK 1:
• Developing programs for converting Anusaaraka morph analyzers to Apertium format.
• Building an Apertium morphological dictionary for the 5000 most frequent words of Marathi and Hindi.
• Converting WX resources to Unicode data.

WEEK 2 & WEEK 3:
• Checking the completeness of paradigms in Unicode format and providing missing paradigms, if any.
• Testing the morphological analyzers on various samples from Wikipedia to ensure that coverage is at least 80%.
• Adding enough entries from high-frequency words to reach 80% coverage for Marathi.
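The WX-to-Unicode conversion planned for week 1 could be sketched roughly as follows. This is a simplified illustration, not the Anusaaraka converter: the function name `wx_to_devanagari` is hypothetical, the tables cover only the core WX letters as commonly documented, and nukta forms and Marathi-specific letters are left out. Characters outside the tables are passed through unchanged.

```python
# Simplified WX -> Devanagari sketch (assumption: core WX letters only).
VIRAMA = "\u094d"  # suppresses the inherent 'a' of a consonant

CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}

# WX vowel -> (independent letter, dependent vowel sign)
VOWELS = {
    "a": ("अ", ""),          # inherent vowel: no sign after a consonant
    "A": ("आ", "\u093e"),
    "i": ("इ", "\u093f"),
    "I": ("ई", "\u0940"),
    "u": ("उ", "\u0941"),
    "U": ("ऊ", "\u0942"),
    "q": ("ऋ", "\u0943"),
    "e": ("ए", "\u0947"),
    "E": ("ऐ", "\u0948"),
    "o": ("ओ", "\u094b"),
    "O": ("औ", "\u094c"),
}

# anusvara, visarga, chandrabindu
SIGNS = {"M": "\u0902", "H": "\u0903", "z": "\u0901"}


def wx_to_devanagari(text):
    """Convert a WX-encoded string to Devanagari (core letters only)."""
    out = []
    pending = False  # True while the last output is a bare consonant
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # two consonants in a row form a cluster
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # bare consonant before a non-letter
            out.append(SIGNS.get(ch, ch))
            pending = False
    if pending:
        out.append(VIRAMA)
    return "".join(out)


print(wx_to_devanagari("marATI hiMxI"))  # मराठी हिंदी
```

A table-driven design keeps the mapping easy to audit and extend once the full WX inventory used in the Anusaaraka resources is known.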

WEEK 4: • Marathi-Hindi transfer rules. Since Marathi and Hindi are very similar, most of the work will be in t1x, a little in t2x and almost none in t3x.

DELIVERABLE AT THE END OF 4TH WEEK:- Marathi and Hindi morphological analyzers with standardized tagsets.

WEEK 5: • Developing a program to convert the Marathi-Hindi bilingual Anusaaraka dictionary to Apertium format.
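As a rough sketch of the week 5 conversion step, the snippet below turns word pairs into Apertium bilingual-dictionary entries of the form `<e><p><l>…</l><r>…</r></p></e>`. The input layout (one `marathi<TAB>hindi<TAB>tag` triple per line) is an assumption made for illustration; the real Anusaaraka dictionary format would need its own parser. The function names are likewise hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def dix_text(word):
    """Escape XML specials and encode spaces as <b/>, as .dix files do."""
    return escape(word).replace(" ", "<b/>")


def bidix_entry(source, target, tag):
    """Build one bidix entry with the same tag on both sides."""
    symbol = "<s n=%s/>" % quoteattr(tag)
    return "<e><p><l>%s%s</l><r>%s%s</r></p></e>" % (
        dix_text(source), symbol, dix_text(target), symbol)


def convert(lines):
    """Convert tab-separated triples (assumed layout) to bidix entries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip malformed lines instead of guessing
        entries.append(bidix_entry(*fields))
    return entries


sample = ["पाणी\tपानी\tn", "घर\tघर\tn"]
for entry in convert(sample):
    print(entry)
```

The entries still have to be placed inside a `<section>` of a complete `.dix` file whose `<sdefs>` declares every tag used.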

WEEK 6 & WEEK 7: • Ensuring that the Marathi-Hindi dictionary and the morphological analyzers' dictionaries cover the same words, and adding any that are missing.

WEEK 8: • Testing the bilingual dictionary on random Wiki pages to check that it reaches 80% coverage.

DELIVERABLE AT THE END OF 8TH WEEK:- Bilingual Dictionary with 80% coverage in Marathi ---> Hindi.
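The 80% coverage targets in this plan can be checked with a naive coverage count: the share of tokens that receive at least one analysis. The sketch below assumes analyzer output in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked as `^surface/*surface$`. The function name and the sample tags are illustrative only, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit: everything between '^' and the next '$'.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of lexical units with at least one analysis."""
    total = 0
    known = 0
    for unit in LEXICAL_UNIT.findall(analysed_text):
        parts = unit.split("/")
        if len(parts) < 2:
            continue  # not an analysed unit
        total += 1
        if not parts[1].startswith("*"):
            known += 1  # first analysis is not the unknown-word marker
    return known / total if total else 0.0


sample = "^घर/घर<n><sg>$ ^मोठे/*मोठे$ ^आहे/असणे<vblex><pres>$"
print("coverage: %.1f%%" % (100 * coverage(sample)))
```

Running this over analyzer output for a sample of Wikipedia text gives a single number to compare against the 80% target.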

WEEK 9 & WEEK 10:
• Training a POS tagger for both Marathi and Hindi.
• Developing a mapping from ILMT tags to Apertium tags and exploring the possibility of using the POS data of ILMT for training POS taggers for Marathi and Hindi.
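A tag mapping of the kind planned for weeks 9 and 10 might start as a simple lookup table. The pairs below are illustrative guesses at how a few ILMT part-of-speech tags could line up with commonly used Apertium tags; they are not an agreed mapping, and the real one would have to be worked out against both tagsets. The function name `map_tags` is hypothetical.

```python
# Illustrative ILMT -> Apertium tag pairs (assumed, not an official mapping).
ILMT_TO_APERTIUM = {
    "NN": "n",        # common noun
    "NNP": "np",      # proper noun
    "PRP": "prn",     # pronoun
    "VM": "vblex",    # main verb
    "VAUX": "vaux",   # auxiliary verb
    "JJ": "adj",      # adjective
    "RB": "adv",      # adverb
    "PSP": "post",    # postposition
    "QC": "num",      # cardinal number
}


def map_tags(tagged_tokens):
    """Map (word, ILMT tag) pairs; also report tags with no mapping yet."""
    mapped = []
    unmapped = set()
    for word, tag in tagged_tokens:
        target = ILMT_TO_APERTIUM.get(tag)
        if target is None:
            unmapped.add(tag)  # keep track so the table can be extended
        mapped.append((word, target))
    return mapped, unmapped


tokens = [("घर", "NN"), ("में", "PSP"), ("है", "VM"), ("!", "SYM")]
mapped, unmapped = map_tags(tokens)
print(mapped)
print("tags still to map:", sorted(unmapped))
```

Collecting the unmapped tags while converting makes it easy to see how much of the ILMT data could be reused for tagger training before the table is complete.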

WEEK 11: • Testing and improving the quality and coverage of the translation.

WEEK 12: • Testing the complete Machine Translation system on Wikipedia texts and evaluating.




Why should Google and Apertium Sponsor it?


There are no open-source systems for Indic languages except Anusaaraka, and Anusaaraka does not have an MT component. Apertium currently has no Indic language pairs apart from Urdu-Hindi. Hence these languages would make a good new group of Apertium systems and expand its horizons. The project would also act as a building block for other language pairs in the Indic language group.

Google is an organization which works for the benefit of society. Google doesn't have a Hindi-Marathi language pair, so it may find this project interesting and adopt this work in its Translator Toolkit. The project is also beneficial to society, as explained in the next section.


How and whom will it benefit in society?

Marathi is the 4th most spoken language in India. The Mahabhasya by Patanjali is available in Marathi but not in Hindi, so it is not accessible to the Hindi-speaking population of the country. The Mahabhasya is only one example; there are many such cases in Marathi literature, and MT would help in all of them. The Hindi-Marathi MT system can also serve as a case study for building Telugu-Hindi, Kannada-Hindi and Punjabi-Hindi systems, because Anusaaraka systems are already available for these languages under the GPL.



List your skills and give evidence of your qualifications.


I am currently in my first year of the Integrated Masters Program in Economics at the University of Hyderabad, Hyderabad. I am good at shell scripting and Perl programming. I think I am a good manager and leader, so I can build a team which will work on the project.




Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.


I am an 18-year-old student. I have just finished my schooling. I am fascinated by the open-source tools available on the web. With my very limited knowledge of shell and Perl programming, I was able to convert the Marathi-Hindi bilingual dictionary from the Anusaaraka format to the Apertium format in a couple of days. With this experience I am confident that during this summer I can contribute substantially by developing Marathi-Hindi Apertium using the resources from Marathi-Hindi Anusaaraka, both of which are available under the GPL.




Please list any non-Summer-of-Code plans you have for the Summer, especially employment and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.

I have no other plans if my application is selected.




REFERENCES:


Anusaaraka: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/anu/index.htm

Kulkarni, Amba. "Anusaaraka: An Accessor cum Machine Translator". Speech at the "First Workshop on Free Rule Based MT", Alicante, Spain, 2nd Nov 2009.

Bharati, Akshar, Amba P. Kulkarni, and Dipti Misra Sharma. "Anusaaraka: A better approach to Machine Translation (A case study for English-Hindi/Telugu)". Presented at Language Technology Tools: Implementation of Telugu, a 3-day national conference, 8-10 October 2003, University of Hyderabad, Hyderabad.

Kulkarni, Amba P. "Design and Architecture of anusAraka: An Approach to Machine Translation". Satyam Technical Review, vol. 3, Oct 2003.