Ankush/Application

From Apertium
Revision as of 17:57, 3 April 2010 by Ankush (talk | contribs)

Name

Ankush Gupta

E-mail address

ankushgupta31089@gmail.com
ankushgupta@students.iiit.ac.in

Address

Room No - 240
International Institute of Information Technology (IIIT) Hyderabad
Gachibowli, Hyderabad - 500 032
Andhra Pradesh
India

Interest in Machine Translation

The whole idea of translating text from one natural language to another excites me. I have been working in this field for about a year. MT can work wonders, especially for Indian languages, because far less data is available in languages like Hindi, Urdu, Marathi and Bengali than in English and other foreign languages. I am especially interested in MT systems where the source language is English and the target languages are Indian languages. It is impossible to manually translate the enormous amount of information available from one language to another. MT output is less accurate than human translation, but in many cases it is good enough to be understood, and so it serves the purpose.


Interest in Apertium Project

I have an interest in Natural Language Processing, and I cannot think of a better place than Apertium to work in this area. Machine Translation is a challenging task, as is evident from the history of MT, and I am excited to take the current state of the art a step further. Open source has always excited me, and I think this is the best opportunity one can get. I believe open source is the only way to improve MT systems, because progress needs quality linguistic data and excellent knowledge of the languages, in addition to good knowledge of the field.


Task Of Interest

Improving the existing English to Hindi wordlist and transfer rules available in Apertium.


Reasons for Google and Apertium to sponsor it

Apertium is a shallow-transfer machine translation system. Currently, no Indian language is supported at http://www.apertium.org/. Hindi is in the incubator, and I would like to see users able to translate into an Indian language as well. Since the population of India is huge and there is an urgent need to take advantage of the enormous amount of data available in English, this task is very important. Google has been a supporter of open source projects. Through Google Summer of Code, more source code is created and released for the use and benefit of all.


Benefit to society

This project will be very useful to the large percentage of the population that understands Hindi but not English. They will be able to use free and easily available software like Apertium to translate the vast amount of text available in English into Hindi.


Work plan

Goal

Improve the existing English to Hindi wordlist and transfer rules available in Apertium.


Understanding of the Problem

The first thing to be done is to understand the scope for improvement in the current wordlist and transfer rules. I looked at the Hindi resources available in the incubator (http://wiki.apertium.org/wiki/Incubator). The POS-tagged English to Hindi wordlist contains only 25,424 words, whereas the English WordNet contains about 150,000 words. So, a lot of words need to be added to improve the translation. I also looked at the transfer rules, but understanding them will take some time.

Actually the opposite is the case, a lot of words need to be removed for the translation to be improved. If you don't understand this, you should probably come and talk to us on IRC. - Francis Tyers 13:40, 3 April 2010 (UTC)
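Whether words end up being added or removed, the first step is an audit of the existing wordlist. A minimal sketch of such an audit follows, assuming a tab-separated layout (English word, POS tag, Hindi word). That layout, the sample entries, and the function name `audit_wordlist` are all hypothetical and are not taken from the incubator files.

```python
from collections import Counter

# Hypothetical sample data; the real incubator wordlist may use a different layout.
SAMPLE = (
    "house\tn\tघर\n"
    "house\tn\tमकान\n"
    "run\tvblex\tभाग\n"
    "run\tvblex\tभाग\n"
    "good\tadj\tअच्छा\n"
)


def audit_wordlist(text):
    """Count entries per POS tag and collect entries that occur more than once."""
    pos_counts = Counter()
    seen = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        english, pos, hindi = fields
        pos_counts[pos] += 1
        seen[(english, pos, hindi)] += 1
    # An entry is a duplicate if the same (English, POS, Hindi) triple repeats.
    duplicates = sorted(entry for entry, n in seen.items() if n > 1)
    return pos_counts, duplicates


pos_counts, duplicates = audit_wordlist(SAMPLE)
print(dict(pos_counts))
print(duplicates)
```

Per-tag counts show which categories dominate the list, and the duplicate list gives a first set of candidates for clean-up.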


Week 1-2 : Understanding the deficiencies in the currently available dictionaries and the existing transfer rules.
Week 3 : Expanding the currently available English monolingual dictionary (if there is scope).
Week 4 : Expanding the currently available Hindi monolingual dictionary.
Deliverable 1 : Modified English and Hindi monolingual dictionaries.

Week 5 : Expanding the currently available bilingual dictionary.
Week 6 : Expanding the currently available bilingual dictionary.
Week 7 : Adding multi-words to the English monolingual dictionary, the Hindi monolingual dictionary and the bilingual dictionary.
Week 8 : Correcting the current transfer rules.
Deliverable 2 : Final POS tagged English to Hindi wordlist and final monolingual dictionaries.

Week 9 : Correcting the current transfer rules.
Week 10 : Adding new and efficient transfer rules.
Week 11 : Detecting errors and correcting them.
Week 12 : Detecting errors and correcting them.
Project Completed

Skill set and qualifications: I am a third-year B.Tech student in Computer Science and Engineering at the International Institute of Information Technology (IIIT) Hyderabad, one of the esteemed institutions of our country. I am among the top 10 students of my batch and my CGPA is 9.0 (out of 10). I have taken courses such as "Introduction to Natural Language Processing", "Natural Language Processing Applications", "Artificial Intelligence", "Algorithms" and "Data Structures". I have good programming skills and have done several projects in C, C++, Python and Java. I have also done projects in MT evaluation and paragraph alignment, so I have a good knowledge of the field. My mother tongue is Hindi, and I am fluent in both English and Hindi. I have a fair knowledge of XML from courses such as "IT Workshop". I am in the Dual Degree programme and will be doing my M.S. in Machine Translation at the Language Technology Research Centre (LTRC), IIIT Hyderabad, which is one of the best language processing labs.

I have no other engagements this summer, so I look forward to devoting myself entirely to this project. I promise to be punctual and consistent with my work. I will not let you down.