User:Nikant/GsocApplication

Name

Nikant Vohra

Contact information

E-mail: nikantv@iitrpr.ac.in, nikantvohra@gmail.com
Skype: nikant.vohra

Why are you interested in machine translation?

I have been interested in machine translation since I took a course on Natural Language Processing as part of my university coursework. The field fascinated me so much that I started reading about it on my own. I studied the online video lectures of the Stanford University course and also took a course on neurolinguistics. I loved the idea of a computer understanding and interpreting the grammar and other aspects of a language just like a human. I then started experimenting with machine learning and completed a project on Optical Character Recognition of Devanagari script (Hindi). In this era of technological advancement, I think machine translation will play a very important role in bringing the world together. Machine translation systems will reduce the need for human translators and make communication between people of different nationalities much easier. Moreover, the existing machine translation systems for less popular languages are not good enough, so there is a lot of scope for improvement and innovation in this field.

Why are you interested in the Apertium project?

Because of my interest in machine translation, I started looking for open source projects in the field. My main goal was to develop a system for Hindi to English translation, as the existing systems do not perform well. I came to know about this project from a batchmate who completed a GSoC project with Apertium last year. I started going through the Apertium wiki documentation and found it very easy to understand and follow. I experimented with the dictionaries and transfer rules and was surprised to see that desirable results can be achieved with only minor additions. I loved the helpful Apertium community, which was willing to answer all of my questions. Apertium is doing a wonderful job of keeping some minor languages alive, and I would love to contribute to this project.

Why should Google and Apertium sponsor it?

Hindi is one of the most widely spoken languages in the world, spoken by 4.46% of the world population (source: Wikipedia). The existing machine translation systems for this pair do not work very well. The Hindi-English language pair is still in the incubator stage in the Apertium directory and needs a lot of improvement before it can be released. Releasing this pair could significantly increase the number of people using Apertium. It would also help the large number of people who know one of these languages to communicate and exchange ideas.

How and whom will it benefit in society?

English is spoken by quite a lot of people in India, and the major educational institutes of India teach in English. However, a majority of the population can read and write only in Hindi, so language acts as a barrier for people here towards the study of technology. The study of technology is very important for solving the ever-growing social and economic problems in India. A robust system for English-Hindi translation could benefit thousands of people in India. It can also help people in other countries, as a number of Hindi texts are popular all over the world. An online system that translates these texts from Hindi into English could help a large number of people understand and make use of them.

Which of the published tasks are you interested in? What do you plan to do?

The project I'd like to work on is Hindi-English language pair machine translation for Apertium.

Some work has already been done on this language pair, and I would like to make it available for release by the end of the coding period. It is currently in the incubator stage, with the Hindi dictionary in WX format. Part of this dictionary has already been converted to Unicode, but a major part is still left. The dictionary has no support yet for verbs and pronouns. The bilingual dictionary has about 19000 words, but many more need to be added. The Apertium tagger needs to be trained on Hindi corpora. The transfer rules also need a lot of improvement, and support for multiwords needs to be added. I tested the existing Hindi analyzer on a Hindi corpus obtained from online resources: the coverage of the morphological analyzer came out to be around 71%, while that of the bilingual translator was around 57%. As part of the coding challenge, I also analyzed the corpus to get a list of high-frequency unknown Hindi words. These are the results for the top 20 high-frequency unknown words in my corpus:

Word in Hindi	Frequency in Corpus	Part of Speech	Translation in English
इस	14776	pronoun	this
किया	11398	verb	did
गया	11174	verb	went
तथा	10381	conjunction	and
जो	8671	pronoun	who
थी	7886	verb	was
थे	7882	verb	were
जाता	6586	verb	goes
तक	6118	postposition	till
जा	5669	verb	go
मैं	5391	pronoun	I
किसी	4938	pronoun	someone
कहा	4424	verb	said
गई	4375	verb	went
उस	4188	pronoun	that
एवं	3974	conjunction	and
द्वारा	3850	postposition	by
जाती	3654	verb	goes
हम	3350	pronoun	we
दी	2773	verb	gave
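
As a rough illustration of how coverage figures and an unknown-word list like the one above can be computed, here is a minimal Python sketch. It assumes the analyser output is in the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word carries an analysis starting with an asterisk. The function name, the sample string and the tags in it are hypothetical, and this is not the script that produced the numbers above.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage_and_unknowns(analysed_text):
    """Return (coverage, Counter of unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # The analyser marks an unknown word by starting its analysis with '*'.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    coverage = known / total if total else 0.0
    return coverage, unknown


# Hypothetical analyser output; the tags are illustrative only.
sample = "^यह/यह<prn><dem>$ ^इस/*इस$ ^घर/घर<n><m><sg>$ ^किया/*किया$ ^इस/*इस$"
coverage, unknown = coverage_and_unknowns(sample)
print(f"coverage: {coverage:.0%}")
for word, freq in unknown.most_common(20):
    print(word, freq)
```

Sorting the unknown words by frequency shows which dictionary entries would raise coverage the most.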

Wikipedia gives a large list of resources available for Hindi. I intend to use Shabdanjali, an offline Hindi to English dictionary, for the bilingual dictionary. This dictionary is available in Unicode, which suits the project. Various online morphological taggers are available for Hindi and can be used to create the monodix. The Hindi-Urdu language pair, which is in the nursery stage, can also be used for developing the monodix. There is an online tagged Hindi corpus made available by the LTRC group at IIITH that can be used for training the Apertium tagger on Hindi text.

English and Hindi have a lot in common: most of the building blocks of grammar are the same in both languages. There are also many differences, however, in the inflection of various parts of speech. I intend to make full use of the similarities when writing transfer rules for these languages.

I have already become quite familiar with the Apertium framework. I completed the Hindi-English coding challenge set by the Apertium community. I added support for many verbs, pronouns and adjectives to both the monolingual and bilingual dictionaries. I also went through a lot of Apertium documentation on morphological disambiguation, writing transfer rules and testing the dictionaries. I think I will be able to grasp the remaining concepts required for this project before the coding period starts.

Work plan

Coding challenge

set up work environment (installation and configuration)
check what has already been done (study monodices from Hindi-English and Hindi-Urdu language pairs)
prepare a list of resources that can be used for the project
calculate the coverage of the Hindi analyser and Hindi-English bilingual dictionary
generate a frequency list of unknown words

Community Bonding Period

study Hindi and English language rules thoroughly
get monolingual and multilingual aligned corpora for further analysis
prepare a list of words sorted by frequency of occurrence for Hindi dictionary (to acquire at least 90% coverage)
learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
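
The frequency-sorted word list for a coverage target could be prepared along these lines. This is only a sketch under simple assumptions: the corpus is already tokenised into a list of words, and coverage is measured as the share of running tokens accounted for by the selected words. The function name and the toy corpus are hypothetical.

```python
from collections import Counter


def words_for_coverage(tokens, target=0.9):
    """Return the most frequent words, in order, until they cover `target` of the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    selected = []
    covered = 0
    for word, freq in counts.most_common():
        # Stop as soon as the words chosen so far reach the target share.
        if covered / total >= target:
            break
        selected.append(word)
        covered += freq
    return selected


# Toy corpus: 6 + 3 + 1 = 10 tokens.
tokens = ["इस"] * 6 + ["किया"] * 3 + ["गया"]
top = words_for_coverage(tokens, target=0.9)
print(top)
```

On a real corpus the length of the returned list gives an estimate of how many entries the dictionary needs before the target is reached.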

Week1

write test scripts (make use of the existing language-pair regression and corpus tests)
add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

work on Hindi monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

work on Hindi-English bilingual dictionary; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

add the rest of the words

Deliverable1: Dictionaries covering most of the words for both languages

Week5

gather translational data with the use of parallel corpora
add basic transfer rules for the purpose of testing, verify the tag definition files

Week6

work further on bilingual dictionary
work further on the transfer rules and morphological disambiguation

Week7

prepare a list of word sequences that frequently appear together for Hindi (use the Apriori algorithm to find frequent sets)
add multiwords with translations to the dictionaries
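
One possible way to collect multiword candidates is sketched below. It is an Apriori-style simplification rather than the classic itemset algorithm: it works on contiguous word sequences and counts an n-word sequence only if both of its (n-1)-word sub-sequences were already frequent. The function name, thresholds and sample sentences are hypothetical.

```python
from collections import Counter


def frequent_ngrams(sentences, min_support=2, max_len=3):
    """Return {ngram: count} for contiguous n-grams (n >= 2) seen at least min_support times."""
    frequent = {}
    # Level 1: frequent single words.
    counts = Counter((w,) for s in sentences for w in s)
    previous = {g for g, c in counts.items() if c >= min_support}
    for n in range(2, max_len + 1):
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                gram = tuple(s[i:i + n])
                # Apriori pruning: both (n-1)-gram sub-sequences must be frequent.
                if gram[:-1] in previous and gram[1:] in previous:
                    counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_support}
        if not previous:
            break
        frequent.update({g: counts[g] for g in previous})
    return frequent


# Toy tokenised sentences, for illustration only.
sentences = [
    "के द्वारा किया गया".split(),
    "उस के द्वारा कहा गया".split(),
    "किया गया था".split(),
]
result = frequent_ngrams(sentences, min_support=2)
for gram, count in sorted(result.items(), key=lambda kv: -kv[1]):
    print(" ".join(gram), count)
```

The candidates found this way would still need manual review before being added to the dictionaries as multiwords.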

Week8

bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed and some morphological disambiguation done

Week9

obtain hand-tagged training corpora
study the word order rules for Hindi and English (identify restrictions)
work on tag definition files
carry out supervised tagger training for the languages

Week10

extract segments of the parallel corpora that are translated (more or less) literally
work on transfer rules

Week11

carry out thorough regression tests
check dictionaries manually to spot possible errors

Week12

clean up, evaluation of results and documentation

Project completed

Throughout the work, the quality of translations will be controlled by means of regression and vocabulary tests. I will seek feedback on the work at every stage and report progress on a dedicated wiki page.

List your skills and give evidence of your qualifications

I am currently a fourth-year student pursuing a Bachelor's degree in Electrical Engineering at the Indian Institute of Technology Ropar. I have always been interested in computer science and have taught myself to code by working on different projects. I am quite comfortable with C, Java, C++, Android programming and Python. I have also completed courses in algorithms and data structures, operating systems, computer networks, natural language processing and web development. I have worked on quite a few projects as part of my coursework.

I did a research internship last summer at Aston University, UK, as part of an exchange program. I worked there on TinyOS, an open source project. I built a fall detection system for elderly people using wireless sensor networks. I was able to improve the accuracy of previously known fall detection algorithms by about 7 percent. The coding for the sensor networks was mainly done in C, and the web interface was built using Java.

I did a machine learning project on Optical Character Recognition of Hindi text. The project made use of neural networks, which were mainly coded in Matlab. I also developed an Android app for fast file sharing using Wi-Fi Direct as part of my computer networks course. As I worked in a team on these projects, I became familiar with tools like git and svn.

I am quite fascinated by the field of Natural Language Processing, and I really enjoy learning about new languages. I am quite good at Hindi and English and know a little bit of Punjabi and Haryanvi. I think I have the skills required to complete this project successfully.

My non-Summer-of-Code plans for the Summer

I have no plans for the summer other than the GSoC program. I will complete my Bachelor's degree by the end of May, and after that I have no work commitments at present. I can contribute around 40 hours per week to this program.