Difference between revisions of "User:Nikant/GsocApplication"

Revision as of 12:23, 29 April 2013

Name

Nikant Vohra

Contact information

E-mail: nikantv@iitrpr.ac.in
Skype: nikant.vohra


Why are you interested in machine translation?

I have been interested in machine translation since I took a course on Natural Language Processing as part of my university coursework. I was so fascinated by the field that I started reading about it on my own. I studied the online video lectures of the Stanford NLP course and also took a course on neurolinguistics. I loved the idea of a computer understanding and interpreting the grammar and other aspects of a language just like a human. I then started experimenting with machine learning and completed a project on optical character recognition of the Devanagari script (Hindi). I think that in this era of technological advancement, machine translation will play a very important role in bringing the world together. Machine translation systems will reduce the need for human translators and make communication between people of different nationalities much easier. Also, the existing machine translation systems for less popular languages are not good enough, so there is a lot of scope for improvement and innovation in this field.

Why are you interested in the Apertium project?

Due to my interest in machine translation I started looking for open-source projects in the field. I particularly wanted to develop a system for Hindi to English translation, as the existing systems for this pair do not perform well. I came to know about the project from a batchmate who had completed a GSoC project with Apertium last year. I started going through the Apertium wiki documentation and found it very easy to understand and follow. I experimented with the dictionaries and transfer rules and was surprised to see how desirable results can be achieved with only minor additions. I loved the helpful Apertium community, who were willing to answer any of my questions. Apertium is doing a wonderful job of keeping some minor languages alive, and I would love to contribute to this project.

Why should Google and Apertium sponsor it?

Hindi is one of the most widely spoken languages in the world, covering 4.46% of the world population (source: Wikipedia). The existing systems for machine translation of this pair do not work very well, and the Hindi-English language pair still lies in the incubator stage in Apertium. It needs a lot of improvement before it can be made available for release. If this pair is released, it could significantly increase the number of people using Apertium, and it would help the large number of people who know one of these languages to communicate and exchange ideas.


Who will it benefit in society, and how?

English is spoken by quite a lot of people in India, and the major educational institutes of India teach in English, but a majority of the population can only read and write in Hindi. Language therefore acts as a barrier to the study of technology, which is very important for solving the ever-growing social and economic problems in India. If a robust system for English-Hindi conversion is introduced, it can benefit thousands of people in India. It can also be helpful for people of other countries, as a number of Hindi texts are very popular all over the world; an online system to convert these texts to English could help a large number of people to understand and make use of them.

Which of the published tasks are you interested in? What do you plan to do?

The project I'd like to work on is Hindi-English language pair machine translation for Apertium.

Some work has already been done for this language pair, and I would like to make it available for release by the end of the coding period. It currently lies in the incubator stage, with the Hindi dictionary in the WX format. Some conversion of this dictionary to Unicode has already been done, but a major part is still left, and there is still no support for verbs and pronouns in the dictionary. The bilingual dictionary covers about 19,000 words, but many more need to be added for it to work well. The Apertium tagger needs to be trained on a Hindi corpus, the transfer rules need considerable improvement, and support for multiwords needs to be added. I tested the existing Hindi analyser on a Hindi corpus obtained from online resources: coverage for the morphological analyser came out to be around 71%, while for the bilingual dictionary it was around 57%. I also analysed the corpus to get a list of high-frequency unknown Hindi words as part of the coding challenge. These are the results I obtained for the high-frequency unknown words in my corpus:


{| class="wikitable" border="1"
|-
! Word in Hindi
! Frequency in Corpus
! Part of Speech
! Translation in English
|-
| इस
| 14776
| pronoun
| this
|-
| किया
| 11398
| verb
| did
|-
| गया
| 11174
| verb
| went
|-
| तथा
| 10381
| conjunction
| and
|-
| जो
| 8671
| pronoun
| who
|-
| थी
| 7886
| verb
| was
|-
| थे
| 7882
| verb
| were
|-
| जाता
| 6586
| verb
| goes
|-
| तक
| 6118
| noun
| till
|-
| जा
| 5669
| verb
| go
|-
| मैं
| 5391
| pronoun
| I
|-
| किसी
| 4938
| pronoun
| someone
|-
| कहा
| 4424
| verb
| said
|-
| गई
| 4375
| verb
| went
|-
| उस
| 4188
| pronoun
| his
|-
| एवं
| 3974
| conjunction
| and
|-
| द्वारा
| 3850
| preposition
| by
|-
| जाती
| 3654
| verb
| went
|-
| हम
| 3350
| pronoun
| we
|-
| दी
| 2773
| verb
| give
|}
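The coverage figures quoted above can be reproduced with a very small script once the analyser output has been captured. A minimal sketch, assuming tokens the analyser fails to recognise carry a leading `*` (Apertium's convention for unknown words passing through the pipeline); the sample tokens are invented:

```python
def coverage(tokens):
    """Fraction of tokens the analyser recognised.

    Assumes unknown surface forms carry a leading '*', the convention
    Apertium uses for words that pass through unanalysed.
    """
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if not t.startswith("*"))
    return known / len(tokens)

if __name__ == "__main__":
    # 4 of the 6 tokens are known here.
    sample = ["मैं", "*xyz", "घर", "जाता", "*abc", "हूँ"]
    print(round(coverage(sample) * 100, 1))  # 66.7
```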

Wikipedia gives a huge list of resources available for Hindi scripts. I intend to use Shabdanjali, an offline Hindi to English dictionary, for the bilingual dictionary; it is available in a Unicode format appropriate for the project. Various online morphological taggers are available for Hindi which can be used for the creation of the monodix. The Hindi-Urdu language pair, which lies in the nursery stage, can also be used for development of the monodix. There is an online Hindi corpus made available by the LTRC group at IIITH that can be used for training the Apertium tagger on Hindi text.

The English and Hindi languages have a lot in common: most of the building blocks of grammar are the same in both. But there are many differences as well, in terms of the inflection of various parts of speech. I intend to make full use of these similarities when writing appropriate transfer rules for this pair.
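One concrete difference the transfer rules have to handle is basic word order: Hindi is predominantly SOV while English is SVO, so the verb group has to move. A toy illustration of that reordering (the one-word subject and the `n`/`v` tags are my simplifying assumptions, not Apertium's actual tagset or chunking):

```python
def sov_to_svo(tagged):
    """Toy reordering of a (word, tag) clause from SOV to SVO.

    Simplifying assumptions: the verb group is everything tagged 'v',
    and the subject is the first non-verb token. Real transfer rules
    operate on chunks and handle far more structure than this.
    """
    verbs = [wt for wt in tagged if wt[1] == "v"]
    rest = [wt for wt in tagged if wt[1] != "v"]
    if not verbs or not rest:
        return tagged
    # Subject stays first; the verb group moves directly after it.
    return rest[:1] + verbs + rest[1:]

if __name__ == "__main__":
    # "राम फल खाता" (Ram fruit eats) comes out in "Ram eats fruit" order.
    clause = [("राम", "n"), ("फल", "n"), ("खाता", "v")]
    print(sov_to_svo(clause))
```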

I have already become quite familiar with the Apertium framework. I completed the Hindi-English coding challenge set by the Apertium community, adding support for many verbs, pronouns and adjectives to both the monolingual and bilingual dictionaries. I also went through a lot of Apertium documentation on morphological disambiguation, writing transfer rules and testing the dictionaries. I think I will be able to grasp the remaining concepts required for this project before the coding period starts.

Work plan

Coding challenge

  • set up work environment (installation and configuration)
  • check what has already been done (study monodices from Hindi-English and Hindi-Urdu language pairs)
  • prepare a list of resources that can be used for the project
  • calculate the coverage of the Hindi analyser and Hindi-English bilingual dictionary
  • generate a frequency list of unknown words
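The last step of the coding challenge, generating the frequency list, needs only a few lines of Python over the analysed corpus. A minimal sketch, assuming the Apertium convention of marking unknown surface forms with a leading `*` (the sample tokens are invented):

```python
from collections import Counter

def unknown_frequencies(tokens):
    """Rank the surface forms the analyser left unknown, most frequent
    first. Assumes unknown tokens carry a leading '*'."""
    counts = Counter(t.lstrip("*") for t in tokens if t.startswith("*"))
    return counts.most_common()

if __name__ == "__main__":
    tokens = ["*इस", "घर", "*इस", "*तथा", "जाता"]
    print(unknown_frequencies(tokens))  # [('इस', 2), ('तथा', 1)]
```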

Community Bonding Period

  • study Hindi and English language rules thoroughly
  • get monolingual and multilingual aligned corpora for further analysis
  • prepare a list of words sorted by frequency of occurrence for Hindi dictionary (to acquire at least 90% coverage)
  • learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice

Week1

  • write test scripts (make use of the existing language-pair regression and corpus tests)
  • add the missing close-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries
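The test scripts could follow the usual regression pattern: a list of (source, expected) pairs run through the pipeline, with mismatches collected for inspection. A minimal sketch; the `translate` hook here is a placeholder for a call into the Apertium pipeline, not a real API:

```python
def run_regression(cases, translate):
    """Run each (source, expected) pair through `translate` and collect
    mismatches. `translate` is a stand-in for an invocation of the
    translation pipeline."""
    failures = []
    for source, expected in cases:
        got = translate(source)
        if got != expected:
            failures.append((source, expected, got))
    return failures

if __name__ == "__main__":
    # With an identity "translator", only the pair that differs fails.
    cases = [("ghar", "ghar"), ("yah", "this")]
    print(run_regression(cases, lambda s: s))  # [('yah', 'this', 'yah')]
```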

Week2

  • work on Hindi monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

  • work on Hindi-English bilingual dictionary; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

  • add the rest of the words

Deliverable1: Dictionaries covering most of the words for both languages

Week5

  • gather translational data with the use of parallel corpora
  • add basic transfer rules for the purpose of testing, verify the tag definition files


Week6

  • work further on bilingual dictionary
  • work further on the transfer rules and morphological disambiguation

Week7

  • prepare a list of word sequences that frequently appear together for Hindi (use Apriori algorithm to find frequent sets)
  • add multiwords with translations to the dictionaries
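The frequent-set search above can be done with a small Apriori-style pass over the corpus: a candidate (n+1)-gram is only counted if both of its n-gram sub-sequences already met the support threshold, which is the Apriori pruning step. A sketch (restricting candidates to contiguous word sequences, and the threshold values, are my simplifying assumptions):

```python
from collections import Counter

def frequent_ngrams(tokens, min_support, max_n=4):
    """Apriori-style search for frequent contiguous word sequences.

    A candidate (n+1)-gram is counted only if both of its n-gram
    sub-sequences already met the support threshold (Apriori pruning).
    Returns a dict mapping each frequent n-gram tuple to its count.
    """
    frequent = {}
    # Level 1: frequent unigrams.
    level = Counter((t,) for t in tokens)
    level = {g: c for g, c in level.items() if c >= min_support}
    n = 1
    while level and n < max_n:
        frequent.update(level)
        counts = Counter()
        for i in range(len(tokens) - n):
            gram = tuple(tokens[i:i + n + 1])
            # Prune: both n-gram sub-sequences must already be frequent.
            if gram[:-1] in level and gram[1:] in level:
                counts[gram] += 1
        level = {g: c for g, c in counts.items() if c >= min_support}
        n += 1
    frequent.update(level)
    return frequent

if __name__ == "__main__":
    tokens = "के लिए के लिए के साथ".split()
    # ('के', 'लिए') comes out as a frequent pair.
    print(frequent_ngrams(tokens, 2))
```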

Week8

  • bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed and some Morphological disambiguation done

Week9

  • obtain hand-tagged training corpora
  • study the word order rules for Hindi and English (identify restrictions)
  • work on tag definition files
  • carry out supervised tagger training for the languages
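Supervised training of an HMM tagger such as apertium-tagger essentially reduces to collecting tag-transition and word-emission counts from the hand-tagged corpus, from which the model's probabilities are then estimated. A minimal sketch of that counting step (the tag names in the example are invented, not a real tagset):

```python
from collections import Counter

def count_hmm_statistics(tagged_sentences):
    """Collect the raw counts needed for supervised HMM tagger training:
    tag-to-tag transition counts and tag-to-word emission counts."""
    transitions = Counter()  # (previous_tag, tag) -> count
    emissions = Counter()    # (tag, word) -> count
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-boundary pseudo-tag
        for word, tag in sentence:
            transitions[(prev, tag)] += 1
            emissions[(tag, word)] += 1
            prev = tag
    return transitions, emissions

if __name__ == "__main__":
    # Invented tags: 'prn' pronoun, 'vb' verb.
    corpus = [[("मैं", "prn"), ("जाता", "vb")]]
    trans, emit = count_hmm_statistics(corpus)
    print(trans[("prn", "vb")])  # 1
```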

Week10

  • extract segments of the parallel corpora that are translated (more or less) literally
  • work on transfer rules

Week11

  • carry out thorough regression tests
  • check dictionaries manually to spot possible errors

Week12

  • clean up, evaluation of results and documentation

Project completed

Throughout the work, the quality of translations will be controlled by means of regression and vocabulary tests. The work will be consulted on at every stage, and progress will be reported on a dedicated wiki page.

List your skills and give evidence of your qualifications

I am currently a fourth-year student of the Bachelor of Electrical Engineering program at the Indian Institute of Technology Ropar. I have always been interested in the field of computer science and learned to code on my own by working on different projects. I am quite comfortable with C, C++, Java, Python and Android programming. I have also completed courses in algorithms and data structures, operating systems, computer networks, natural language processing and web development, and have worked on quite a few projects as part of my coursework.

I did a research internship last summer at Aston University, UK, as part of an exchange program. There I worked on TinyOS, an open-source project, building a fall-detection system for elderly people using wireless sensor networks. I was able to improve the accuracy of previously known fall-detection algorithms by about 7 percent. The coding for the sensor networks was mainly done in C, and the web interface was built using Java.

I did a machine-learning project on optical character recognition of Hindi text, which made use of neural networks mainly coded in Matlab. I also developed an Android app for fast file sharing using Wi-Fi Direct as part of my computer networks course. As I worked in teams on these projects, I became familiar with tools like git and svn.

I am quite fascinated by the field of natural language processing and really enjoy learning about new languages. I am quite good at Hindi and English and know a little Punjabi and Haryanavi. I think I have the skills required to complete this project successfully.

My non-Summer-of-Code plans for the Summer

I have no plans for the summer other than the GSoC program. I will complete my Bachelor's degree by the end of May, and after that I have no other work commitments. I can contribute around 40 hours per week to the program.