Difference between revisions of "User:Nikant/GsocApplication"

Revision as of 15:26, 23 April 2013

Name

Nikant Vohra

Contact information

E-mail: nikantv@iitrpr.ac.in
Skype: nikant.vohra

Why are you interested in machine translation?

I have been interested in machine translation since I took a course on Natural Language Processing as part of my university coursework. I was so fascinated by the field that I started reading about it on my own. I studied the online video lectures of the Stanford University course and also took a course on neurolinguistics. I loved the idea of a computer understanding and interpreting the grammar and other aspects of a language just like a human. I then started experimenting with machine learning and completed a project on Optical Character Recognition of Devanagari script (Hindi). I think that in this era of technological advancement, machine translation will play a very important role in bringing the world together. Machine translation systems will reduce the need for human translators and make communication between people of different nationalities much easier. Also, the existing machine translation systems for less popular languages are not good enough, so there is a lot of scope for improvement and innovation in this field.

Why are you interested in the Apertium project?

Because of my interest in machine translation, I started looking for open source projects in the field. I wanted to develop a system for Hindi to English translation, as the existing systems for this pair do not perform well. I came to know about this project from a batchmate who completed a GSoC project with Apertium last year. I started going through the Apertium wiki documentation and found it very easy to understand and follow. I experimented with the dictionaries and transfer rules and was surprised to see how good results can be achieved with only minor improvements. I also loved the helpful Apertium community, which was willing to answer all of my questions. I think Apertium is doing a wonderful job of keeping some minor languages alive, so I would love to contribute to this project.

Why Google and Apertium should sponsor it?

Hindi is one of the most widely spoken languages in the world, spoken by 4.46% of the world population (source: Wikipedia). The existing machine translation systems for this pair do not work very well. The Hindi-English language pair is still in the incubator stage in the Apertium directory and needs a lot of improvement before it can be released. If this pair is released, it can significantly increase the number of people using Apertium. It would also help the large number of people who know only one of these languages to communicate and exchange ideas.

How and who it will benefit in society?

English is spoken by quite a lot of people in India, and the major educational institutes of India teach in English. However, a majority of the population can only read and write in Hindi, so language acts as a barrier to the study of technology. The study of technology is very important for solving the ever-growing social and economic problems in India. If a robust system for English-Hindi conversion is introduced, it can benefit thousands of people in India. It can also help people in other countries, as a number of Hindi texts are popular all over the world. If an online system to convert these texts from Hindi is introduced, it could help a large number of people to understand and make use of them.

Which of the published tasks are you interested in? What do you plan to do?

The project I'd like to work on is Hindi-English language pair machine translation for Apertium.

Some work has already been done for this language pair. I would like to make it available for release by the end of the coding period. It is currently in the incubator stage, with the Hindi dictionary in the WX format. Some of this dictionary has already been converted to Unicode, but a major part is still left. There is still no support for verbs and pronouns in the dictionary. The bilingual dictionary covers about 19,000 words, but many more words need to be added to make it usable. The apertium-tagger needs to be trained on Hindi corpora. The transfer rules also need a lot of improvement, and support for multiwords needs to be added. Some constraint grammar rules also need to be written to improve the quality of the output.

Wikipedia gives a long list of resources available for Hindi. I intend to use Shabdanjali, an offline Hindi to English dictionary, for the bilingual dictionary. This dictionary is available in Unicode format, which is appropriate for the project. Various online morphological taggers are available for Hindi and can be used for the creation of the monodix. The Hindi-Urdu language pair, which is in the nursery stage, can also be used for development of the monodix. There is an online Hindi corpus made available by the LTRC group at IIITH that can be used for training the Apertium tagger on Hindi text.

English and Hindi have a lot in common. Most of the building blocks of grammar are the same in both languages, but there are also many differences in the inflection of various parts of speech. I intend to make full use of the similarities in order to write appropriate transfer rules for these languages.

I have already become quite familiar with the Apertium framework. I completed the Hindi-English coding challenge set by the Apertium community. I added support for many verbs, pronouns and adjectives to both the monolingual and bilingual dictionaries. I also went through a lot of the Apertium documentation on morphological disambiguation, writing transfer rules and writing constraint grammars. I think I will be able to grasp the remaining concepts required for this project before the coding period starts.

Work plan

Community Bonding Period

set up work environment (installation and configuration)
study Hindi and English language rules thoroughly
check what has already been done (study monodices from Hindi-English and Hindi-Urdu language pairs)
get monolingual and multilingual aligned corpora for further analysis
prepare a list of words sorted by frequency of occurrence for Hindi dictionary (to acquire at least 80% coverage)
learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
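
The frequency list and the 80% coverage target mentioned above could be computed along these lines. This is only a minimal sketch: the function name `words_for_coverage`, the whitespace tokenization and the toy token list are assumptions for illustration, not part of any Apertium tool.

```python
from collections import Counter


def words_for_coverage(tokens, target=0.80):
    """Return the most frequent words needed to reach `target` token coverage.

    `tokens` is a list of word tokens from a corpus (tokenization is assumed
    to have been done already). The result is a pair: the selected words in
    descending frequency order, and the coverage they actually achieve.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    selected = []
    covered = 0
    for word, freq in counts.most_common():
        # Stop as soon as the words chosen so far reach the target.
        if total == 0 or covered / total >= target:
            break
        selected.append(word)
        covered += freq
    coverage = covered / total if total else 0.0
    return selected, coverage


if __name__ == "__main__":
    # Toy corpus; a real run would read tokens from a Hindi corpus file.
    corpus = "a a a a b b b c c d".split()
    words, coverage = words_for_coverage(corpus, target=0.80)
    print(words, coverage)
```

The selected words give a priority order for adding entries to the dictionary, and the returned coverage shows how far a given list goes on the corpus it was built from.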

Week1

write test scripts (make use of the existing language-pair regression and corpus tests)
add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries

Week2

work on Hindi monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week3

work on Hindi-English bilingual dictionary; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list

Week4

add the rest of the words

Deliverable1: Desirable coverage acquired for both languages

Week5

gather translational data with the use of parallel corpora
add basic transfer rules for the purpose of testing, verify the tag definition files
work on bilingual dictionary

Week6

work further on bilingual dictionary
update the Polish-Czech page of the "False Friends of the Slavist" wikibook

Week7

prepare a list of word sequences that frequently appear together for both Hindi and English (use Apriori algorithm to find frequent sets)
add multiwords with translations to the dictionaries
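
As a rough illustration of the multiword search planned for this week, the sketch below counts adjacent word pairs and keeps those that meet a minimum support. It is a simplification, not the full Apriori algorithm: it covers only the pair level and borrows the Apriori pruning idea that a pair can be frequent only if both of its words are frequent. The function name `frequent_bigrams`, the `min_support` parameter and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def frequent_bigrams(sentences, min_support=2):
    """Find adjacent word pairs that occur at least `min_support` times.

    `sentences` is a list of token lists. Words that are themselves
    infrequent are pruned first, since no pair containing them can
    reach the support threshold.
    """
    unigrams = Counter(word for sentence in sentences for word in sentence)
    frequent_words = {w for w, c in unigrams.items() if c >= min_support}

    pairs = Counter()
    for sentence in sentences:
        for left, right in zip(sentence, sentence[1:]):
            if left in frequent_words and right in frequent_words:
                pairs[(left, right)] += 1

    return {pair: count for pair, count in pairs.items() if count >= min_support}


if __name__ == "__main__":
    sample = [
        ["prime", "minister", "spoke"],
        ["the", "prime", "minister"],
        ["prime", "time"],
    ]
    print(frequent_bigrams(sample, min_support=2))
```

Candidate pairs found this way would still need manual review before being added to the dictionaries as multiwords, since frequent co-occurrence alone does not make a sequence a lexical unit.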

Week8

bring the dictionaries to a consistent state (successful vocabulary tests)

Deliverable2: Bilingual dictionary completed

Week9

obtain hand-tagged training corpora
study the word order rules of Hindi and English (identify restrictions)
work on tag definition files
carry out supervised tagger training (with retraining on untagged text corpora) for both languages

Week10

extract segments of the parallel corpora that are translated (more or less) literally
work on transfer rules

Week11

carry out thorough regression tests
check dictionaries manually to spot possible errors

Week12

clean up, evaluation of results

Project completed

Throughout the project, the quality of translations will be monitored by means of regression and vocabulary tests. I will seek feedback at every stage and report progress on a dedicated wiki page.

List your skills and give evidence of your qualifications

I'm currently a fourth-year Bachelor's student in Electrical Engineering at the Indian Institute of Technology Ropar. I have always been interested in Computer Science and taught myself to code by working on different projects. I am quite comfortable with C, Java, C++, Android programming and Python. I have also completed courses in algorithms and data structures, operating systems, computer networks, Natural Language Processing and web development. I have worked on quite a few projects as part of my coursework.

I did a research internship last summer at Aston University, UK, as part of an exchange program. There I worked on the open source project TinyOS and built a fall detection system for elderly people using wireless sensor networks. I was able to improve the accuracy of previously known fall detection algorithms by about 7 percent. The code for the sensor networks was written mainly in C, and the web interface was built in Java.

I also did a machine learning project on Optical Character Recognition of Hindi text. The project used neural networks, which were mainly coded in Matlab. I developed an Android app for fast file sharing using Wi-Fi Direct as part of my computer networks course. As I worked as part of a team on these projects, I became familiar with tools like git and svn.

I am fascinated by the field of Natural Language Processing and really enjoy learning about new languages. I am fluent in Hindi and English and know a little Punjabi and Haryanvi. I think I have the skills required to complete this project successfully.

My non-Summer-of-Code plans for the Summer

I have no other plans for the summer than the GSoC program. I intended to apply for a job, but if my application is accepted I'll postpone it until the project is completed. The GSoC program begins before my academic year ends, so I would like to work on the project a bit longer than specified, perhaps until the end of August or even longer. During May and June I will have to combine my studies with developing the project, and I can then focus on it fully when my summer break starts in July. I'm sure there won't be any problems with studying and working on the GSoC project simultaneously, as I have already managed to work during 3 semesters of my studies.
