Name
Nikant Vohra
Contact information
E-mail: nikantv@iitrpr.ac.in
Skype: nikant.vohra
Why are you interested in machine translation?
I have been interested in machine translation since I took a course on Natural Language Processing as part of my university coursework. I was so fascinated by the field that I started reading about it on my own: I studied the online video lectures of the Stanford NLP course and also took a course on neurolinguistics. I loved the idea of a computer understanding and interpreting the grammar and other aspects of a language just as a human does. I then started experimenting with machine learning and completed a project on optical character recognition of the Devanagari script (Hindi). In this era of rapid technological advancement, I think machine translation will play a very important role in bringing the world together. Machine translation systems will reduce the need for human translators and make communication between people of different nationalities much easier. Moreover, the existing machine translation systems for less widely used languages are not good enough, so there is a lot of scope for improvement and innovation in this field.
Why are you interested in the Apertium project?
Because of my interest in machine translation, I started looking for open-source projects in the field. I particularly wanted to develop a system for Hindi to English translation, since the existing systems for this pair do not perform well. I came to know about this project from a batchmate who had completed a GSoC project with Apertium last year. I started going through the Apertium wiki documentation and found it very easy to understand and follow. I experimented with the dictionaries and transfer rules and was surprised to see how desirable results can be achieved with only minor improvements. I also appreciated the helpful Apertium community, which was willing to answer any of my questions. I think Apertium is doing a wonderful job of keeping some minor languages alive, so I would love to contribute to this project.
Why should Google and Apertium sponsor it?
Hindi is one of the most widely spoken languages in the world, covering 4.46% of the world population (source: Wikipedia). The existing systems for machine translation of this pair do not work very well. The Hindi-English language pair still lies in the incubator stage of the Apertium repository and needs a lot of improvement before it can be made available for release. If this pair is released, it can significantly increase the number of people using Apertium. It would also help a large number of people who know only one of these languages to communicate and exchange ideas.
How and whom will it benefit in society?
English is spoken by quite a lot of people in India. The major educational institutes in India teach in English, yet a majority of the population can read and write only in Hindi, so language acts as a barrier to the study of technology, which is very important for solving the ever-growing social and economic problems in India. If a robust system for English-Hindi conversion is introduced, it can benefit thousands of people in India. It can also help people in other countries, since a number of Hindi texts are very popular all over the world; an online system to convert these texts to English could help a large number of people to understand and make use of them.
Which of the published tasks are you interested in? What do you plan to do?
The project I'd like to work on is machine translation for the Hindi-English language pair in Apertium.
Some work has already been done for this language pair, and I would like to make it available for release by the end of the coding period. The pair currently lies in the incubator stage, with the Hindi dictionary in the WX format. Some conversion of this dictionary to Unicode has already been done, but a major part is still left, and the dictionary still has no support for verbs and pronouns. The bilingual dictionary contains about 19,000 words, but many more need to be added to make this work. The apertium-tagger needs to be trained on Hindi corpora. The transfer rules also need to be improved considerably and support for multiwords needs to be added. In addition, some constraint grammar rules need to be written to make the output read well.
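As a rough illustration of what the remaining WX-to-Unicode conversion involves, here is a toy Python sketch. The mapping table covers only a handful of WX letters rather than the full scheme, and the naive character-by-character substitution ignores word-initial vowel forms, halant insertion and conjuncts, so it is a starting point for discussion rather than a working converter.

# Toy WX -> Devanagari sketch (illustrative subset of the WX scheme only).
WX_TO_DEVANAGARI = {
    "k": "क", "K": "ख", "g": "ग",
    "c": "च", "j": "ज",
    "w": "त", "x": "द", "n": "न",
    "p": "प", "b": "ब", "m": "म",
    "r": "र", "l": "ल", "s": "स", "h": "ह",
    "a": "",          # inherent vowel after a consonant: no separate sign
    "A": "ा", "i": "ि", "I": "ी", "u": "ु", "U": "ू",
}

def wx_to_unicode(token):
    # Naive left-to-right substitution; a real converter must also handle
    # word-initial vowels, halant insertion and nasalisation.
    return "".join(WX_TO_DEVANAGARI.get(ch, ch) for ch in token)

print(wx_to_unicode("kamala"))   # -> कमल ("lotus") with this toy table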
Wikipedia gives a huge list of resources available for Hindi script. I intend to use Shabdanjali, an offline Hindi-English dictionary, as the basis for the bilingual dictionary; it is available in Unicode format, which is appropriate for the project. Various online morphological taggers are available for Hindi and can be used for the creation of the monodix. The Hindi-Urdu language pair, which lies in the nursery stage, can also be reused for developing the monodix. There is an online Hindi corpus made available by the LTRC group at IIIT Hyderabad that can be used for training the Apertium tagger on Hindi text.
English and Hindi have a lot in common: most of the grammatical building blocks are the same in both languages. But there are plenty of differences as well, for example in word order (English is SVO, while Hindi is SOV and uses postpositions) and in the inflection of various parts of speech. I intend to make full use of the similarities when writing appropriate transfer rules for these languages.
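To make the structural differences concrete, here is a toy sketch of the kind of reordering a transfer rule has to perform. It is my own simplification for illustration only, not Apertium's actual transfer formalism (which is written as XML rule files), and the chunking of the input is simply assumed to be given.

# Toy structural-transfer sketch: English SVO with prepositions ->
# Hindi-style SOV with postpositions. Hand-rolled illustration only.
def to_sov(chunks):
    # chunks: list of (text, role) pairs; roles are 'subj', 'verb',
    # 'obj' and 'pp' (preposition followed by its noun phrase).
    subject = [t for t, r in chunks if r == "subj"]
    # move the adposition after its noun phrase: "in the house" -> "the house in"
    pps = [" ".join(t.split()[1:] + t.split()[:1]) for t, r in chunks if r == "pp"]
    objects = [t for t, r in chunks if r == "obj"]
    verbs = [t for t, r in chunks if r == "verb"]
    return " ".join(subject + pps + objects + verbs)

print(to_sov([("Ram", "subj"), ("reads", "verb"),
              ("a book", "obj"), ("in the house", "pp")]))
# -> "Ram the house in a book reads", mirroring the Hindi order
#    "Ram ghar mein kitab padhta hai"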
I have already become quite familiar with the Apertium framework. I completed the Hindi-English coding challenge set by the Apertium community, adding support for a lot of verbs, pronouns and adjectives to both the monolingual and bilingual dictionaries. I also went through a lot of Apertium documentation on morphological disambiguation, writing transfer rules and writing constraint grammars. I think I will be able to grasp the remaining concepts required for this project before the coding period starts.
Work plan
Community Bonding Period
- set up work environment (installation and configuration)
- study Hindi and English language rules thoroughly
- check what has already been done (study monodices from Hindi-English and Hindi-Urdu language pairs)
- get monolingual and multilingual aligned corpora for further analysis
- prepare a list of words sorted by frequency of occurrence for the Hindi dictionary (to acquire at least 80% coverage; see the coverage sketch after this list)
- learn to use dictionaries and tools (dixtools package, apertium-viewer, poliqarp) in practice
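A minimal sketch of how the frequency list and the 80% coverage figure could be computed from a raw corpus is shown below. The corpus file name is a placeholder and plain whitespace tokenisation is a crude stand-in for proper Hindi tokenisation.

# Build a frequency list from a raw text corpus and report how many of the
# most frequent words are needed to reach 80% token coverage.
# "hindi_corpus.txt" is a placeholder path.
from collections import Counter

def frequency_list(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts.most_common()

def words_for_coverage(freqs, target=0.80):
    total = sum(n for _, n in freqs)
    running = 0
    for i, (word, n) in enumerate(freqs, start=1):
        running += n
        if running / total >= target:
            return i
    return len(freqs)

freqs = frequency_list("hindi_corpus.txt")
print("distinct words:", len(freqs))
print("words needed for 80% coverage:", words_for_coverage(freqs))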
Week1
- write test scripts (make use of the existing language-pair regression and corpus tests; a minimal test-runner sketch follows this list)
- add the missing closed-class words (pronouns, prepositions, conjunctions, determiners, numbers, modal verbs and the like) to the dictionaries
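One possible shape for such a test script is sketched below: it pipes source sentences through the apertium pipeline and compares the output with stored reference translations. The mode name hin-eng, the local data directory and the regression_tests.tsv file name are assumptions made for this sketch.

# Minimal regression-test sketch: compare apertium output with expected
# translations stored one pair per line, tab-separated.
import subprocess

def translate(text, mode="hin-eng", datadir="."):
    result = subprocess.run(["apertium", "-d", datadir, mode],
                            input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

def run_tests(path="regression_tests.tsv"):
    passed = failed = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, expected = line.rstrip("\n").split("\t")
            got = translate(source)
            if got == expected:
                passed += 1
            else:
                failed += 1
                print("FAIL:", source, "->", got, "(expected:", expected + ")")
    print("passed:", passed, "failed:", failed)

run_tests()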
Week2
- work on the Hindi monodix; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list (an entry-generation sketch follows below)
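Much of this entry-adding work can be semi-automated. The sketch below turns a short list of (lemma, paradigm) pairs into Apertium monodix <e> entries; the paradigm names are hypothetical placeholders, the whole lemma is reused as the stem for simplicity, and generated entries would still need manual review before being committed.

# Generate Apertium monodix <e> entries from (lemma, paradigm) pairs.
# The paradigm names below are hypothetical; real paradigms must already
# be defined in the .dix file, and stems may need their endings stripped.
from xml.sax.saxutils import escape

def make_entry(lemma, paradigm):
    lm = escape(lemma)
    return '<e lm="{0}"><i>{0}</i><par n="{1}"/></e>'.format(lm, paradigm)

words = [
    ("लड़का", "ladakaa__n"),    # boy, noun
    ("अच्छा", "acchaa__adj"),   # good, adjective
]

for lemma, paradigm in words:
    print(make_entry(lemma, paradigm))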
Week3
- work on Hindi-English bilingual dictionary; add open-class words (nouns, verbs, adjectives, adverbs) according to the frequency list
Week4
- add the rest of the words
Deliverable1: Desirable coverage acquired for both languages
Week5
- gather translational data with the use of parallel corpora
- add basic transfer rules for the purpose of testing, verify the tag definition files
- work on bilingual dictionary
Week6
- work further on bilingual dictionary
- document false friends and other easily confused word pairs between Hindi and English on the project wiki page
Week7
- prepare a list of word sequences that frequently appear together in both Hindi and English (use the Apriori algorithm to find frequent sets; a bigram-counting sketch follows this list)
- add multiwords with translations to the dictionaries
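The sketch below implements the first pass of such a search as a plain frequent-bigram count, which is a simplification of both full multiword extraction and the Apriori procedure; the corpus path and the frequency threshold are placeholder assumptions.

# Count word bigrams in a corpus and keep the ones that occur often enough
# to be multiword candidates. "hindi_corpus.txt" is a placeholder path.
from collections import Counter

def frequent_bigrams(path, min_count=50):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            counts.update(zip(tokens, tokens[1:]))
    return [(bg, n) for bg, n in counts.most_common() if n >= min_count]

for bigram, n in frequent_bigrams("hindi_corpus.txt")[:20]:
    print(" ".join(bigram), n)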
Week8
- bring the dictionaries to a consistent state (successful vocabulary tests)
Deliverable2: Bilingual dictionary completed
Week9
- obtain hand-tagged training corpora
- study the word order rules of Hindi and English (identify restrictions)
- work on tag definition files
- carry out supervised tagger training (with retraining on untagged text corpora) for both languages
Week10
- extract segments of the parallel corpora that are translated (more or less) literally
- work on transfer rules
Week11
- carry out thorough regression tests
- check dictionaries manually to spot possible errors
Week12
- clean up, evaluation of results
Project completed
Throughout the work, the quality of the translations will be controlled by means of regression and vocabulary tests. The work will be discussed with the community at every stage and the progress will be reported on a dedicated wiki page.
List your skills and give evidence of your qualifications
I'm currently a first-year student of the Master's programme in Computer Science at Gdansk University of Technology, Poland. I follow an Individual Studies Programme and have received a scholarship for high academic achievement. During my previous studies I did a lot of programming, mainly in C/C++, Java and C#. I have also completed courses in algorithms and data structures, logic, operating systems (shell scripting, regular expressions), data mining, automata theory and formal languages. I have learned how a compiler works and how to generate simple lexical, syntactic and semantic analysers for the Pascal and Ada languages using flex, bison and yacc. I also completed a course in artificial intelligence, where I learned about hidden Markov models and neural networks.
So far I haven't participated in an open-source project, but I have been involved in several research projects at my university concerning motion detection and tracking and hand gesture recognition. At present I'm working on a speech interface for a smart medical services system that will let the user communicate through a 3D avatar.
I have been working as an intern at the Speednet company for 1.5 years. During that time I was part of a team that developed an Electronic Health Card system, and I was responsible for the mobile part of the system, written in the .NET Compact Framework. I became familiar with software localisation and used MT to automate translation between Polish and English. Apart from that, I learned how to use TortoiseSVN and MantisBT.
In my projects I use the PostgreSQL and Microsoft SQL Server DBMSs. Recently I also started a course on Oracle. I know .NET technology (Windows Forms, Windows Forms CE, WPF, WCF, Silverlight) and the basics of Java EE (servlets, JSP/JSF, Facelets, JPA, JAAS, JMS). I'm also familiar with distributed and parallel programming concepts.
I really enjoy learning languages and I consider myself good at it. I know Polish (mother tongue), English (Cambridge CAE certificate), German (pre-intermediate level) and some basic Croatian. Whenever I go abroad, I always remember to take a language guide with me. Although I have never taken Czech lessons, I can understand it quite well because of its similarity to Polish. I strongly believe I can successfully build a translator for this language pair.
My non-Summer-of-Code plans for the Summer
I have no other plans for the summer than the GSoC programme. I intended to apply for a job, but if my application is accepted I'll postpone it until the project is completed. The GSoC programme begins before my academic year ends, so I would like to work on the project a bit longer than specified, perhaps until the end of August or even longer. During May and June I will have to combine my studies with developing the project, and then I can focus on it fully when my summer break starts in July. I'm sure there won't be any problems with me studying and working on the GSoC project simultaneously, as I have already managed to work during three semesters of my studies.