User:Sakshi.iiita/Application


Name
E-mail address
Other information that may be useful to contact you
Why is it you are interested in machine translation?
Why is it that you are interested in the Apertium project?
Which of the published tasks are you interested in?
Reasons why Google and Apertium should sponsor it
A description of how and whom it will benefit in society.
A detailed work plan
List your skills and give evidence of your qualifications.


Name: Sakshi Rastogi
E-mail address: sakshi.iiita@gmail.com
Address: Room No. 413, Girls Hostel 2, Indian Institute of Information Technology Allahabad, U.P., India.


Interest in Machine Translation: My interest in machine translation is quite recent. Frankly speaking, I first read about it in December, when I began thinking about doing a Natural Language Processing (NLP) project this summer through Google Summer of Code. I have a keen interest in NLP because it is an area which, if exploited well, could yield remarkable software for solving language-related problems.
Machine Translation (MT) grabbed my attention because it promises a radical change in a field that currently depends on enormous human effort. An attempt to automate translation on such a large scale will surely bear fruit. It is a hard problem: language as a whole is ambiguous, and bridging two ambiguous systems is itself difficult. But a hybrid combination of statistics and rules can do wonders, as the very existence of platforms like Apertium demonstrates.
The urge to contribute to such an effort has drawn me to machine translation.

Interest in the Apertium Project: Apertium is one of the few GSoC organizations that caters to my interest in Natural Language Processing. Machine translation is a promising field, and Apertium is built entirely around it. The project also offers an opportunity to learn about various languages, and even to contribute to a language pair if one knows a language particularly well.
I am a strong supporter of Open Source and therefore look up to such organizations.

Task Of interest: Detect hidden unknown words by using the probabilities of the HMM-based part-of-speech tagger in Apertium.

Reasons for Google and Apertium to sponsor it: Apertium is an open-source shallow-transfer machine translation (MT) system. In addition to the translation engine, it provides tools for manipulating linguistic data, and translators designed to run on the engine. The dictionaries of the different languages may not provide every lexical form for a particular surface form. There is a need to fill such gaps and make the dictionaries more robust so that translation becomes more accurate.
Google's contribution to Open Source is commendable. Through Google Summer of Code it has supported open-source development, with more and more people willing to contribute. By giving Apertium the required assistance, it builds momentum in machine translation, a promising field of research and development.

Benefit to society: Users will benefit because this feature warns about possible missing entries in the dictionary. A warning about a missing lexical form for an existing surface form will alert the language-pair maintainers, who can then add the missing inflections to the dictionary, enriching it. This will result in more accurate translations and considerably reduce the chance of translation errors.
Contributors (the language-pair maintainers) will therefore benefit from it as well.

Work plan:
Goal: Detect hidden unknown words by using the probabilities of the HMM-based part-of-speech tagger in Apertium.


Understanding of the problem: The task is to add to the existing part-of-speech tagger a feature that lists the missing lexical forms of words already present in a language's dictionary.
This can be accomplished if the existing code, besides determining the part of speech of each word in a sentence (the tag sequence that maximizes the probability of the observed sequence), also checks whether that lexical form of the surface form actually exists in the dictionary for that language. If it does not, the extended tagger should warn the language-pair maintainer. In the new version of the tagger, the part-of-speech options for a word will not be limited to the analyses available in the dictionary.
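The tag-sequence maximization mentioned above is the standard Viterbi decoding step of an HMM tagger. A minimal sketch, with a toy tagset and made-up probabilities (not Apertium's actual model, tagset or code):

```python
from math import log

# Toy HMM: tags and probabilities are illustrative only.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.85, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10},
}
EMIT = {
    "DET":  {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.6, "cat": 0.3, "barks": 0.1},
    "VERB": {"barks": 0.8, "dog": 0.2},
}

def viterbi(words):
    """Return the tag sequence that maximizes the joint probability."""
    # v[tag] = (log-prob of the best path ending in tag, that path)
    v = {t: (log(START[t]) + log(EMIT[t].get(words[0], 1e-6)), [t]) for t in TAGS}
    for w in words[1:]:
        v = {
            t: max(
                (p + log(TRANS[prev][t]) + log(EMIT[t].get(w, 1e-6)), path + [t])
                for prev, (p, path) in v.items()
            )
            for t in TAGS
        }
    return max(v.values())[1]

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

The extended tagger would then take each (surface form, chosen tag) pair from this output and look it up in the dictionary.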


Actual flow of implementation: My reading of the code has made me realize that it will take some time to understand, as the documentation is not entirely in English. Next I will familiarize myself with the Hidden Markov Model implementation as a whole.
Since we plan to detect missing dictionary entries, we will have to assume that each surface form can carry any open-class tag before disambiguation. This in turn means adding more ambiguity classes for each surface form in a sentence, which makes the calculation of the emission matrix more expensive computationally. Emission probabilities will be calculated by the same rule as before; only the size of the matrix increases. Not all ambiguity classes will be expanded: many words (such as articles) can appear in only one form, so their classes stay as they are. Expanding the ambiguity classes affects the methods that calculate emission probabilities, and transition probabilities for the new classes will also have to be decided; those methods will be updated accordingly.
Which lexical form needs to be added to the dictionary, and for which surface form, will be decided from the output of the (maximum-probability) part-of-speech sequence for the sentence. By comparing each surface form with its assigned tag, a lookup can be made against the dictionary entry for that surface form to check whether such a lexical form actually exists in that language. If it does not, the word is added to the list of missing words that the feature outputs. The language-pair maintainer can then review this list and add the words where needed.
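The expansion rule above — open-class words get every open-class tag, closed-class words keep exactly their dictionary tags — might be sketched as follows. The tag names, the open-class set and the toy dictionary are illustrative assumptions, not Apertium's actual data structures:

```python
# Hypothetical open-class tag set; real Apertium tagsets differ per language.
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}

# Toy monolingual dictionary: surface form -> tags the dictionary lists for it.
DICTIONARY = {
    "the":   {"DET"},           # closed class: articles get no expansion
    "light": {"NOUN", "ADJ"},   # open class, but "to light" (VERB) is missing
}

def expanded_ambiguity_class(word):
    """Return the set of tags the tagger may consider for this word.

    Closed-class words keep exactly their dictionary tags; open-class
    (or unknown) words are expanded with every open-class tag so the
    tagger can pick a reading the dictionary does not yet list.
    """
    tags = DICTIONARY.get(word, set())
    if tags and not (tags & OPEN_CLASS):
        return tags            # purely closed-class: no expansion
    return tags | OPEN_CLASS   # unknown or open-class: add all open tags

print(expanded_ambiguity_class("the"))    # {'DET'} -- unchanged
print(expanded_ambiguity_class("light"))  # now includes 'VERB' as a hidden reading
```

This is where the extra ambiguity classes, and hence the larger emission matrix, come from: "light" now belongs to a class of four tags instead of two.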
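The comparison of the tagger's output against the dictionary can be sketched like this; the dictionary contents and the function name are hypothetical, not Apertium's API:

```python
# Toy dictionary: surface form -> tags already listed for it.
DICTIONARY = {
    "they":    {"PRN"},
    "light":   {"NOUN", "ADJ"},   # "to light" (VERB) is missing
    "candles": {"NOUN"},
}

def missing_entries(tagged_sentence):
    """Collect (surface form, tag) pairs that the tagger chose but the
    dictionary does not yet contain, for the maintainer to review."""
    missing = []
    for word, tag in tagged_sentence:
        if tag not in DICTIONARY.get(word, set()):
            missing.append((word, tag))
    return missing

# Suppose the tagger chose VERB for "light" in "they light candles":
print(missing_entries([("they", "PRN"), ("light", "VERB"), ("candles", "NOUN")]))
# [('light', 'VERB')]
```

The resulting list is exactly the deliverable described in the work plan: suggested new lexical forms for the language-pair maintainer to review.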



Time Line:
Community Bonding Period
Week 1: April 27 - May 2
Getting acquainted with the way of working as an Apertium contributor and interacting with the mentors.
Week 2: May 3 - May 9
Reading suggested research papers and discussion for a complete understanding of the problem.
Week 3: May 10 - May 16
Reading the code documentation.
Week 4: May 17 - May 23
Reading and understanding of the existing code.

Coding Period
Week 5: May 24 - May 30
Expansion of ambiguity classes, with simultaneous documentation and testing of the code.
Week 6: May 31 - June 6
Expansion of ambiguity classes, with simultaneous documentation and testing of the code.
Week 7: June 7 - June 13
Calculation of the Emission matrix.
Week 8: June 14 - June 20
Implementing the whole code with expanded ambiguity classes.

Deliverable: The new code for part-of-speech tagging.

Week 9: June 21 - June 27
Training the tagger on a large corpus of sentences for various languages.
Week 10: June 28 - July 4
Generating the list of possible missing words.
Week 11: July 5 - July 11
Checking the output of the program with various language sentences.
Week 12: July 12 - July 18 (Mid-term Evaluation)
Applying suggested improvements.

Deliverable: Code that outputs the suggested list of new lexical forms.

Week 13: July 19 - July 25
Applying suggested improvements.
Week 14: July 26 - August 1
Testing and debugging.
Week 15: August 2 - August 8
Testing and debugging.
Week 16: August 9 - August 16
Testing and debugging.
Final Evaluation: August 9 - August 16



Skill set and qualifications: I am a third-year B.Tech student at the Indian Institute of Information Technology Allahabad, majoring in Information Technology; it is one of the esteemed institutions of our country. I have always been an above-average student academically. I also have a keen interest in programming, which I started at school as part of the curriculum. I code in both Java and C++ (with STL), and I have knowledge of data structures (taken as a course in my first year).
I started using Linux a year ago and work mainly in Ubuntu. Before GSoC I had never thought of contributing to open-source software, though to gain a better understanding of Linux I did work through Linux From Scratch.
Our university has a separate laboratory for Natural Language Processing, which I joined this year to build on my budding passion for the subject. As a university project I am currently working on extracting concepts from documents using semantic clustering of words with the Fuzzy C-Means algorithm.
I have studied Artificial Intelligence and, in the current semester, have opted for an elective on soft computing. Both subjects have helped me develop a solid knowledge of Hidden Markov Models in particular. In our practical classes I have even had the opportunity to implement one and use it as a tool for prediction. HMMs and genetic algorithms have always been my favorite research topics.
In addition to the above, I have a working knowledge of SQL, HTML and XML. I have taken courses on Compiler Design, Programming Practices and Database Management. I have no other engagements this summer and look forward to devoting myself entirely to this problem. I promise to be punctual and consistent in my work.