
User:Khannatanmai/GSoC2020Proposal DistributedRepresentations


Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (4th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Open-source software I use: Ubuntu, Firefox, and VLC. I have also used Apertium in the past.

Professional Interests: I’m an undergraduate researcher in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I love Parliamentary Debating, Singing, and Reading.

What I want to get out of GSoC

I’ve enjoyed using Apertium in various personal and academic projects, and I am excited by the opportunity to work with the Apertium community.

Computational Linguistics is my passion, and I would love to work with similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC because the stipend would allow me to dedicate my full attention to the project during the three months.

Why is it that I am interested in Apertium and Machine Translation?

Apertium is an open-source, rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation. It interests me because it is a complex problem that tries to achieve something most people believe only humans can do. Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news, and the judiciary. My professors often call Machine Translation NLP-complete: it uses most of the tools NLP has to offer, so anyone who learns to build good tools for MT learns most of Natural Language Processing.

Each part of Apertium's mission statement, especially its focus on low-resource languages, makes me excited to work with the project. While recent trends lean towards Neural Networks and Deep Learning, these approaches fall short when it comes to resource-poor languages.

A tool that is rule-based and open source really helps the community with resource-poor language pairs by giving them free translations for their needs, and that is why I want to work on improving it.

I worked with Apertium for GSoC 2019 and have continued to update and maintain the Anaphora Resolution module that I developed. I have also contributed to a paper about the recent advances in Apertium. I have enjoyed every bit of the process, and since I plan to be a long-time contributor to Apertium, I'm applying for this project. It would help the users of this tool and help me develop a deep knowledge of the Apertium pipeline, which will also help with all the future projects I do in Apertium.

Project Proposal

Which of the published tasks am I interested in? What do I plan to do?

Proposed Improvements (TODO)

Propagating the surface form

Modification of words in transfer based on TL tags

Compounds and multiwords

Idea Description

Working Example

Work Plan (TODO)

Community Bonding Period (May 6 - May 27)

Week 1-4 (May 27 - )

Deliverable #1:

Evaluation 1: June 24-28

Week 5-8 (June 28)

Deliverable #2:

Evaluation 2: July 22-26

Week 9-12 (July 26)

Final Evaluations: August 19-26

Project Completed

NOTE: The third phase of the project has extra time to deal with unforeseen issues and ideas

A description of how and who it will benefit in society


I’m from India, and for a lot of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource-poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium already benefits society.

Reasons why Google and Apertium should sponsor it

I've been a regular contributor to Apertium for more than a year now, and this project aims to improve almost every part of the pipeline. The funding that I receive will help me focus my time and resources on this project so that it can be adequately completed in three months.

By funding this project, Google will help improve an important open-source tool and promote open-source development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a fourth-year student and an Undergraduate Researcher at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual degree in which we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab at IIIT Hyderabad and I'm part of the MT group in our university.

I've been interested in linguistics from the very beginning, and thanks to rigorous programming courses I'm also adept at several languages and formats, such as Python, C++, XML, and Bash scripting. I'm skilled in writing Finite State Transducers and in Algorithms, Data Structures, and Machine Learning algorithms as well.

I also have a lot of experience studying data, which I feel is essential in solving any problem.

I worked with Apertium as part of GSoC 2019 and built the Anaphora Resolution module, so I'm familiar with the codebase and the community. This will help me dive right into the project and make a significant contribution from the start. I have worked on several other projects, such as a tool that predicts commas and sentence boundaries in ASR output using pitch, a Translation Memory, Homographic Pun detection, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers, and Chatbots, all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of the privacy settings, but they can be provided if needed.

I am fluent in English and Hindi and have a basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Non-Summer-Of-Code Plans

I have no plans apart from GSoC in the summer and can devote 30-40 hours a week to this project.