User:Khannatanmai/GSoC2020Proposal Trimming

From Apertium
Jump to navigation Jump to search

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com , tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (4th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Open Source Softwares I use: I have used Apertium in the past, Ubuntu, Firefox, VLC.

Professional Interests: I’m an undergraduate researcher in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I love Parliamentary Debating, Singing, and Reading.

What I want to get out of GSoC

I’ve enjoyed using Apertium in various personal and academic projects and it’s amazing to me that I get an opportunity to work with them.

Computational Linguistics is my passion, and I would love to work with similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC, as the stipend would allow me to dedicate my full attention to the project during the 3 months.

Why is it that I am interested in Apertium and Machine Translation?

Apertium is an Open Source Rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation and it interests me because it’s a complex problem which tries to achieve something most people believe is only achievable by humans. Translating data to other languages, and especially low- resource languages gives the speakers of those languages access to valuable data and can help in several domains, such as education, news, judiciary, etc. Machine Translation is often called NLP-Complete by my professors, i.e. it uses most of the tools NLP has to offer and hence if one learns to create good tools for MT, they learn most of Natural Language Processing.

Each part of Apertium's mission statement, especially the fact that they focus on Low Resource Languages, excites me to be working with them. While recent trends lean towards Neural Networks and Deep Learning, they fall short when it comes to resource-poor languages.

A tool which is rule-based and open source really helps the community with language pairs that are resource- poor and gives them free translations for their needs and that is why I want to work on improving on it.

I've worked with Apertium for GSoC 2019 and have continued to update and maintain the Anaphora Resolution module that I developed. I have also contributed to a paper written about the recent advances in Apertium. I have enjoyed every bit of the process and since I plan to be a long time contributor with Apertium, I'm applying for this project, that eliminates dictionary trimming and would help the users of this tool and help me develop a deep knowledge about the Apertium pipeline, which will help for all the future projects I do in Apertium as well.

Project Proposal

Which of the published tasks am I interested in? What do I plan to do?

The task I'm interested in is Eliminating Dictionary Trimming(Ideas_for_Google_Summer_of_Code/Eliminate_trimming). Dictionary trimming is the process of removing those words and their analyses from monolingual language models (FSTs compiled from monodixes) which don't have an entry in the bidix, to avoid a lot of untranslated lemmas (with an @ if debugging) in the output, which lead to issues with comprehension and post-editing the output. The section #WhyWeTrim explains the rationale behind trimming further.

However, by trimming the dictionary, you throw away valuable analyses of words in the source language, which, if preserved, can be used as context for lexical selection and analysis of the input. Also, several transfer rules don't match as the word is shown as unknown. There are several other project ideas that would become viable if we don't trim away the analyses of monodix words that aren't in the bidix, such as a morph-guessing module that can output a source language lemma with target language morphology. Eliminating trimming would also help as if we learn morphological data for a language from an external source, if we continue to trim, it is unusable until all those words are in the bidix. As a general principle, it is better to try and not discard useful information, to pass it in the pipeline and then work on it based on the task.

As part of this project, I plan to eliminate monodix trimming and propose a solution such that we don't lose the benefits of trimming.

Proposed Improvements

Idea Description

Working Example

Work Plan

Community Bonding Period (May 6 - May 27)

Week 1 (May 27)

Week 2 (June 3)

Week 3 (June 10)

Week 4 (June 17)

Deliverable #1:

Evaluation 1: June 24-28

Week 5 (June 28)

Week 6 (July 4)

Week 7 (July 10)

Week 8 (July 16)

Deliverable #2:

Evaluation 2: July 22-26

Week 9 (July 26)

Week 10 (August 1)

Week 11 (August 7)

Week 12 (August 13)

Final Evaluations: August 19-26

Project Completed

NOTE: Week 11 and Week 12 have extra time to deal with unforeseen issues and ideas


A description of how and who it will benefit in society

It will definitely benefit most users of Apertium and hopefully will attract more people to the tool. I’m from India and for a lot of our languages, we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

Reasons why Google and Apertium should sponsor it

With this project I aim to help the users of Apertium, I wish to become a regular contributor to Apertium and become equipped to do a lot more Open Source Development in the future for other organisations as well.

By funding this project, Google will help improve an important Open Source tool and promote Open Source Development. In a world of Proprietary softwares, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a third year student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab in IIIT Hyderabad and I'm part of the MT group in our university.

I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well.

I enjoy reading research papers and I have analysed several research papers in preparation for this proposal, all of which have been cited. I also have a lot of experience studying data which I feel is essential in solving any problem.

Due to the focused nature of our course, I have worked in several projects, such as building Translation Memory, Detecting Homographic Puns, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers, Building Chatbots, etc. all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of the privacy settings but can be provided if needed.

I am fluent in English, Hindi and have basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Coding Challenge

Non-Summer-Of-Code Plans