User:Khannatanmai

Google Summer of Code 2019: Proposal [Second Draft]

Anaphora Resolution

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Time Zone: GMT+5:30

About Me

Open source software I use: Ubuntu, Firefox, and VLC. I have also used Apertium in the past.

Professional Interests: I’m currently studying NLP and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I love Parliamentary Debating, Singing, and Reading.

What I want to get out of GSoC

I’ve studied Apertium, and it’s amazing to me that I have an opportunity to work with its community. NLP is what I want to do in life, and working with a team to develop tools that real people use will be invaluable experience that classes simply cannot match. Of course, the stipend is a big plus!

Why is it that I am interested in Apertium and Machine Translation?

Apertium is an open source rule-based MT system. I have been part of the Machine Translation lab at my college, and MT interests me because it is a complex problem that my professors often call NLP-complete: it uses most of the tools NLP has to offer, so anyone who learns to do MT well learns most of Natural Language Processing.

Each part of Apertium's mission statement, especially its focus on low-resource languages, interests me and makes me excited to work with the project. While neural networks and deep learning are the fad these days, they work well only for resource-rich languages.

A tool that is rule-based and open source really helps communities whose language pairs are resource-poor, giving them free translations for their needs. I'm interested in working with Apertium and GSoC so that I can contribute to helping the community through this project.

Project Proposal

Which of the published tasks am I interested in? What do I plan to do?

Anaphora Resolution - Apertium currently defaults to the masculine pronoun. I don’t plan to build a perfect anaphora resolution system in 3 months, but I can build one that uses complex features to pick an antecedent. I am confident that it will significantly increase the fluency and intelligibility of the output.

The Anaphora Resolution tool will be language agnostic and will improve the output for most language pairs in Apertium. Pronouns are present in a lot of languages and they change based on their antecedent's gender, number, person, etc. This tool will figure out the antecedent and will choose the correct pronoun accordingly.

Sentences like “The group agreed to release his mission statement” and “The girl ate his apple” are incoherent, and in complex sentences an incorrect pronoun will more often than not confuse readers.

What puts people off Machine Translation is the lack of fluency. This tool will generate more fluent sentences, leading to more trust in Apertium's output.

Proposed Modifications

I will be working with Spanish-English and Catalan-English pairs while developing the tool. However, the features will be made largely language agnostic and I will also evaluate how well it works for other language pairs which need Anaphora Resolution.

Ultimately, the system should be able to do Anaphora Resolution for the following:

Pronouns

Spanish Sentence: La chica está aquí, está vistiendo un vestido rojo

Apertium Translation: The girl is here, is dressing a red dress

After Anaphora [Proposed Translation]: The girl is here, she is dressing/wearing a red dress

Possessive Pronouns

Spanish Sentence: La chica comió su manzana

Apertium Translation: The girl ate his apple

After Anaphora [Proposed Translation]: The girl ate her apple

Zero Pronouns

Spanish Sentence: Canta bien

Apertium Translation: It sings well

After Anaphora [Proposed Translation]: He/She/It sings well (Based on context)
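The pronoun choice in the examples above can be sketched as a lookup on the antecedent's gender and number. This is only an illustration, not the proposed implementation: the function name `choose_possessive` and the tag names (`f`, `m`, `nt`, `sg`, `pl`) are assumptions modelled loosely on Apertium-style morphological tags, and the fallback to "his" simply mirrors the current default behaviour described earlier.

```python
# Hypothetical mapping from (gender, number) tags to an English possessive.
# The tag names are assumptions for this sketch.
POSSESSIVE = {
    ("f", "sg"): "her",
    ("m", "sg"): "his",
    ("nt", "sg"): "its",
}


def choose_possessive(antecedent_tags, default="his"):
    """Return a possessive that agrees with the antecedent's tags.

    Falls back to `default` (the current masculine default) when the
    tags do not carry enough information to decide.
    """
    tags = set(antecedent_tags)
    if "pl" in tags:
        # English plural possessives do not vary with gender.
        return "their"
    for (gender, number), pronoun in POSSESSIVE.items():
        if gender in tags and number in tags:
            return pronoun
    return default


# "La chica comió su manzana": the antecedent "chica" is feminine singular.
print(choose_possessive(["n", "f", "sg"]))
```

With an antecedent that carries no gender tag, the sketch returns the default, which is the behaviour the fallback mechanism in the work plan is meant to preserve.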

Work Plan

Would be good to line these tasks up with weeks in the program (including the community bonding period). —Firespeaker (talk) 18:48, 16 March 2019 (CET)

Understand the system and get familiar with the files that I need to modify
Formalise the problem and limit the scope of anaphora resolution (to the anaphora needed for MT)
Automatic Annotation of anaphora for evaluation
Flowchart of proposed system and Pseudocode
Implement a scoring system for antecedent indicators [work for Spanish-English and Catalan-English for now]
Decide on a definite context window
Implement basic anaphora outside the pipeline (Python)
Implement basic transfer rules to see if final system will work
A basic prototype of final system ready
Port the basic prototype to C++ (All further coding to be done in C++)
TEST the system
Document the outline
Implement system to work out all possible antecedents
Add ability to give antecedents a score
TEST basic sentences with single antecedents, Test the pipeline

Deliverable #1: Anaphora Resolution for single antecedents, with transfer rules [The full pipeline]

Implement Antecedent Indicators:
Implement Boosting Indicators
Implement Impeding Indicators
Implement tie breaking systems
Implement fallback for anaphora (in case of too many antecedents or not past certainty threshold)
Code to remember antecedents for a certain window
TEST Scoring System
Implement transfer rules to deal with new additions in Spanish-English pair
Implement transfer rules to deal with new additions in Catalan-English pair
Evaluate current system and produce precision and recall

Deliverable #2: Anaphora Resolution with antecedent scores, fallback mechanism, for Spa-Eng and Cat-Eng
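The context window, tie breaking, and fallback steps listed for this deliverable could look roughly like the sketch below. All names and default values (`new_window`, `resolve`, a window of 4, a threshold of 1, at most 5 candidates) are placeholder assumptions, since the plan leaves these to be decided. The tie-breaking rule used here (prefer the most recent candidate) is likewise just one possible choice.

```python
from collections import deque

# Hypothetical defaults; the work plan leaves these values open.
WINDOW_SIZE = 4
THRESHOLD = 1
MAX_CANDIDATES = 5


def new_window(size=WINDOW_SIZE):
    """Context window that forgets the oldest antecedent once it is full."""
    return deque(maxlen=size)


def resolve(scored, threshold=THRESHOLD, max_candidates=MAX_CANDIDATES):
    """Pick an antecedent from (candidate, score) pairs, oldest first.

    Returns None to signal fallback when there are no candidates,
    too many candidates, or the best score is below the threshold.
    """
    if not scored or len(scored) > max_candidates:
        return None
    # Tie-break (assumption): among equal scores, prefer the candidate
    # that appears latest in the list, i.e. the most recent one.
    _, (candidate, score) = max(
        enumerate(scored), key=lambda item: (item[1][1], item[0])
    )
    if score < threshold:
        return None
    return candidate


window = new_window(2)
for noun in ["girl", "apple", "dress"]:
    window.append(noun)
print(list(window))                           # only the two most recent nouns remain
print(resolve([("girl", 3), ("apple", 1)]))   # highest score wins
```

When `resolve` returns None, the caller would keep whatever pronoun the pipeline produces today, so the output is never worse than the current default.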

[OPTIONAL: If current system not producing good enough results]
Implement Expectation-Maximization Algorithm using monolingual corpus
Implement choosing the antecedent with maximum probability, either as a tie-breaker when the Scoring System has a tie or as an independent system
Compare and Evaluate the effectiveness of the above two possibilities
Test EM Algorithm and implemented system

[NOT OPTIONAL]
Document Antecedent Indicators, Scoring System, Fallback for Cat-Eng & Spa-Eng
Insert into Apertium pipeline
Implement code to accept input in chunks and process it
Output with anaphora attached
EXTENSIVELY TEST final system
Try out other language pairs
Evaluate and find out which features are language agnostic
Decide on list of features for agnostic anaphora and for language specific anaphora
TEST on multiple pairs and give Evaluation Scores
TEST for backwards compatibility and ensure it
Project Completed
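The evaluation steps in the work plan report precision and recall. A minimal sketch of that computation follows, assuming (hypothetically) that both the system output and the gold annotation are stored as mappings from an anaphor's position to its antecedent's position, and that anaphors on which the system falls back are simply left out of the predictions.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of resolved anaphora.

    `predicted` and `gold` map an anaphor position to an antecedent
    position (a hypothetical representation chosen for this sketch).
    """
    correct = sum(
        1
        for anaphor, antecedent in predicted.items()
        if gold.get(anaphor) == antecedent
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Two anaphors resolved, one of them correctly; three anaphors in the gold data.
predicted = {5: 1, 9: 7}
gold = {5: 1, 9: 6, 12: 10}
print(precision_recall(predicted, gold))
```

Under this representation a more cautious fallback raises precision at the cost of recall, which is why both numbers are worth reporting.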

Additional Information

Agreement: Different for different languages?
Agreement rules in Arabic, for instance, are different: a set of non-human items (animals, plants, objects) is referred to by a singular feminine pronoun.
Since Arabic is an agglutinative language, the pronouns may appear as suffixes of verbs, nouns (e.g., in the case of possessive pronouns) and prepositions.

Antecedent Indicators:

Boosting Indicators [Scoring different for different languages]
First NPs
Indicating Verbs
Lexical Reiteration
Section Heading Preference
Collocation Pattern Preference
Immediate reference (if it is a pronoun, then its referent)
Sequential Instructions

Impeding Indicators
Indefiniteness
Prepositional NPs

A description of how and who it will benefit in society

It will benefit most users of Apertium and will hopefully attract more people to the tool. I’m from India, and for many of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource-poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

However, what discourages people from using Machine Translation is unintelligible output and too much post-editing, which makes it time-consuming and costly for them. While Apertium aims to make minimal errors, as of now it selects a default masculine pronoun, which leads to several unintelligible outputs. Fixing that and making the system more fluent and intelligible overall will encourage people to use Machine Translation and will reduce the cost in time and money.

Reasons why Google and Apertium should sponsor it

I feel this project has a wide scope as it affects almost all language pairs and helps almost everyone using Apertium. A decent Anaphora Resolution will give the output an important boost in its fluency and intelligibility, not just for one language, but all of them.

It’s a project which has promising future prospects as well - apart from the fact that language specific features can be added to improve it, we’ll be doing anaphora resolution even for languages which don’t need it to pick the correct pronoun. Doing this will enable Apertium to do Gisting Translation in the future, for which Anaphora Resolution is essential.

With this project I aim to help the users of Apertium. I also wish to become a regular contributor to Apertium and to become equipped to do much more open source development for other organisations in the future.

By funding this project, Google will help improve an important open source tool and promote open source development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a third-year student at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual degree in which we study Computer Science, Linguistics, NLP and more. I've been interested in linguistics from the very beginning, and thanks to the rigorous programming courses I'm also adept at several languages and tools such as Python, C++, XML, and Bash scripting. I'm skilled in writing algorithms, data structures, and machine learning algorithms as well.

Due to the focused nature of our course, I have worked on several projects, such as building Translation Memory, detecting homographic puns, POS taggers, grammar and spell checkers, named entity recognisers, and chatbots, all of which required a working understanding of Natural Language Processing.

I am fluent in English and Hindi and have a basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Coding Challenge

I successfully completed the coding challenge and proposed an alternate method to process the input as a chunk which resulted in a speedup of more than 2x.

The repo can be found at: https://github.com/khannatanmai/apertium

Files in Repo:

Code to do basic anaphora resolution (last seen noun), input taken as a stream (byte by byte)
Code to do basic anaphora resolution (last seen noun), input taken as a stream (as a chunk)
Speed-Up Report
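For readers unfamiliar with the "last seen noun" baseline, here is a rough Python illustration. It is not the code from the repository. It assumes lexical units of the form `^surface/lemma<tag>...$` with a single analysis each, uses the tag names `n`, `det` and `pos` as assumptions, and reads the whole input as one chunk rather than byte by byte. Escaped characters in the stream are not handled.

```python
import re

# One lexical unit: everything between "^" and "$".
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def last_seen_noun(stream):
    """Pair each possessive determiner with the most recently seen noun.

    Returns a list of (determiner lemma, antecedent lemma) pairs; the
    antecedent is None if no noun has been seen yet.
    """
    last_noun = None
    pairs = []
    for unit in LEXICAL_UNIT.findall(stream):
        analysis = unit.split("/")[-1]      # assume one analysis per unit
        lemma = analysis.split("<", 1)[0]
        tags = TAG.findall(analysis)
        if "n" in tags:
            last_noun = lemma
        elif "det" in tags and "pos" in tags:
            pairs.append((lemma, last_noun))
    return pairs


stream = (
    "^The/the<det><def><sp>$ ^girl/girl<n><sg>$ ^ate/eat<vblex><past>$ "
    "^his/his<det><pos><sp>$ ^apple/apple<n><sg>$"
)
print(last_seen_noun(stream))
```

This baseline ignores agreement and scoring entirely, which is exactly the gap the antecedent indicators in the proposal are meant to fill.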

Non-Summer-Of-Code Plans

I will have my college vacation during GSoC, so I will have no other commitments in that period and will be dedicated to GSoC full time, i.e. 40 hours a week.

I am planning a short trip to London from 18 May to 25 May. I will have internet access there; I will work a little less than normal during that week but will catch up afterwards.

In aligning your tasks and goals with the GSoC timeline, be sure to take this into account. —Firespeaker (talk) 18:49, 16 March 2019 (CET)
