Difference between revisions of "User:Khannatanmai"

From Apertium
 
(134 intermediate revisions by 2 users not shown)
Line 1: Line 1:
'''Google Summer of Code 2019: Proposal [Second Draft]'''

'''Anaphora Resolution'''

== Personal Details ==

'''Name:''' Tanmai Khanna


Line 14: Line 9:


'''LinkedIn:''' [http://linkedin.com/in/khannatanmai khannatanmai]

'''Current Designation:''' Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses


'''Time Zone:''' GMT+5:30
Line 19: Line 16:
== About Me ==


'''Professional Interests:''' I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.
'''Open Source Software I use:''' Apertium (which I have used in the past), Ubuntu, Firefox, VLC.

'''Professional Interests:''' I’m currently studying NLP and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

'''Hobbies:''' I love Parliamentary Debating, Singing, and Reading.

'''What I want to get out of GSoC'''

I’ve enjoyed using Apertium in various personal and academic projects and it’s amazing to me that I get an opportunity to work with them.

NLP is my passion, and I would love to work with the similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC, as the stipend would allow me to dedicate my full attention to the project during the 3 months.

=== Why is it that I am interested in Apertium and Machine Translation? ===

Apertium is an Open Source Rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation. It interests me because it’s a complex problem that my professors often call NLP-Complete: it uses most of the tools NLP has to offer, so anyone who learns to do good MT learns most of Natural Language Processing.

Each part of Apertium's mission statement, especially the focus on Low Resource Languages, interests me and makes me excited to work with them. While Neural Networks and Deep Learning are the fad these days, they only work for resource rich languages.

A tool which is rule based and open source really helps the community with language pairs that are resource poor and gives them free translations for their needs. I'm interested in working with Apertium and GSoC so that I can contribute to helping the community through the project.

== Project Proposal ==
=== Which of the published tasks am I interested in? What do I plan to do? ===

Anaphora Resolution - Apertium currently falls back to a default male pronoun. A perfect Anaphora Resolution system is not achievable in 3 months, so I propose limiting the scope of this project to using complex features to pick an antecedent, as a way to significantly increase the fluency and intelligibility of the output.

The Anaphora Resolution tool will be language agnostic and will improve the output for most language pairs in Apertium. Pronouns are present in a lot of languages and they change based on their antecedent's gender, number, person, etc. The aim is to figure out the antecedent and choose the correct pronoun accordingly.

Sentences like “The group agreed to release his mission statement” or "The girl ate his apple" read as incoherent, and an incorrect pronoun will more often than not confuse readers, especially in complex sentences.

A large part of what puts people off Machine Translation is the lack of fluency. By producing more fluent sentences, this module should increase trust in Apertium's output.

'''NOTE: Anaphora Resolution is one part of resolving long distance dependencies. The method of resolving this will not be limited to anaphora and can be used for general coreference, agreement and other long distance dependencies which need to identify the antecedent.'''

This project has promising future prospects:

* Adding Language Specific metrics for better identification
* Extending system to general coreference needed for MT
* Shallow parsing for low resource languages

I want to be a contributor to Apertium even after GSoC and would love to continue this project forward after this summer.

=== Proposed Modifications ===

I will be working with Spanish-English and Catalan-English pairs while developing the tool. However, the features will be made largely language agnostic and I will also evaluate how well it works for other language pairs which need Anaphora Resolution.

Ultimately, the system should be able to do Anaphora Resolution for the following:

* '''Possessive Pronouns'''

Spanish Sentence: La chica comió su manzana

Apertium Translation: The girl ate his apple

After Anaphora [Proposed Translation]: The girl ate '''her''' apple

* '''Zero Pronouns'''

'''Eg. 1.'''
Spanish Sentence: canta bien

Apertium Translation: It sings well

After Anaphora [Proposed Translation]: He/She/It sings well (Based on context)

'''Eg. 2.'''
Spanish Sentence: La chica está aquí, lleva un vestido rojo

Apertium Translation: The girl is here, spends a red dress

After Anaphora [Proposed Translation]: The girl is here, '''she''' spends/wears a red dress

* '''Reflexive Pronouns'''

Spanish Sentence: se mató

Apertium Translation: It killed

After Anaphora [Proposed Translation]: He/She killed himself/herself

* '''Long Distance Agreement'''

English Sentence: The table is here. It is red.

Apertium Translation: La mesa es aquí. Es rojo.

Proposed Translation: La mesa es aquí. Es '''roja.'''
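
As a rough illustration of the possessive pronoun case above, the sketch below picks an English possessive from the gender and number of an already identified antecedent. The tag names (<code>m</code>, <code>f</code>, <code>nt</code>, <code>sg</code>, <code>pl</code>), the <code>Antecedent</code> class and the lookup table are assumptions made for this example, not Apertium's actual data structures; the fallback to "his" only mirrors the current default male behaviour.

```python
from dataclasses import dataclass


@dataclass
class Antecedent:
    lemma: str
    gender: str  # "m", "f" or "nt" (hypothetical tag names)
    number: str  # "sg" or "pl"


# Hypothetical lookup table for singular English possessives.
POSSESSIVES = {
    ("m", "sg"): "his",
    ("f", "sg"): "her",
    ("nt", "sg"): "its",
}


def possessive_for(antecedent: Antecedent) -> str:
    """Return the English possessive that agrees with the antecedent."""
    if antecedent.number == "pl":
        return "their"
    # Unknown gender falls back to "his", mirroring the current default.
    return POSSESSIVES.get((antecedent.gender, antecedent.number), "his")


print(possessive_for(Antecedent("girl", "f", "sg")))   # her
print(possessive_for(Antecedent("group", "nt", "sg")))  # its
```

In the real module this choice would be made by transfer rules once the antecedent is attached to the pronoun; the function only shows the agreement logic.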

----

=== Idea Description ===

Since Apertium deals with low resource language pairs, this module will use very basic linguistic information - POS tags, gender, number information as opposed to parse trees, which require more data to be accurate.

The module will assign '''salience scores''' to all antecedents in the context window and the highest scored antecedent will be selected.
These scores will be assigned based on several linguistic indicators, listed below.

==== Antecedent Indicators ====

'''Boosting Indicators'''

* Heads of NP
* First NPs
* Lexical Reiteration: Lexically reiterated items are likely candidates for antecedents.
* Section Heading Preference
* Collocation Pattern Preference: This preference is given to candidates which have an identical collocation pattern with a pronoun.
* Immediate reference (if it is pronoun then its reference)
* Sequential Instructions

'''Impeding Indicators'''

* Indefiniteness
* Prepositional NPs: NPs which are part of a PP are penalised.

Reference : [https://link.springer.com/content/pdf/10.1023%2FA%3A1011184828072.pdf Multilingual Anaphora Resolution, Ruslan Mitkov]
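
The scoring idea described above can be sketched as follows: each candidate collects the indicators that apply to it, boosting indicators add to its score, impeding indicators subtract from it, and the highest scoring candidate is selected. The indicator names, the weights and the tie-breaking rule (prefer the most recent candidate) are all assumptions chosen for illustration; actual weights would have to be tuned during the project.

```python
from dataclasses import dataclass, field

# Hypothetical weights: positive for boosting, negative for impeding indicators.
WEIGHTS = {
    "first_np": 1,
    "lexical_reiteration": 2,
    "section_heading": 1,
    "collocation_match": 2,
    "indefinite": -1,
    "prepositional_np": -1,
}


@dataclass
class Candidate:
    lemma: str
    indicators: set = field(default_factory=set)


def salience(candidate: Candidate) -> int:
    """Sum the weights of all indicators that fired for this candidate."""
    return sum(WEIGHTS.get(name, 0) for name in candidate.indicators)


def best_antecedent(candidates):
    """Return the highest scoring candidate.

    Candidates are expected in text order; on a tie the later (more
    recent) one wins. This tie-break is an assumption of the sketch.
    """
    best = None
    for cand in candidates:
        if best is None or salience(cand) >= salience(best):
            best = cand
    return best


window = [
    Candidate("girl", {"first_np", "lexical_reiteration"}),
    Candidate("table", {"indefinite", "prepositional_np"}),
]
print(best_antecedent(window).lemma)  # girl
```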

==== Salience Features to find Coreference ====

* Unique in Discourse
If there is a single possible antecedent i in the read-in portion of the entire discourse, then pick i as the antecedent: 8 correct, and 0 incorrect.

* Reflexive
Pick nearest possible antecedent in read-in portion of current sentence if the anaphor is a reflexive pronoun: 16 correct, and 1 incorrect.

* Unique in Current + Prior
If there is a single possible antecedent i in the prior sentence and the read-in portion of the current sentence, then pick i as the antecedent: 114 correct, and 2 incorrect.

* Possessive Pronoun
If the anaphor is a possessive pronoun and there is a single exact string match i of the possessive in the prior sentence, then pick i as the antecedent: 4 correct, and 1 incorrect.

* Unique Subject / Subject Pronoun
If the subject of the prior sentence contains a single possible antecedent i, and the anaphor is the subject of the current sentence, then pick i as the antecedent: 11 correct, and 0 incorrect.

Reference : [https://aclweb.org/anthology/W97-1306 CogNIAC: High Precision Coreference with Limited Knowledge and Linguistic Resources, Breck Baldwin]
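
A minimal sketch of how such rules could be applied in order, with the first rule that fires deciding the antecedent. Only three of the five rules listed above are shown. The dictionary representation of mentions, the agreement check on gender and number, and the choice to leave the anaphor unresolved when no rule fires are assumptions of this sketch.

```python
def compatible(anaphor, mentions):
    """Mentions that agree with the anaphor in gender and number."""
    key = (anaphor["gender"], anaphor["number"])
    return [m for m in mentions if (m["gender"], m["number"]) == key]


def unique_in_discourse(anaphor, earlier, prior, current):
    cands = compatible(anaphor, earlier + prior + current)
    return cands[0] if len(cands) == 1 else None


def reflexive(anaphor, earlier, prior, current):
    if not anaphor.get("reflexive", False):
        return None
    cands = compatible(anaphor, current)
    return cands[-1] if cands else None  # nearest = most recently read


def unique_in_current_and_prior(anaphor, earlier, prior, current):
    cands = compatible(anaphor, prior + current)
    return cands[0] if len(cands) == 1 else None


RULES = [unique_in_discourse, reflexive, unique_in_current_and_prior]


def resolve(anaphor, earlier, prior, current):
    """Apply the rules in order; return None if no rule fires."""
    for rule in RULES:
        antecedent = rule(anaphor, earlier, prior, current)
        if antecedent is not None:
            return antecedent
    return None


girl = {"lemma": "girl", "gender": "f", "number": "sg"}
woman = {"lemma": "woman", "gender": "f", "number": "sg"}
she = {"lemma": "she", "gender": "f", "number": "sg"}
# Two candidates in the whole discourse, but only one in the prior sentence.
print(resolve(she, earlier=[girl], prior=[woman], current=[])["lemma"])  # woman
```

Here <code>earlier</code> holds mentions from sentences before the prior one, and <code>current</code> holds the read-in portion of the current sentence.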

== Work Plan ==
:: Would be good to line these tasks up with weeks in the program (including the community bonding period). —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 18:48, 16 March 2019 (CET)


'''Community Bonding Period''' (May 6 - May 27)
* Understand the Apertium pipeline fully
* Modify and Understand individual files
* Get familiar with the files that I need to modify
* Formalise the problem, limit the scope of anaphora resolution (To Anaphora needed for MT)
* Flowchart of proposed system and Pseudocode
* Study the EuroParl corpus and see which anaphors the method will be able to resolve on paper

'''Week 1''' (May 27)
* Automatic Annotation of anaphora for evaluation (EuroParl Corpus)
* Implement a preliminary scoring system for antecedent indicators [work for Spanish-English and Catalan-English for now]
* Decide on a definite context window

'''Week 2''' (June 3)
* Implement basic anaphora outside the pipeline (python)
* Implement transfer rules for normal pronouns and possessive pronouns
* Implement transfer rules for verbs (for zero pronouns)
* A basic prototype of final system ready

'''Week 3''' (June 10)
* Write the code in C++ (All further coding to be done in C++)
* TEST the system extensively
* Document the outline
* Implement system to work out all possible antecedents

'''Week 4''' (June 17)
* Add ability to give antecedents a score
* TEST basic sentences with single antecedents, Test the pipeline
* Test and Evaluate for Normal, Possessive, Zero Pronouns in Spa-Eng pair
* Test and Evaluate for Normal, Possessive, Zero Pronouns in Cat-Eng pair

=== Deliverable #1: Anaphora Resolution for single antecedents, with transfer rules [The full pipeline] ===

'''Evaluation 1: June 24-28'''

'''Week 5''' (June 28)
* Implement Antecedent Indicators - Boosting Indicators:
* Code to Identify Boosting Indicators

'''Week 6''' (July 4)
* Code to Identify Impeding Indicators
* Code to Identify Agreement Antecedent
* Implement transfer rules for agreement in adjectives for Cat-Eng & Spa-Eng
* Code to remember antecedents for a certain window
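
Remembering antecedents for a fixed window could look roughly like the sketch below, which keeps the noun phrases of the last few sentences and returns them most recent first. The class name, the default window of three sentences and the use of plain strings for noun phrases are assumptions; the actual window size is to be decided in Week 1.

```python
from collections import deque


class AntecedentWindow:
    """Keeps noun phrases from the most recent sentences only."""

    def __init__(self, max_sentences=3):
        # A bounded deque silently drops the oldest sentence when full.
        self._sentences = deque(maxlen=max_sentences)

    def push_sentence(self, noun_phrases):
        self._sentences.append(list(noun_phrases))

    def candidates(self):
        """All remembered noun phrases, most recent first."""
        return [
            np
            for sentence in reversed(self._sentences)
            for np in reversed(sentence)
        ]


memory = AntecedentWindow(max_sentences=2)
memory.push_sentence(["girl"])
memory.push_sentence(["apple", "table"])
memory.push_sentence(["dress"])  # the first sentence falls out of the window
print(memory.candidates())  # ['dress', 'table', 'apple']
```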

'''Week 7''' (July 10)
* Implement remaining transfer rules for anaphora in pronouns Cat-Eng & Spa-Eng
* Give scores to the antecedent indicators
* Code Salience Indicators & Implement tie breaking systems
* Modify scoring system based on performance in the pairs

'''Week 8''' (July 16)
* Implement fallback for anaphora (in case of too many antecedents or not past certainty threshold)
* TEST Scoring System
* TEST and Evaluate current system and produce precision and recall
* TEST and Evaluate Agreement for Adjectives
* Document Antecedent Indicators, Scoring System, Fallback for Cat-Eng & Spa-Eng
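
The fallback planned for Week 8 might be sketched as below: if there are too many candidates, or the best score does not pass a certainty threshold, the module returns a default instead of committing to an antecedent. The function name, the threshold of 2 and the limit of 5 candidates are placeholder assumptions, not values from the proposal.

```python
def choose_with_fallback(scores, threshold=2, max_candidates=5, default=None):
    """Pick the top scoring candidate, or fall back to `default`.

    scores: dict mapping a candidate to its salience score.
    """
    if not scores or len(scores) > max_candidates:
        return default
    top = max(scores, key=scores.get)
    if scores[top] < threshold:
        return default
    return top


print(choose_with_fallback({"girl": 3, "apple": 1}))  # girl
print(choose_with_fallback({"girl": 1}))              # None
```

Returning a default rather than a low confidence guess keeps the existing behaviour of the pair whenever the module is unsure.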

=== Deliverable #2: Anaphora Resolution with antecedent scores, fallback mechanism, for Spa-Eng and Cat-Eng ===

'''Evaluation 2: July 22-26'''

'''Week 9''' [OPTIONAL: If current system not producing good enough results]
* Implement Expectation-Maximization Algorithm using monolingual corpus
* Implement choosing the antecedent with maximum probability, in two variants: as a tie-breaker for the Scoring System, and as an independent system
* Compare and Evaluate the effectiveness of the above two possibilities
* Test EM Algorithm and implemented system

'''Week 9''' [NOT OPTIONAL] (July 26)

* Insert into Apertium pipeline
* Implement code to accept input in chunks and process it
* Output with anaphora attached

'''Week 10''' (August 1)
* EXTENSIVELY TEST final system
* Try out other language pairs
* Evaluate and find out which features are language agnostic
* Decide on list of features for agnostic anaphora and for language specific anaphora

'''Week 11''' (August 7)
* TEST on multiple pairs and give Evaluation Scores
* TEST for backwards compatibility and ensure it
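
For the evaluation scores in the work plan, precision and recall over an annotated corpus could be computed along these lines. The dictionary format (anaphor id mapped to antecedent id, with unresolved anaphors simply absent from the predictions) is an assumption of this sketch.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of resolved anaphors.

    predicted, gold: dicts mapping an anaphor id to an antecedent id.
    Anaphors the system left unresolved are absent from `predicted`.
    """
    correct = sum(1 for key, value in predicted.items() if gold.get(key) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {1: "girl", 2: "table", 3: "group", 4: "boy"}
predicted = {1: "girl", 2: "apple"}  # anaphors 3 and 4 left unresolved
print(precision_recall(predicted, gold))  # (0.5, 0.25)
```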

'''Week 12''' (August 13)
* Wrap up on the final module
* Complete the overall documentation with observations and future prospects

'''Final Evaluations: August 19-26'''

=== Project Completed ===
'''NOTE''': Week 11 and Week 12 have extra time to deal with unforeseen issues and ideas
----

== A description of how and who it will benefit in society ==

It will definitely benefit most users of Apertium and hopefully will attract more people to the tool. I’m from India and for a lot of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

However, what discourages people from using Machine Translation is unintelligible outputs and too much post editing which makes it very time consuming and costly for them. While Apertium aims to make minimal errors, as of now it selects a default male pronoun and that leads to several unintelligible outputs. Fixing that and making the system more fluent and intelligible overall will encourage people to use Machine Translation and will reduce costs of time and money.

== Reasons why Google and Apertium should sponsor it ==

I feel this project has a wide scope as it can affect almost all language pairs and helps almost everyone using Apertium. A decent Anaphora Resolution will give the output an important boost in its fluency and intelligibility, not just for one language, but all of them.

It’s a project which has promising future prospects as well - apart from the fact that language specific features can be added to improve it, we’ll be doing anaphora resolution even for languages which don’t need it to pick the correct pronoun. Doing this will enable Apertium to do Gisting Translation in the future, for which Anaphora Resolution is essential.

This method also paves the way for general coreference resolution for low-resource languages.

With this project I aim to help the users of Apertium. I also wish to become a regular contributor to Apertium and to become equipped to do a lot more Open Source Development in the future, for other organisations as well.


'''Hobbies:''' I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni.
By funding this project, Google will help improve an important Open Source tool and promote Open Source Development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.


== Ideas ==
== Skills and Qualifications ==
I'm currently a third year student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab in IIIT Hyderabad and I'm part of the MT group in our university.


'''Constructions in low resource MT:''' [[User:Khannatanmai/Constructions]]
I've been interested in linguistics from the very beginning and, thanks to rigorous programming courses, I'm also adept at several languages and tools such as Python, C++, XML and Bash scripting. I'm skilled in Algorithms, Data Structures, and Machine Learning Algorithms as well.


== Google Summer of Code 2019 -- Anaphora Resolution ==
Due to the focused nature of our course, I have worked on several projects, such as building a Translation Memory, Detecting Homographic Puns, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers and Chatbots, all of which required a working understanding of Natural Language Processing.


'''Proposal ''': [[User:Khannatanmai/GSoC2019Proposal]]
I am fluent in English and Hindi and have basic knowledge of Spanish.


'''Final Report''': [[User:Khannatanmai/GSoC2019Report]]
The details of my skills and work experience can be found here: [https://drive.google.com/file/d/1ZGAWJQzmDlxJo-_TNr-0wqzIdNTVraPt/view CV]


== Google Summer of Code 2020 -- Markup handling with wordbound blanks ==
=== Coding Challenge ===
I successfully completed the coding challenge and proposed an alternate method to process the input as a chunk which resulted in a speedup of more than 2x.


'''Proposal: Modifying the apertium stream format and eliminating dictionary trimming: [[User:Khannatanmai/GSoC2020Proposal_Trimming]]'''
The repo can be found at: https://github.com/khannatanmai/apertium


Development of the stream extension: [[User:Khannatanmai/New_Apertium_stream_format]]
Files in Repo:
* Code to do basic anaphora resolution (last seen noun), input taken as a stream (byte by byte)
* Code to do basic anaphora resolution (last seen noun), input taken as a stream (as a chunk)
* Speed-Up Report
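
To illustrate the chunked variant of the coding challenge, here is a rough sketch of "last seen noun" resolution that reads the Apertium stream in chunks rather than byte by byte. It is not the code from the repo. It assumes lexical units of the form <code>^surface/analysis$</code>, ignores escaped characters, and the sample tags in the usage example are illustrative only.

```python
import io
import re

# A lexical unit is everything between '^' and '$' (escapes are ignored here).
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def lexical_units(stream, chunk_size=4096):
    """Yield the contents of each lexical unit, reading the stream in chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        last_end = 0
        for match in LEXICAL_UNIT.finditer(buffer):
            yield match.group(1)
            last_end = match.end()
        # Keep any incomplete unit for the next chunk.
        buffer = buffer[last_end:]


def last_seen_noun(stream, chunk_size=4096):
    """Pair each possessive determiner with the last noun seen before it."""
    last_noun = None
    pairs = []
    for unit in lexical_units(stream, chunk_size):
        surface, _, analysis = unit.partition("/")
        if "<n>" in analysis:
            last_noun = surface
        elif "<det><pos>" in analysis and last_noun is not None:
            pairs.append((surface, last_noun))
    return pairs


text = (
    "^La/el<det><def><f><sg>$ ^chica/chico<n><f><sg>$ "
    "^comió/comer<vblex><ifi><p3><sg>$ ^su/su<det><pos><mf><sp>$ "
    "^manzana/manzana<n><f><sg>$"
)
print(last_seen_noun(io.StringIO(text)))  # [('su', 'chica')]
```

Reading larger chunks reduces the number of read calls, which is the general reason a chunked reader tends to be faster than a byte by byte one.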


Eliminating Dictionary Trimming: [[User:Khannatanmai/Eliminating_Dictionary_Trimming]]
== Non-Summer-Of-Code Plans ==


'''Progress: [[User:Khannatanmai/GSoC2020Progress]]'''
I will have my college vacations during GSoC so will have no other commitments in that period and will be dedicated to GSoC full time, i.e. 40 hours a week.


Documentation of features related to secondary tags: [[User:Khannatanmai/Secondary_tags_features]]
I am planning a short trip to London from 25 May to 30 May. I will have internet access there, will stay in constant communication with my mentors, and will have finished my Community Bonding Period work by then.


Alternate stream modification proposal: [[User:Khannatanmai/Alternate_stream_modification]]
I'll be able to devote 20 hours in Week 1 because of the trip, but I will catch up in the remaining weeks with 40 hours/week till the end of the project.


Development of the updated stream extension: [[User:Khannatanmai/Secondary_info_apertium_stream_format]]
I have also kept the work load lighter in Week 1 for the same reason.


'''Development of wordbound blanks: [[User:Khannatanmai/Wordbound_blanks]]'''
:: In aligning your tasks and goals with the GSoC timeline, be sure to take this into account. —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 18:49, 16 March 2019 (CET)


'''Documentation of wordbound blanks: [[Wordbound_blanks]]'''


'''Final Report: [[User:Khannatanmai/GSoC2020_Final_Report]]'''
[[Category:GSoC 2019 student proposals|Khannatanmai]]

Latest revision as of 04:27, 9 September 2020

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com , tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Professional Interests: I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni.

Ideas

Constructions in low resource MT: User:Khannatanmai/Constructions

Google Summer of Code 2019 -- Anaphora Resolution

Proposal : User:Khannatanmai/GSoC2019Proposal

Final Report: User:Khannatanmai/GSoC2019Report

Google Summer of Code 2020 -- Markup handling with wordbound blanks

Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming

Development of the stream extension: User:Khannatanmai/New_Apertium_stream_format

Eliminating Dictionary Trimming: User:Khannatanmai/Eliminating_Dictionary_Trimming

Progress: User:Khannatanmai/GSoC2020Progress

Documentation of features related to secondary tags: User:Khannatanmai/Secondary_tags_features

Alternate stream modification proposal: User:Khannatanmai/Alternate_stream_modification

Development of the updated stream extension: User:Khannatanmai/Secondary_info_apertium_stream_format

Development of wordbound blanks: User:Khannatanmai/Wordbound_blanks

Documentation of wordbound blanks: Wordbound_blanks

Final Report: User:Khannatanmai/GSoC2020_Final_Report