Difference between revisions of "User:Khannatanmai"

 
(217 intermediate revisions by 4 users not shown)
Line 1: Line 1:
  +
== Personal Details ==
'''Google Summer of Code 2019: Proposal [First Draft]'''
 
  +
'''Name:''' Tanmai Khanna
   
  +
'''E-mail address:''' khanna.tanmai@gmail.com , tanmai.khanna@research.iiit.ac.in
'''Anaphora Resolution'''
 
   
  +
'''IRC:''' khannatanmai
== Personal Details ==
 
   
  +
'''GitHub:''' [http://github.com/khannatanmai khannatanmai]
Name: Tanmai Khanna
 
   
  +
'''LinkedIn:''' [http://linkedin.com/in/khannatanmai khannatanmai]
E-mail address: khanna.tanmai@gmail.com
 
   
  +
'''Current Designation:''' Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses
Other information that may be useful to contact you (e.g. IRC):
 
   
  +
'''Time Zone:''' GMT+5:30
IRC: khannatanmai
 
 
GitHub: khannatanmai
 
   
 
== About Me ==
 
== About Me ==
   
  +
'''Professional Interests:''' I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.
What open source software do you use?
 
 
I have used Apertium in the past, as well as Ubuntu, Firefox, and VLC.
 
 
What are your professional interests?
 
 
I’m currently studying NLP and I have a particular interest in Linguistics
 
 
What are your hobbies?
 
 
I love singing, reading, debating.
 
 
What is your skill set?
 
 
Creating NLP tools, Thorough Linguistic Analysis, Writing clean and understandable code
 
 
What do you want to get out of GSoC?
 
 
I’ve studied Apertium, and it’s amazing to me that I get an opportunity to work with the team. NLP is what I want to do in life, and working with a team to develop tools that real people use will be invaluable experience that classes simply cannot match.
 
 
=== Why is it that you are interested in Apertium? / Why am I interested in Machine Translation? ===
 
 
Apertium is an open-source, rule-based MT system. Each part of its mission statement interests me, and I am excited to work with the project. I have been part of the Machine Translation lab at my college, and MT interests me because it is a huge problem that my professors often call NLP-complete: it uses most of the tools NLP has to offer, so anyone who learns to do good MT learns most of Natural Language Processing.
 
 
While neural networks and deep learning are the fad these days, they only work for resource-rich languages. That’s why I feel a rule-based, open-source project really helps the community with language pairs that are resource-poor, giving them free translations for their needs.
 
 
== Project Proposal ==
 
=== Which of the published tasks are you interested in? What do you plan to do? ===
 
 
Anaphora Resolution. Apertium currently defaults to the male pronoun. I don’t plan to build a perfect anaphora resolution system in 3 months, but I’m confident that I can build one that works significantly better than the male default and that increases the fluency and intelligibility of the output.
 
 
The Anaphora Resolution tool will be language agnostic, so this project affects almost all language pairs in Apertium and therefore almost everyone using the tool. Pronouns are present in a lot of languages, and because they can be gendered, singular, plural, etc., we need to find out what they refer to. Why is this important?
 
 
A sentence like “The group agreed to release his mission statement” is grammatically incoherent and an incorrect pronoun will more often than not confuse people in more complex sentences.
 
What puts people off Machine Translation is the lack of fluency, and I feel this is an important contribution to fluency and will definitely generate more fluent sentences leading to more trust in this tool.
 
   
  +
'''Hobbies:''' I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni.
=== Work Plan ===
 
   
  +
== Ideas ==
* Understand the system, Get familiar with the files that I need to modify
 
* Formalise the problem, limit the scope of anaphora resolution (To Anaphora needed for MT)
 
* May include general anaphora (Need to decide on scope [for gisting])
 
* Annotation of anaphora for evaluation
 
* Flowchart of proposed system and Pseudocode
 
* Implement a scoring system for antecedent indicators [work for Target Language:English for now]
 
* Decide on a definite context window
 
* Implement basic anaphora outside the pipeline (python)
 
* Implement basic transfer rules to see if final system will work
 
* A basic prototype of final system ready
 
* Port the basic prototype to C++ (All further coding to be done in C++)
 
* TEST the system
 
* Document the outline
 
* Implement system to work out all possible antecedents
 
* Add ability to give antecedents a score
 
* TEST basic sentences with single antecedents, Test the pipeline
 
* '''Deliverable #1: Anaphora Resolution for single antecedents, with transfer rules [The full pipeline]'''
 
* Implement Antecedent Indicators:
 
* Implement Boosting Indicators
 
* Implement Impeding Indicators
 
* Implement tie breaking systems
 
* Implement fallback for anaphora (in case of too many antecedents or not past certainty threshold)
 
* Code to remember antecedents for a certain window
 
* TEST Scoring System
 
* Implement transfer rules to deal with new additions
 
* Evaluate current system and produce precision and recall
 
* Implement Expectation-Maximization Algorithm
 
* Use Monolingual corpus to get probabilities of anaphora
 
* Implement choosing anaphora with max probability:
 
* If Scoring System has a tie
 
* As an independent system
 
* Compare the effectiveness of the above two possibilities
 
* Test EM Algorithm and implemented system
 
* Evaluate if addition of EM gives us significant benefits
 
* '''Deliverable #2: Anaphora Resolution with antecedent scores, fallback mechanism, EM algorithm'''
 
* Document Antecedent Indicators, Scoring System, EM Algorithm, Fallback
 
* Insert into Apertium pipeline
 
* Implement code to accept input in chunks and process it
 
* Output with anaphora attached
 
* EXTENSIVELY TEST final system with multiple pairs, see what needs to be changed for pairs
 
* TEST for backwards compatibility and ensure it
 
* '''Project Completed'''
 
   
  +
'''Constructions in low resource MT:''' [[User:Khannatanmai/Constructions]]
=== Additional Information ===
 
* Agreement: Different for different languages?
 
* Agreement rules in Arabic, however, are different. For instance, a set of non-human items (animals, plants, objects) is referred to by a singular feminine pronoun.
 
* Since Arabic is an agglutinative language, the pronouns may appear as suffixes of verbs, nouns (e.g., in the case of possessive pronouns) and prepositions.
 
   
  +
== Google Summer of Code 2019 -- Anaphora Resolution ==
* '''Antecedent Indicators:'''
 
   
  +
'''Proposal ''': [[User:Khannatanmai/GSoC2019Proposal]]
* Boosting Indicators[Scoring different for different languages]
 
* First NPs
 
* Indicating Verbs
 
* Lexical Reiteration
 
* Section Heading Preference
 
* Collocation Pattern Preference
 
* Immediate reference (if it is pronoun then its reference)
 
* Sequential Instructions
 
   
  +
'''Final Report''': [[User:Khannatanmai/GSoC2019Report]]
* Impeding Indicators
 
* Indefiniteness
 
* Prepositional NPs
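
A minimal sketch of how a scoring system over the boosting and impeding indicators listed above could look. The weights, the feature names and the recency tie-break are all assumptions made for illustration; the proposal does not fix any values and notes that scoring may differ between languages.

```python
from dataclasses import dataclass

# Hypothetical weights: positive for boosting indicators, negative for impeding ones.
BOOSTING = {"first_np": 1, "indicating_verb": 1, "lexical_reiteration": 2}
IMPEDING = {"indefinite": -1, "prepositional_np": -1}


@dataclass
class Candidate:
    text: str
    position: int        # token index; a larger value is closer to the pronoun
    features: frozenset  # names of the indicators that fire for this noun phrase


def score(candidate):
    """Sum the weights of every indicator that fires for the candidate."""
    weights = {**BOOSTING, **IMPEDING}
    return sum(weights.get(name, 0) for name in candidate.features)


def resolve(candidates):
    """Return the best-scoring candidate; ties go to the most recent one (assumed rule)."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: (score(c), c.position))


candidates = [
    Candidate("the group", 1, frozenset({"first_np"})),
    Candidate("a statement", 6, frozenset({"indefinite"})),
]
best = resolve(candidates)
print(best.text)
```

Keeping the weights in plain dictionaries means a language pair could swap in its own values without touching the scoring logic.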
 
   
  +
== Google Summer of Code 2020 -- Markup handling with wordbound blanks ==
=== Reasons why Google and Apertium should sponsor it: ===
 
   
  +
'''Proposal: Modifying the apertium stream format and eliminating dictionary trimming: [[User:Khannatanmai/GSoC2020Proposal_Trimming]]'''
I feel that this project affects almost all language pairs in Apertium and hence almost everyone using Apertium. A decent anaphora resolution will give the output an important boost in its fluency and intelligibility, not just for one language, but for all of them.
 
It’s a project which has promising future prospects - apart from the fact that language specific features can be added to improve it, we’ll be doing anaphora resolution even for languages which don’t need it to pick the correct pronoun. Doing this will enable Apertium to do gisting translation, which is an important tool and anaphora resolution is an essential cog in that wheel.
 
   
  +
Development of the stream extension: [[User:Khannatanmai/New_Apertium_stream_format]]
=== A description of how and who it will benefit in society: ===
 
It will definitely benefit most users of Apertium and hopefully will attract more people to the tool. I’m from India and for a lot of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.
 
   
  +
Eliminating Dictionary Trimming: [[User:Khannatanmai/Eliminating_Dictionary_Trimming]]
I feel that currently what repels people from Machine Translation is unintelligible outputs and too much post editing which makes it useless for them. While Apertium aims to make minimal errors, as of now it selects a default male pronoun and that leads to several unintelligible outputs. Fixing that and making the system more fluent and intelligible overall should definitely attract people to using Machine Translation and help them to reduce costs of time and money.
 
   
  +
'''Progress: [[User:Khannatanmai/GSoC2020Progress]]'''
=== Skills and Qualifications ===
 
I'm currently a third year student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP, etc. right from the start. I've been interested in linguistics from the start and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, etc.
 
   
  +
Documentation of features related to secondary tags: [[User:Khannatanmai/Secondary_tags_features]]
Due to the focused nature of our course, I've worked on several projects, such as building a Translation Memory, detecting homographic puns, POS taggers, grammar and spell checkers, named entity recognisers, and chatbots, all of which required a working understanding of Natural Language Processing.
 
   
  +
Alternate stream modification proposal: [[User:Khannatanmai/Alternate_stream_modification]]
=== Coding Challenge ===
 
I successfully completed the coding challenge and proposed an alternate method to process the input as a chunk which resulted in a speedup of more than 2x.
 
   
  +
Development of the updated stream extension: [[User:Khannatanmai/Secondary_info_apertium_stream_format]]
The repo can be found at: https://github.com/khannatanmai/apertium
 
   
  +
'''Development of wordbound blanks: [[User:Khannatanmai/Wordbound_blanks]]'''
Files in Repo:
 
* Code to do basic anaphora resolution (last seen noun), input taken as a stream (byte by byte)
 
* Code to do basic anaphora resolution (last seen noun), input taken as a stream (as a chunk)
 
* Speed-Up Report
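
For illustration, the "last seen noun" baseline described above might look roughly like the sketch below. This is not the code from the repo. It assumes a simplified stream in which each lexical unit has the form `^surface/analysis$`, with nouns tagged `<n>` and pronouns tagged `<prn>`; the function name and the sample input are hypothetical.

```python
import re

# Matches one simplified lexical unit of the form ^surface/analysis$.
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]+)\$")


def resolve_last_seen_noun(stream):
    """Pair each pronoun with the most recent noun seen before it."""
    last_noun = None
    pairs = []
    for surface, analysis in LEXICAL_UNIT.findall(stream):
        if "<n>" in analysis:
            last_noun = surface
        elif "<prn>" in analysis:
            pairs.append((surface, last_noun))
    return pairs


text = "^girl/girl<n><sg>$ ^saw/see<vblex><past>$ ^her/prpers<prn><obj>$"
print(resolve_last_seen_noun(text))  # [('her', 'girl')]
```

A pronoun that appears before any noun is paired with `None`, which a fuller system would hand to a fallback.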
 
   
  +
'''Documentation of wordbound blanks: [[Wordbound_blanks]]'''
=== Non-Summer-Of-Code Plans ===
 
   
  +
'''Final Report: [[User:Khannatanmai/GSoC2020_Final_Report]]'''
I will have a 3-month vacation from May to July, so I will have no other commitments in that period and will be dedicated to GSoC full time (40 hours?)
 
I am going on a short trip to London from 15 May to 22 May. I will have internet access there; I will work a little less than normal that week but will catch up afterwards.
 

Latest revision as of 04:27, 9 September 2020

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com , tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Professional Interests: I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni.

Ideas

Constructions in low resource MT: User:Khannatanmai/Constructions

Google Summer of Code 2019 -- Anaphora Resolution

Proposal : User:Khannatanmai/GSoC2019Proposal

Final Report: User:Khannatanmai/GSoC2019Report

Google Summer of Code 2020 -- Markup handling with wordbound blanks

Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming

Development of the stream extension: User:Khannatanmai/New_Apertium_stream_format

Eliminating Dictionary Trimming: User:Khannatanmai/Eliminating_Dictionary_Trimming

Progress: User:Khannatanmai/GSoC2020Progress

Documentation of features related to secondary tags: User:Khannatanmai/Secondary_tags_features

Alternate stream modification proposal: User:Khannatanmai/Alternate_stream_modification

Development of the updated stream extension: User:Khannatanmai/Secondary_info_apertium_stream_format

Development of wordbound blanks: User:Khannatanmai/Wordbound_blanks

Documentation of wordbound blanks: Wordbound_blanks

Final Report: User:Khannatanmai/GSoC2020_Final_Report