Difference between revisions of "User:Khannatanmai"
== Personal Details == |
|||
'''Google Summer of Code 2019: Proposal [First Draft] [INCOMPLETE]''' |
|||
'''Name:''' Tanmai Khanna |
|||
'''E-mail address:''' khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in |
|||
Google Summer of Code Proposal 2019 |
|||
'''IRC:''' khannatanmai |
|||
Apertium |
|||
'''GitHub:''' [http://github.com/khannatanmai khannatanmai] |
|||
=== Anaphora Resolution === |
|||
'''LinkedIn:''' [http://linkedin.com/in/khannatanmai khannatanmai] |
|||
Who am I? |
|||
'''Current Designation:''' Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses |
|||
What open source software do you use? |
|||
'''Time Zone:''' GMT+5:30 |
|||
I have used Apertium in the past, as well as Ubuntu, Firefox and VLC. |
|||
== About Me == |
|||
What are your professional interests? |
|||
'''Professional Interests:''' I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components. |
||
'''Hobbies:''' I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni. |
|||
What are your hobbies? |
|||
== Ideas == |
|||
I love singing, reading, debating. |
|||
'''Constructions in low resource MT:''' [[User:Khannatanmai/Constructions]] |
|||
What is your skill set? |
|||
== Google Summer of Code 2019 -- Anaphora Resolution == |
|||
Creating NLP tools, thorough linguistic analysis, and writing clean, understandable code. |
|||
'''Proposal''': [[User:Khannatanmai/GSoC2019Proposal]] |
|||
What do you want to get out of GSoC? |
|||
'''Final Report''': [[User:Khannatanmai/GSoC2019Report]] |
|||
I’ve studied Apertium, and it’s amazing to me that I have the opportunity to work with the project. NLP is what I want to do in life, and working with a team to develop tools that real people use will be invaluable experience that classes simply cannot match. |
|||
== Google Summer of Code 2020 -- Markup handling with wordbound blanks == |
|||
=== Personal Details === |
|||
'''Proposal: Modifying the apertium stream format and eliminating dictionary trimming: [[User:Khannatanmai/GSoC2020Proposal_Trimming]]''' |
|||
Name: Tanmai Khanna |
|||
Development of the stream extension: [[User:Khannatanmai/New_Apertium_stream_format]] |
|||
E-mail address: khanna.tanmai@gmail.com |
|||
Eliminating Dictionary Trimming: [[User:Khannatanmai/Eliminating_Dictionary_Trimming]] |
|||
Other information that may be useful to contact you (e.g. IRC): |
|||
'''Progress: [[User:Khannatanmai/GSoC2020Progress]]''' |
|||
IRC: khannatanmai |
|||
Documentation of features related to secondary tags: [[User:Khannatanmai/Secondary_tags_features]] |
|||
GitHub: khannatanmai |
|||
Alternate stream modification proposal: [[User:Khannatanmai/Alternate_stream_modification]] |
|||
=== Why is it that you are interested in Apertium? === |
|||
Development of the updated stream extension: [[User:Khannatanmai/Secondary_info_apertium_stream_format]] |
|||
Apertium is an open-source, rule-based MT system. Each part of its mission statement interests me, and I am excited to work with the project. I have been part of the Machine Translation lab at my college, and MT interests me because it is a huge problem that my professors often call NLP-complete: it uses most of the tools NLP has to offer, so anyone who learns to do good MT learns most of Natural Language Processing. |
|||
While neural networks and deep learning are the fad these days, they only work for resource-rich languages. That is why I feel a rule-based, open-source project really helps the community with language pairs that are resource-poor and gives them free translations for their needs. |
|||
'''Development of wordbound blanks: [[User:Khannatanmai/Wordbound_blanks]]''' |
|||
=== Which of the published tasks are you interested in? What do you plan to do? === |
|||
'''Documentation of wordbound blanks: [[Wordbound_blanks]]''' |
|||
Anaphora Resolution - The system currently defaults to a male pronoun. I don’t plan to build a perfect anaphora resolution module in 3 months, but I’m confident that I can build one that works better than the male default. |
|||
'''Final Report: [[User:Khannatanmai/GSoC2020_Final_Report]]''' |
|||
I feel that this project affects almost all language pairs in Apertium and hence almost everyone using the tool. Pronouns are present in many languages, and since they can vary in gender, number, etc., we need to find out what they refer to. Why is this important? |
|||
A sentence like “The group agreed to release his mission statement” is grammatically incoherent and an incorrect pronoun will more often than not confuse people in more complex sentences. |
|||
What puts people off Machine Translation is the lack of fluency, and I feel this is an important contribution to fluency and will definitely generate more fluent sentences leading to more trust in this tool. |
|||
=== IDEAS === |
|||
Use elimination to figure out which noun. |
|||
For example, heads of chunks can be referred to. |
|||
Can we use semantics or not? |
|||
Should it be language independent? Little or no dependence on external tools such as WordNet, FrameNet, etc. |
|||
For basic sentences, it will definitely help (with female subjects) |
|||
When translating from languages with gender in pronouns, retain that info. Can be used for anaphora resolution in target language. |
|||
If it knows about animacy, I can give accurate results in the coding challenge. |
|||
=== QUESTIONS === |
|||
Is this tool supposed to be language independent? For example, anaphora resolution for English can use certain tools that capture semantics to perform better. If it is language independent, then we can’t depend on external tools, and we would need solutions that use only the information available in the biltrans output. |
|||
Suddenly replacing the default male system with another one could give worse results. Going step by step makes more sense. For example, in sentences with just one noun, if that noun is female, the later pronoun has to be female, so we should use a female anaphor. E.g. “La chica comió su manzana” currently translates to “The girl ate his apple”. |
|||
Even without touching the default male system, if the only candidate antecedent is female, the anaphor should be female. |
|||
Apart from this, I feel a good method might be to use elimination to figure out the best antecedent for an anaphor. Biltrans does seem to have some element of animacy. We can use that to eliminate. Also, if a chunk exists, such as “Groups of the Parliament”, the head of the chunk is “groups” and it is more likely that an anaphor refers to the head of a chunk. |
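The elimination idea above can be sketched outside the pipeline in Python. This is only an illustration under stated assumptions: the `Candidate` fields, the `choose_possessive` function and the English pronoun choices are hypothetical and are not taken from Apertium or from the biltrans format. The sketch prefers chunk heads, overrides the default only when exactly one candidate antecedent remains, and otherwise keeps the existing default-male behaviour.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A noun preceding the pronoun (all field names are hypothetical)."""
    lemma: str
    gender: Optional[str]       # "m", "f", or None if unknown
    number: str = "sg"          # "sg" or "pl"
    animate: bool = True
    is_chunk_head: bool = True  # False for "Parliament" in "Groups of the Parliament"


DEFAULT_POSSESSIVE = "his"  # mirrors the current default-male behaviour


def choose_possessive(candidates: List[Candidate]) -> str:
    """Return an English possessive for the anaphor, or the default if unsure."""
    # Prefer chunk heads: drop non-head nouns if at least one head exists.
    heads = [c for c in candidates if c.is_chunk_head]
    remaining = heads or candidates

    # Only override the default when elimination leaves exactly one candidate.
    if len(remaining) != 1:
        return DEFAULT_POSSESSIVE

    only = remaining[0]
    if only.number == "pl":
        return "their"
    if not only.animate:
        return "its"
    if only.gender == "f":
        return "her"
    return DEFAULT_POSSESSIVE


# "La chica comió su manzana": the only candidate antecedent is female.
print(choose_possessive([Candidate("girl", "f")]))
```

With two or more surviving candidates the function returns the default, which keeps the step-by-step approach: the default-male system is untouched except in the cases where elimination gives a single answer.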
|||
=== STUFF TO DO === |
|||
Understand the system, etc. |
|||
Annotation of anaphora for evaluation |
|||
Implement basic anaphora outside the pipeline (python) |
|||
Implement basic transfer rules to see if final system will work |
|||
A basic prototype of final system ready - in C++ |
|||
Update it to work by elimination; if more than one candidate remains, default to male |
|||
Implement elimination rules |
|||
Implement method to extract remaining nouns with verb and pronoun |
|||
Implement Expectation-Maximization Algorithm |
|||
Use Monolingual corpus to get probabilities of anaphora |
|||
Implement choosing anaphora with max probability |
|||
TEST |
|||
Insert into Apertium pipeline |
|||
Implement transfer rules to deal with new additions |
|||
TEST final system with multiple pairs |
|||
TEST for backwards compatibility |
|||
Intelligent fallback mechanism |
|||
Define scope of problem |
|||
Clear flowchart of progress |
|||
Agreement: Different for different languages? |
|||
Agreement rules in Arabic, however, are different. For instance, a set of non-human items (animals, plants, objects) is referred to by a singular feminine pronoun. |
|||
Since Arabic is an agglutinative language, the pronouns may appear as suffixes of verbs, nouns (e.g., in the case of possessive pronouns) and prepositions. |
|||
Antecedent Indicators: |
|||
Boosting Indicators [scoring differs across languages] |
|||
First NPs |
|||
Indicating Verbs |
|||
Lexical Reiteration |
|||
Section Heading Preference |
|||
Collocation Pattern Preference |
|||
Immediate reference (if it is pronoun then its reference) |
|||
Sequential Instructions |
|||
Impeding Indicators |
|||
Indefiniteness |
|||
Prepositional NPs |
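One way to combine the indicators listed above is to give each one a weight and pick the candidate with the highest total. The following Python sketch assumes this additive scheme; the indicator keys, the weights of +1 and -1, and the tie-breaking rule (prefer the most recent candidate) are illustrative assumptions, not values from any published system. Per-language scoring is modelled by passing a different weight table.

```python
from typing import Dict, List, Optional, Set, Tuple

# Illustrative weights only: boosting indicators add, impeding ones subtract.
DEFAULT_WEIGHTS: Dict[str, int] = {
    "first_np": 1,
    "indicating_verb": 1,
    "lexical_reiteration": 1,
    "section_heading": 1,
    "collocation_pattern": 1,
    "immediate_reference": 1,
    "sequential_instructions": 1,
    "indefinite": -1,
    "prepositional_np": -1,
}


def score(indicators: Set[str], weights: Optional[Dict[str, int]] = None) -> int:
    """Sum the weights of the indicators that fire for one candidate."""
    table = DEFAULT_WEIGHTS if weights is None else weights
    return sum(table.get(name, 0) for name in indicators)


def best_antecedent(
    candidates: List[Tuple[str, Set[str]]],
    weights: Optional[Dict[str, int]] = None,
) -> str:
    """Return the highest-scoring candidate; ties go to the later candidate."""
    if not candidates:
        raise ValueError("no candidate antecedents")
    # Candidates are in text order, so a larger index is closer to the pronoun.
    _, (lemma, _) = max(
        enumerate(candidates),
        key=lambda item: (score(item[1][1], weights), item[0]),
    )
    return lemma


candidates = [
    ("group", {"first_np", "lexical_reiteration"}),
    ("statement", {"indefinite", "prepositional_np"}),
]
print(best_antecedent(candidates))
```

Keeping the weights in a plain dictionary means a language pair could supply its own table without changing the scoring code.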
|||
* a title: |
|||
* reasons why Google and Apertium should sponsor it: |
|||
I feel that this project affects almost all language pairs in Apertium and hence almost everyone using Apertium. A decent anaphora resolution module will give the output an important boost in its fluency and intelligibility, not just for one language, but for all of them. |
|||
It’s a project which has promising future prospects - apart from the fact that language specific features can be added to improve it, we’ll be doing anaphora resolution even for languages which don’t need it to pick the correct pronoun. Doing this will enable Apertium to do gisting translation, which is an important tool and anaphora resolution is an essential cog in that wheel. |
|||
* a description of how and who it will benefit in society: |
|||
It will definitely benefit most users of Apertium and hopefully will attract more people to the tool. I’m from India and for a lot of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already. |
|||
I feel that currently what repels people from Machine Translation is unintelligible outputs and too much post editing which makes it useless for them. While Apertium aims to make minimal errors, as of now it selects a default male pronoun and that leads to several unintelligible outputs. Fixing that and making the system more fluent and intelligible overall should definitely attract people to using Machine Translation and help them to reduce costs of time and money. |
|||
* and a detailed work plan (including, if possible, a schedule with milestones and deliverables). |
|||
=== Work plan === |
|||
* Week 1: |
|||
* Week 2: |
|||
* Week 3: |
|||
* Week 4: |
|||
* '''Deliverable #1''' |
|||
* Week 5: |
|||
* Week 6: |
|||
* Week 7: |
|||
* Week 8: |
|||
* '''Deliverable #2''' |
|||
* Week 9: |
|||
* Week 10: |
|||
* Week 11: |
|||
* Week 12: |
|||
* '''Project completed''' |
|||
Include time needed to think, to program, to document and to disseminate. |
|||
If you are intending to disseminate to a conference, which conference are you intending to submit to? Make sure to factor in time taken to run any experiments/evaluations and write them up in your work plan. |
|||
List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. |
|||
List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project. |
|||
Reasons why Google and Apertium should sponsor it |
|||
A description of how and who it will benefit in society |
|||
Why am I interested in Machine Translation? |
|||
Coding Challenge |
Latest revision as of 04:27, 9 September 2020
Personal Details
Name: Tanmai Khanna
E-mail address: khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in
IRC: khannatanmai
GitHub: khannatanmai
LinkedIn: khannatanmai
Current Designation: Graduate Researcher in the LTRC Lab, IIIT Hyderabad (5th year student) and a Teaching Assistant for Linguistics courses
Time Zone: GMT+5:30
About Me
Professional Interests: I’m currently doing research in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.
Hobbies: I enjoy playing the bass, singing, reading, and used to be super into Parliamentary Debating in uni.
Ideas
Constructions in low resource MT: User:Khannatanmai/Constructions
Google Summer of Code 2019 -- Anaphora Resolution
Proposal: User:Khannatanmai/GSoC2019Proposal
Final Report: User:Khannatanmai/GSoC2019Report
Google Summer of Code 2020 -- Markup handling with wordbound blanks
Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming
Development of the stream extension: User:Khannatanmai/New_Apertium_stream_format
Eliminating Dictionary Trimming: User:Khannatanmai/Eliminating_Dictionary_Trimming
Progress: User:Khannatanmai/GSoC2020Progress
Documentation of features related to secondary tags: User:Khannatanmai/Secondary_tags_features
Alternate stream modification proposal: User:Khannatanmai/Alternate_stream_modification
Development of the updated stream extension: User:Khannatanmai/Secondary_info_apertium_stream_format
Development of wordbound blanks: User:Khannatanmai/Wordbound_blanks
Documentation of wordbound blanks: Wordbound_blanks
Final Report: User:Khannatanmai/GSoC2020_Final_Report