User:Shobhit Gautam 1503

From Apertium
Revision as of 19:45, 12 April 2021 by Shobhit Gautam 1503 (talk | contribs) (GSOC 2021 Proposal: UD and Apertium Integration)

Google Summer of Code 2021 proposal

Contact

Name: Shobhit Raj Gautam

Email: srg996@gmail.com

IRC: Shobhit1503, Shobhit_1503

Current Status: Final-semester, final-year student at IIIT Hyderabad, with hands-on experience in system architecture/design as well as NLP and IRE. Related to MT, I have built a prototype bidirectional RNN encoder-decoder with an attention layer.

GitHub: https://github.com/Shobhit-srg

Linkedin: https://www.linkedin.com/in/shobhit-gautam-1503/

Timezone: GMT+5:30

About Me:

I am a final-semester, final-year student at IIIT Hyderabad with hands-on experience in system architecture and design as well as NLP and IRE. Related to MT, I have built a prototype bidirectional RNN encoder-decoder with an attention layer. I was also involved in designing an IoT application platform that can host several microservices, each running a different application. Through this work I have gained solid working knowledge of deep learning and of building new frameworks.

Interests

I am interested in NLP, data mining, math and system design. I love competitive challenges and exploring new technologies as much as possible.

Why I am interested in Apertium

Apertium is an open-source rule-based MT system. Given how many different languages exist around the globe, the difficulties people face simply because they cannot translate another language into their own are a huge setback and a technological challenge. Apertium is a very bright initiative to bridge the language barrier, and because it is rule-based it does not depend on large corpora, unlike deep neural networks, which cannot fulfil the task of preserving linguistic diversity simply because endangered languages don’t offer sufficient data.

Because Apertium is rule-based, its resources can be enriched, and the way particular translations are defined and their ambiguities resolved can be controlled from the developer's end, which proves more efficient for lesser-known languages. It is a dream organization for me because I get to contribute to a cause that can help every person around the world communicate anywhere without a language barrier, and I hope to become part of this amazing community of developers!

I have been constantly active on the IRC channel and have devoted time to learning how Apertium works, its rules for different languages (especially Indian languages), and its methods for resolving ambiguity and writing lexical selection rules.

Besides, the project I chose provides a valuable resource to linguists, thus helping to develop better translations!

Proposal: UD and Apertium integration

My Idea and Deliverables

My proposal is to:

  • Enhance the UD Annotatrix interface by resolving as many existing issues as possible and adding new features that make it easier for new users to understand.
  • Once the interface is in place without issues, work on lttoolbox relabelling, i.e. tagset conversion to the Apertium API, including how to resolve ambiguity in tagset embedding and relabelling.

UD Annotatrix provides important information about different treebanks and is therefore a vital part of the Apertium project, but it has many open issues and is hard to use because of bugs in the UI; there is also no way to upload files or load them from a URL. For a new user it is very hard to understand how the UI works. My plan is to set up the UI so that even a new user without complete prior knowledge of UD (CoNLL-U) can use the service and make treebanks. The UI will also have hover labels, usage instructions, modified shortcuts, easy upload, and a way to store enhanced dependencies, and more if needed.

I will also add a new method of representing UD trees, as mentioned in this paper. It will automatically check for potential errors and point them out.
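To give a rough idea of what an automatic error check could involve, here is a minimal Python sketch. It is only my own illustration, not the method from the paper and not existing UD Annotatrix code: the function name `check_conllu_sentence` is hypothetical, and the checks are limited to basic CoNLL-U structure (10 tab-separated columns, numeric HEAD values, exactly one root, no missing heads, no cycles).

```python
from typing import List


def check_conllu_sentence(lines: List[str]) -> List[str]:
    """Return human-readable problems found in one CoNLL-U sentence."""
    problems = []
    heads = {}  # token id -> head id, for plain integer ids only
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"line {line_no}: expected 10 columns, found {len(cols)}")
            continue
        tok_id, head = cols[0], cols[6]
        if not tok_id.isdigit():
            continue  # multiword ranges (1-2) and empty nodes (1.1) are not checked here
        if not head.isdigit():
            problems.append(f"line {line_no}: HEAD '{head}' is not a number")
            continue
        heads[int(tok_id)] = int(head)

    roots = [tok for tok, head in heads.items() if head == 0]
    if heads and len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for tok, head in heads.items():
        if head != 0 and head not in heads:
            problems.append(f"token {tok}: HEAD {head} does not exist")
    # Follow HEAD links upward; revisiting a token means the tree has a cycle.
    for tok in heads:
        seen, cur = set(), tok
        while cur in heads and cur != 0:
            if cur in seen:
                problems.append(f"token {tok}: cycle in HEAD chain")
                break
            seen.add(cur)
            cur = heads[cur]
    return problems
```

An empty list means none of these basic checks failed; in the interface, each returned message could be shown next to the offending token.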

The first deliverable will be a complete new UI with issues resolved and new features added.

Furthermore, Apertium supports multiple languages and therefore a large number of tags, many of which require relabelling in regex form for better conversion. I would relabel those tags and test them with relabelled analyzers to check for ambiguity. This will be my final deliverable.
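As an illustration of regex-based relabelling, the following Python sketch converts an Apertium-style analysis such as `dog<n><pl>` into a UD part-of-speech tag and feature string. It rests on my own assumptions: the mapping tables `POS_MAP` and `FEAT_MAP` are deliberately tiny and hypothetical, the function `relabel` is not part of lttoolbox, and the real relabelling files may take a different form.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, deliberately tiny mapping tables (assumption, not a full tagset).
POS_MAP: Dict[str, str] = {"n": "NOUN", "vblex": "VERB", "adj": "ADJ"}
FEAT_MAP: Dict[str, str] = {"sg": "Number=Sing", "pl": "Number=Plur", "past": "Tense=Past"}

# A lemma followed by one or more <tag> groups, e.g. "dog<n><pl>".
ANALYSIS_RE = re.compile(r"^(?P<lemma>[^<]+)(?P<tags>(?:<[^<>]+>)+)$")
TAG_RE = re.compile(r"<([^<>]+)>")


def relabel(analysis: str) -> Tuple[str, str, str, List[str]]:
    """Convert 'lemma<tag1><tag2>' into (lemma, UPOS, FEATS, unmapped tags)."""
    match = ANALYSIS_RE.match(analysis)
    if match is None:
        raise ValueError(f"not a 'lemma<tag>...' analysis: {analysis!r}")
    tags = TAG_RE.findall(match.group("tags"))
    upos = "X"  # UD fallback category when no part of speech is mapped
    feats, unmapped = [], []
    for tag in tags:
        if tag in POS_MAP and upos == "X":
            upos = POS_MAP[tag]
        elif tag in FEAT_MAP:
            feats.append(FEAT_MAP[tag])
        else:
            unmapped.append(tag)  # keep track of tags that still need a rule
    feats_str = "|".join(sorted(feats)) if feats else "_"
    return match.group("lemma"), upos, feats_str, unmapped
```

Returning the unmapped tags separately makes it easy to see which tags of a language pair still need a relabelling rule.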

Why Google and Apertium should sponsor it?

I feel the project has a wide scope, helping across languages and making work easier for all Apertium developers. Since knowing how to use and understand Universal Dependencies and treebanks is a valuable resource in the linguistic world, it will be beneficial if the Apertium UI is also updated and renewed, helping even new users to make, edit and use treebanks.

It is also important because it will help in better conversion of tagsets to Apertium, leading to better translations in the broader picture and thus benefitting developers.

The sponsorship will enable me to work full-time on this problem and put in my best effort.

How and who will benefit from it in society?

The Apertium community is very dedicated to bridging the language barrier for every language possible, even if a language is under-resourced, minoritised or marginalised, and Google helps in its own way via programs like GSoC. With an improved UI and better tagset conversions, the project will help any user understand the dependencies between words in any sentence in any language, and will also help developers writing translation rules.

Work Plan

Assuming all time periods follow the GSoC timeline:

I have set aside 2-3 hours (or more if required) to interact with mentors twice a week, or more often if needed, as better communication and clarity will help in faster execution.

Time Period Goal Details
Community Bonding Period

May 17-June 5

Finalize the implementation plan for the UD Annotatrix interface.
  • Decide which issues to fix, and look for new issues, if any.
  • Decide which new features to add and how to add them.
  • Decide on changes to existing features.
  • Decide how to approach tagset relabelling and conversion.
Week 1

June 6-12

Fixing issues, adding features and testing in the test environment
  • Solving issues and implementing new features
  • Testing new features and checking for bugs
  • If there are no bugs, committing to the repo.
Week 2

June 13-19

Fixing issues, adding features and testing in the test environment
  • Solving issues and implementing new features
  • Testing new features and checking for bugs
  • If there are no bugs, committing to the repo.
Code Review: Get work reviewed
Week 3

June 20-26

Adding features and testing in the test environment
  • Implementing new features
  • Testing new features and checking for bugs
  • If there are no bugs, committing to the repo.
Week 4

June 27-July 3

Changing existing features and adding functionality
  • Improving existing features.
  • Adding new functionality to existing features as decided in the plan
  • Checking the new functionality in the test environment.
Code Review: Get work reviewed
Week 5

July 4-10

Changing existing features and adding functionality
  • Improving existing features.
  • Adding new functionality to existing features as decided in the plan
  • Checking the new functionality in the test environment.
Week 6

July 11-17

Final touches to the UD Annotatrix interface
  • Finishing any issues and features still left on the list.
  • Testing the whole interface and pushing the code for integration
Midterm Evaluation: Get a review of the work done so far.
Week 7

July 18-24

Getting started with tagset relabelling
  • Writing tag relabelling files
    • Modifying tags in regex form as well as in other files.
  • Testing and troubleshooting the relabelled analyzers.
Week 8

July 25-31

Further work on tagset conversion and tag relabelling
  • Writing regexes for converting any tagset to Apertium
  • Testing and troubleshooting the relabelled analyzers.
Code Review: Get work reviewed
Week 9

August 1-7

Resolving ambiguity in tagset embedding
  • Resolving ambiguity in tagset embedding:
    • By implementing and changing regexes.
    • By other means, as per mentor guidance.
Week 10

August 8-14

Final deliverable
  • Buffer time for remaining tasks or complications.
  • Writing the evaluation report
Final Evaluation: Project completed
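
For the ambiguity work planned in Weeks 7 to 9, one possible starting point is a check that reports tag strings matched by relabelling rules with conflicting outputs. The Python sketch below is a minimal illustration under my own assumptions: the rule list `RULES` and the function `find_ambiguous` are hypothetical, and the actual approach will follow mentor guidance.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical relabelling rules: (regex over an Apertium tag string, UD UPOS).
RULES: List[Tuple[str, str]] = [
    (r"^<n>", "NOUN"),
    (r"^<vblex>", "VERB"),
    (r"^<vblex><pp>", "ADJ"),  # overlaps with the rule above on purpose
]


def find_ambiguous(
    tag_strings: List[str], rules: List[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map each tag string to its competing labels when the rules disagree."""
    compiled = [(re.compile(pattern), label) for pattern, label in rules]
    ambiguous = {}
    for tags in tag_strings:
        # Collect the distinct labels proposed by every matching rule.
        labels = sorted({label for regex, label in compiled if regex.search(tags)})
        if len(labels) > 1:
            ambiguous[tags] = labels
    return ambiguous
```

With the rules above, a past participle such as `<vblex><pp>` is reported with both `ADJ` and `VERB`, which marks it as a case where the regexes need to be changed or ordered.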

Work Done so Far

I have cloned and run the code and tried a few experiments of my own with the interface. I added small changes such as hover displays and buttons, as well as different ways of representing trees. I have also read various references on UD-related formats such as CoNLL-U and the VISL CG format.

I tested positive for Covid-19 in the first week of March, which slowed my work for almost 2 weeks, but I am now back on track. I am looking for new feature ideas for the interface and working to understand how tagset conversion takes place.

Skills

A general overview of my skills can be found in my CV

I am a Master's student in Computer Science at IIIT Hyderabad with in-depth knowledge of NLP and system design. I worked at Walmart Labs as an SDE in 2020, which enhanced my knowledge of React, Node and Python. I have been interested in linguistics from the very beginning and, thanks to rigorous programming courses, I am also adept at several languages such as Python, C++, XML and Bash scripting. I am skilled in writing algorithms, data structures and machine learning algorithms as well.

I took various NLP and linguistics courses throughout my undergraduate and postgraduate studies; all related work can be found in my GitHub repos. I also have a lot of experience studying language grammars, graph theory and annotations, which I feel is important, especially for the problem described in this proposal.

I also have a good working knowledge of context grammars, lexical rules and Universal Dependencies, as well as front-end development, with which I built various services for the organisation I worked at.

Alongside development, I love competitive coding on Codeforces and have also competed in Google Hash Code, Code Jam and Kick Start events. My highest rank was 345 in Kick Start in December 2020. I also performed well at the ACM ICPC regionals in 2019.

The problem I have chosen requires a strong background in Universal Dependencies, treebanks, Node.js and CoNLL-U as well as NLP and linguistics. I believe my experience equips me to achieve the work I have planned.

Non-Summer-of-Code plans for the Summer

I have no other plans for the summer, and my academic curriculum will be over by then, so I can devote all the time required; I can easily spend 35-40+ hours per week on the project.