User:Shobhit Gautam 1503


Google Summer of Code 2021 proposal


Name: Shobhit Raj Gautam


IRC: Shobhit1503, Shobhit_1503

Current Status: Final-semester, final-year student at IIIT Hyderabad, with hands-on experience in system architecture and design as well as NLP and IRE.

Related to MT, I have built a prototype MT system: a bidirectional RNN encoder-decoder with an attention layer.



Timezone: GMT+5:30

About Me:

I am a final-semester, final-year student at IIIT Hyderabad, with hands-on experience in system architecture and design as well as NLP and IRE. Related to MT, I have built a prototype bidirectional RNN encoder-decoder with an attention layer. I was also involved in designing an IoT application platform that can host several microservices, each running a different application. Through this work I have gained in-depth working knowledge of deep learning and of building new frameworks.


I am interested in NLP, data mining, and math, as well as system design. I love taking part in competitive challenges and exploring new technologies as much as possible.

Why I am interested in Apertium

Apertium is an open-source rule-based MT system. Given how many languages are spoken around the globe, the fact that people face difficulties simply because they cannot translate a language into their own is a huge setback and a technological challenge. Apertium is a very bright initiative to bridge this language barrier. It is rule-based, unlike deep neural network systems, which cannot fulfil the task of preserving linguistic diversity simply because endangered languages don’t offer sufficient data.

Because Apertium is rule-based, its resources can be enriched, and the way particular translations are defined and their ambiguities resolved can be controlled from the developer's end. This proves to be more efficient for lesser-known languages. It is a dream organization for me because I get to contribute to a cause that can help every person around the world communicate without a language barrier, and I hope to become a part of this amazing community of developers!

I have been consistently active on the IRC channel and have devoted time to learning how Apertium works: its rules for different languages, especially Indian languages, as well as its methods of resolving ambiguity and its lexical selection rules.

Besides, the project I chose provides a valuable resource for linguistics, thus helping to develop better translations!

Proposal: UD and Apertium integration

My Idea and Deliverables

My proposal is to:

  • Enhance the UD Annotatrix interface by resolving as many of the existing issues as possible and adding new features for better understandability.
  • lttoolbox relabelling, i.e. tagset conversion to the Apertium API, and resolving ambiguity in tagset embedding as well as in relabelling.
  • lttoolbox integration, i.e. defining constraints on analysers for better tagging and lemmatisation in UDPipe (predicting the percentage, and checking the accuracy, of projective trees in treebanks).
  • Setting up Apertium embeddings in UDPipe.

UD Annotatrix provides important information about different treebanks and is therefore a vital part of the Apertium project. However, it has many open issues, it is hard to use because of bugs in the interface, and there is no way to upload files or load them from a URL. My plan is to set up the interface so that any user, even without complete prior knowledge of UD (CoNLL-U), can use the service and build treebanks. This will include working on CoNLL-U treebanks with syntactic and morphological annotation as well as the miscellaneous fields, as in [1].

I also plan to add a new method of representing UD trees, as mentioned in this paper. It will automatically check for potential errors and point them out.

The interface will also include hovering labels, a how-to-use guide, modified shortcuts, and easy uploading, as well as a way to store enhanced dependencies, and more if things go according to schedule.

The first deliverable will be the UD Annotatrix interface with issues resolved and new features added.

Furthermore, Apertium supports multiple languages and therefore a large number of tags, many of which require relabelling in regex form for better conversion. I would relabel those tags, test them with relabelled analysers, and check for ambiguity.
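As a rough illustration of what regex-based relabelling could look like, here is a minimal sketch. It is not the actual lttoolbox relabelling format: the `TAG_RULES` table, the `relabel` function, and the choice of mapped tags are all hypothetical and only show the idea of mapping Apertium-style tags to UD-style labels while reporting tags that have no rule yet.

```python
import re

# Hypothetical, illustrative rules: (regex over one Apertium tag, (kind, UD value)).
# A real relabelling file would cover far more tags and languages.
TAG_RULES = [
    (re.compile(r"^n$"), ("UPOS", "NOUN")),
    (re.compile(r"^vblex$"), ("UPOS", "VERB")),
    (re.compile(r"^adj$"), ("UPOS", "ADJ")),
    (re.compile(r"^sg$"), ("FEAT", "Number=Sing")),
    (re.compile(r"^pl$"), ("FEAT", "Number=Plur")),
    (re.compile(r"^p([123])$"), ("FEAT", r"Person=\1")),
]

# Matches each <tag> in an analysis such as "house<n><pl>".
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def relabel(analysis):
    """Convert one analysis string into (lemma, UPOS, FEATS, unmapped tags)."""
    lemma = analysis.split("<", 1)[0]
    upos = "X"  # UD's label for "other", used when no POS rule matches
    feats = []
    unmapped = []
    for tag in TAG_PATTERN.findall(analysis):
        for pattern, (kind, template) in TAG_RULES:
            match = pattern.match(tag)
            if match:
                value = match.expand(template)  # fills \1 from the regex group
                if kind == "UPOS":
                    upos = value
                else:
                    feats.append(value)
                break
        else:
            # No rule matched: keep the tag so it can be reviewed by hand.
            unmapped.append(tag)
    return lemma, upos, "|".join(sorted(feats)) or "_", unmapped


if __name__ == "__main__":
    print(relabel("house<n><pl>"))
    print(relabel("go<vblex><pri><p3><sg>"))
```

With these example rules, `relabel("go<vblex><pri><p3><sg>")` returns `("go", "VERB", "Number=Sing|Person=3", ["pri"])`. The list of unmapped tags is one simple way to spot where a rule is still missing.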

I will also set up embeddings in UDPipe as defined in Apertium with the relabelled tags, find bugs in UDPipe, and train UDPipe to measure the accuracy of projective trees.
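The percentage of projective trees in a treebank can be estimated with a small script like the sketch below. It is independent of UDPipe and makes some assumptions: the input is CoNLL-U text with the HEAD column (column 7) filled in, the artificial root (position 0) is included when checking for crossing arcs, and the function names are my own.

```python
def read_heads(conllu_text):
    """Yield one list of (token id, head id) pairs per sentence."""
    sentence = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        sentence.append((int(cols[0]), int(cols[6])))
    if sentence:
        yield sentence


def is_projective(arcs):
    """Return True if no two dependency arcs cross each other."""
    spans = [(min(dep, head), max(dep, head)) for dep, head in arcs]
    for i, (l1, r1) in enumerate(spans):
        for l2, r2 in spans[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the other arc.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True


def projective_percentage(conllu_text):
    """Percentage of sentences in the text whose tree is projective."""
    trees = list(read_heads(conllu_text))
    if not trees:
        return 0.0
    return 100.0 * sum(is_projective(tree) for tree in trees) / len(trees)
```

The pairwise check is quadratic in sentence length, which is usually acceptable for treebank sentences; a faster check could replace it if needed.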

This will be my final deliverable.

Why should Google and Apertium sponsor it?

I feel the project has a wide scope, helping across languages and making work easier for all Apertium developers. Since knowing how to use and understand Universal Dependencies and treebanks is a valuable resource in the linguistic world, it will be beneficial if UD Annotatrix is updated and renewed, which will also help new users to make, edit, and use treebanks.

It is also important because better conversion of tagsets to Apertium leads to better translations, and a better-defined and better-trained UDPipe, in terms of embeddings and treebanks, will help the broader picture and thus benefit developers.

The sponsorship will enable me to work full-time on this problem and put in my best effort.

Who in society will benefit from it, and how?

The Apertium community is very dedicated to bridging the language barrier for every language possible, even if a language is under-resourced and minoritised/marginalised, and Google helps in its own way through programs like GSoC. An improved UD Annotatrix interface will help any user understand the dependencies between words in any sentence in any language, and will also help developers when writing translation rules. Better tagsets and treebanks, together with embeddings in UDPipe, will definitely help developers.

Work Plan

Assuming the standard GSoC timeline:

I have set aside 2-3 hours, or more if required, to interact with mentors twice a week, or more often if required, since better communication and clarity will help with faster execution.

Time Period Goal Details
Community Bonding Period

May 17-June 5

Finalize the implementation plan for the UD Annotatrix interface.

Set the plan for tagset conversion and relabelling.

Set the plan for UDPipe and embeddings in UDPipe.

  • Decide which issues to fix, and look for new issues, if any.
  • Decide what new features to add to UD Annotatrix, and how.
  • Decide how to approach tagset relabelling, conversion, and ambiguity resolution.
  • Find bugs in UDPipe and lay out a plan for the UDPipe implementation.
Week 1

June 6-12

Fixing issues, adding features in the interface
  • Solving existing issues.
  • Implementing new features as per mentor guidance.
  • Testing new features and checking for bugs.
  • If there are no bugs, committing to the repo.
Week 2

June 13-19

Further work on the UD Annotatrix interface
  • Implementing new features as per mentor guidance.
  • Testing the latest build and checking for bugs.
Code Review Get Work Reviewed
Week 3

June 20-26

Final touches to the UD Annotatrix interface
  • Working through any issues and features left on the list.
  • Testing the whole interface and pushing the code for integration.
Week 4

June 27-July 3

Getting started with tagset relabelling
  • Writing tag relabelling files.
    • Modifying tags in the regexes as well as in other files.
  • Testing and troubleshooting the relabelled analysers.
Code Review Get Work Reviewed
Week 5

July 4-10

Further work on tagset conversion and tag relabelling
  • Writing regexes for converting any tagset to Apertium.
  • Testing and troubleshooting the relabelled analysers.
Week 6

July 11-17

Resolving ambiguity in tagset embedding
  • Resolving ambiguity in tagset embedding:
    • By implementing or changing regexes.
    • By other means, as per mentor guidance.
Midterm Evaluation Get a review of work done so far.
Week 7

July 18-24

Getting started with the UDPipe implementation
  • Finding bugs in the script and resolving them.
  • Getting projective data and trying to train on it.
  • Trying different classifier settings and finding the best hyperparameters.
Week 8

July 25-31

Further work on UDPipe
  • Getting the percentage of projective trees in treebanks after training.
  • Improving accuracy and experimenting with training.
Code Review Get Work Reviewed
Week 9

August 1-7

Setting up Apertium embeddings in UDPipe
  • Getting embeddings from different sources into Apertium and then into UDPipe.
  • Resolving any ambiguity when deploying them to UDPipe.
Week 10

August 8-14

Final deliverable
  • Extra time for any remaining tasks or complications.
  • Writing the evaluation report.
Final Evaluation Project Completed

Work Done so Far

I have cloned and run the code and tried a few experiments of my own with the interface. I added small changes such as hover displays and buttons, as well as different ways of representing trees. I also read various references on different UD formats such as CoNLL-U and the VISL CG format, so I know how UD works.

I have also cloned UDPipe and run it with the same parameters given in the wiki to understand how it works.

I tested positive for Covid-19 in the first week of March, which slowed my work for almost two weeks, but I am now back on track. I am experimenting with the UD Annotatrix interface, planning how tagset conversion will be implemented, and training a language in UDPipe.


A general overview of my skills can be found in my CV

I am a Master's student in Computer Science at IIIT Hyderabad with in-depth knowledge of NLP as well as system design. I worked at Walmart Labs as an SDE in 2020, which enhanced my knowledge of React, Node, and Python. I have been interested in linguistics from the very beginning, and thanks to rigorous programming courses I am also adept at several languages such as Python, C++, Java, XML, and Bash scripting. I am skilled in writing algorithms, data structures, and machine learning algorithms as well.

I took various NLP and linguistics courses throughout my undergraduate and postgraduate studies, and all the related work can be found on my GitHub repo. I also have a lot of experience studying language grammars, graph theory, and annotations, which I feel is especially important for the problem described in this proposal.

I have taken part in many contests on Kaggle as well as AIcrowd, which has given me in-depth knowledge of machine learning. I also know well how context grammars, lexical rules, and Universal Dependencies work, as well as front-end development, with which I built various services for the organisation I worked at.

Alongside development, I love competitive coding on Codeforces, and I have also taken part in Google Hash Code, Code Jam, and Kick Start events. My highest rank was 345 in Kick Start in December 2020. I also performed well at the ACM ICPC regionals in 2019.

The problem I have chosen requires a strong background in deep learning, Universal Dependencies, treebanks, Node.js, and CoNLL-U, as well as NLP and linguistics. I believe my experience equips me to achieve the work I have planned.

Non-Summer-of-Code plans for the Summer

I have no other plans for the summer, and my academic curriculum will be over by then, so I can devote all the time required. I can easily spend 35-40+ hours per week on the project.