Difference between revisions of "User:AyushPradhan/GSoC2020Proposal"
Revision as of 10:30, 30 March 2020
Google Summer of Code Proposal
Robust Tokenisation in lttoolbox
Contact Information:
Name: Ayush Pradhan
E-mail: [ap.ayush.pradhn@gmail.com]
Location: Gujarat, India
Time zone: UTC+5:30
IRC: ayushPradhan
GitHub: git-ayush-pradhan
LinkedIn: View profile
Self Introduction
I am a second-year computer science undergraduate at Vellore Institute of Technology, India. I have 3 years of experience with C/C++ and 1 year each with Python and Java, and I have developed several projects in these languages and uploaded them to GitHub.
I have been doing competitive programming for the last 2 years and have earned 2 stars on CodeChef and 3+ stars on HackerRank in Python, C++, C and Problem Solving.
I am a goal-oriented, determined and ambitious person. I want to contribute more to open source, prove myself, and show my full potential as a developer in the community.
Reason why I am interested in Apertium
Apertium is a free, open-source, rule-based machine translation platform dating from the 2000s. While completing the coding challenge for the robust tokenisation task, I learned about the ICU library, which got me interested in Unicode encoding. The task is to update lttoolbox so that it is fully Unicode aware with regard to alphabetic symbols and to improve its tokenisation, which is work I would be delighted to do.
Which of the published tasks are you interested in?
I am interested in the robust tokenisation task: updating lttoolbox to be fully Unicode aware and proposing a tokenisation scheme that lets analysis run uninterrupted over a long string containing non-alphabetical characters, where it currently gets interrupted and produces unexpected output.
What do you plan to do?
Apertium follows an HFST-based analyser, which leads to errors when it deals with non-alphabetical characters in a long string. Once lttoolbox is fully Unicode aware, the left-to-right analysis of a long string should no longer fail or produce unexpected output when it encounters a character from the Lm or Lo categories, as it does in the current system. The tokenisation work should be test driven: the algorithm should first check each character, then delimit, then tokenise. Different languages will need their own extended test cases to get good results.
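The check, delimit, tokenise order described above can be sketched in a few lines. This is only an illustration in Python using the standard `unicodedata` module, not lttoolbox code (lttoolbox itself is written in C++). The function names `is_alphabetic` and `tokenise` are hypothetical, and treating combining marks (categories starting with M) as part of a word is an assumption made here so that scripts with vowel signs stay in one token.

```python
import unicodedata
from typing import List


def is_alphabetic(ch: str) -> bool:
    """Check step: accept any Unicode letter (Lu, Ll, Lt, Lm, Lo).

    Assumption for this sketch: combining marks (Mn, Mc, Me) also count
    as word characters, so a letter plus its vowel sign is not split.
    """
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenise(text: str) -> List[str]:
    """Scan left to right, delimiting on every non-alphabetic character."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_alphabetic(ch):
            # Still inside a word: keep collecting characters.
            current.append(ch)
            continue
        # Delimit step: a non-alphabetic character closes the current word.
        if current:
            tokens.append("".join(current))
            current = []
        # Punctuation and symbols become their own tokens; whitespace is dropped.
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

With this sketch, `tokenise("नमस्ते दुनिया!")` keeps each Devanagari word (Lo letters plus marks) as a single token and returns the exclamation mark separately. Cases like this could serve as starting points for the per-language test cases.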
Why should Apertium/Google sponsor it?
Apertium has started an exciting and useful project that requires a fully Unicode-aware system and a better tokenisation algorithm for handling non-alphabetical characters. Following through on this project would improve support for languages whose scripts fall under the Lm and Lo categories and help Apertium reach a wider audience.
Work Plan:
Brief work plan
Week | Objective |
---|---|
Week 1 - 3 | Preparation time for in-depth study of lttoolbox |
Week 4 - 10 | Implementation of the tokenisation algorithm in lttoolbox and of the fully Unicode-aware system |
Week 11 | Documentation of the results in the official documents and sites of Apertium |
Detailed work plan
Preparation time
Week | Expected work |
---|---|
Week 1 |
|
Week 2 |
|
Week 3 |
|
Deliverable 1
Complete study of the existing lttoolbox code, and discussion with mentors of the changes that will be made over the course of the project.
Working time
Week | Expected work |
---|---|
Week 4 |
|
Week 5 and 6 |
|
Week 7 |
|
Week 8 and 9 |
|
Week 10 |
|
Week 11 |
|
Deliverable 2
An updated lttoolbox with fully Unicode-aware code and a tokenisation algorithm for non-alphabetical characters in the Lm and Lo categories.
Documentation time
Week | Expected work |
---|---|
Week 11/12 |
|
Deliverable 3
Documentation of the changes to lttoolbox and its tokenisation in the official Apertium documents and sites.
Project Completed
The project is expected to be completed by the end of week 11.
Plans other than GSoC
I don’t have any other commitments during the summer period. I can commit 40 hours per week to the project. The only thing that could interfere with the proposed work plan is a pre-scheduled task from my college, which is unlikely to occur during this period.