Difference between revisions of "User:AyushPradhan/GSoC2020Proposal"

From Apertium

Revision as of 09:43, 30 March 2020

Google Summer of Code Proposal

Contact Information:

  • Name: Ayush Pradhan
  • E-mail: ap.ayush.pradhn@gmail.com
  • Location: Gujarat, India
  • Time zone: UTC+5:30
  • IRC: ayushPradhan
  • GitHub: git-ayush-pradhan
  • LinkedIn: https://www.linkedin.com/in/ayush-pradhan-a8bb46195/

Self Introduction

I am a second-year computer science undergraduate at Vellore Institute of Technology, India. I have three years of experience with C/C++ and one year each with Python and Java. I have built several projects in these languages and published them on GitHub.

I have been doing competitive programming for the last two years and have earned 2 stars on CodeChef (https://www.codechef.com/users/ayush_pradhan) and 3+ stars on HackerRank (https://www.hackerrank.com/AyushPradhan) in Python, C++, C, and Problem Solving.

I am a goal-oriented, determined, and ambitious person. I want to contribute more to open source, prove myself, and show my full potential as a developer in the community.

Reason why I am interested in Apertium

Apertium is a free and open-source rule-based machine translation platform that has been in development since the mid-2000s. While completing the coding challenge for the robust tokenization task, I learned about the ICU library, which got me interested in Unicode. The task is to update lttoolbox to be fully Unicode-compliant with regard to alphabetic symbols and to improve its tokenization, which I would be delighted to work on.

Which of the published tasks are you interested in?

The task I am interested in is robust tokenization: updating lttoolbox to be fully Unicode-compliant and proposing a tokenization approach that processes long strings containing non-alphabetic characters without interruption. In the current system, such strings can interrupt the analysis and produce unexpected output.

What do you plan to do?

Apertium uses an HFST-based analyser, which produces errors when it processes long strings containing non-alphabetic characters. Once lttoolbox is fully Unicode-aware, left-to-right analysis of a long string should no longer fail or produce unexpected output when it encounters a character that the current system treats as non-alphabetic, for example one in the Unicode general categories Lm (modifier letter) or Lo (other letter).

Tokenization should be developed in a test-driven way: the algorithm should check each character, identify delimiters, and only then emit tokens. Different languages will need their own, more advanced test cases to get good results.
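The check, delimit, and tokenize steps described above can be illustrated with a minimal Python sketch. It is not lttoolbox code and not the proposed implementation: the function names `is_letter` and `tokenise` are hypothetical, and it relies only on Python's standard `unicodedata` module instead of ICU. It assumes that any character whose general category starts with "L" (including Lm and Lo) counts as alphabetic, and that combining marks stay attached to the letter before them.

```python
import unicodedata
from typing import List


def is_letter(ch: str) -> bool:
    """Check step: true for any Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def tokenise(text: str) -> List[str]:
    """Split text into runs of letters and single non-letter characters.

    Whitespace only delimits tokens and is never emitted. A combining mark
    (category M*) that follows a letter is kept inside the current token.
    This is an illustrative assumption, not lttoolbox behaviour.
    """
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_letter(ch) or (current and is_mark):
            current.append(ch)
            continue
        # Delimit step: a non-letter closes the token collected so far.
        if current:
            tokens.append("".join(current))
            current = []
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(unicodedata.category("\u02b0"))  # MODIFIER LETTER SMALL H
    print(unicodedata.category("\u4e2d"))  # CJK ideograph
    print(tokenise("abc, \u4e2d\u6587!"))
```

In this sketch every non-letter character becomes its own token, so a number such as "12" is split digit by digit. A real tokenizer would need a deliberate policy for digits and punctuation.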

Why should Apertium/Google sponsor it?

Apertium has started an exciting and useful project that requires a fully Unicode-aware system and a better tokenization algorithm to handle non-alphabetic characters correctly. Completing it would let Apertium support more languages whose scripts fall under the Lm and Lo categories and reach a wider audience.

Work Plan:

Preparation time

Week 1:

  • Understanding how tokenization works.
  • Detailed analysis of the methods and workings of the ICU library.

Week 2:

  • Reading the Apertium documentation.
  • Understanding how lttoolbox works, especially lt-comp, lt-proc and lt-expand.

Week 3:

  • Continuing to study lttoolbox and reading its source code.

Deliverable 1

A complete study of the lttoolbox code, discussed with the mentors, together with an agreed list of changes to be made during the project.

Working time

Week 4:

  • Discussion with the mentor about what I learned in the first 3 weeks.
  • Working out a customized timeline with the mentors if needed, and identifying the code that will be modified during the work period.

Week 5 and 6:

  • Implementation of full Unicode support in lttoolbox.
  • Discussion with the mentor during implementation about the changes to be made in the code.

Week 7:

  • Debugging and testing of the implemented changes with characters in the Lm and Lo categories.

Week 8 and 9:

  • Optimizing the tokenization algorithm with respect to the changes made to lttoolbox.
  • Delivering a test-driven tokenization algorithm.
  • Discussion with the mentor about the customizations made.
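As an illustration of what "test-driven" could mean here, the following Python sketch keeps the expected tokenizations in a table and reports every case where the tokenizer disagrees. Everything in it is hypothetical: the regex-based `tokenise` function is only a stand-in for the real lttoolbox tokenizer, and the cases are sample inputs, not project test data.

```python
import re
from typing import List, Tuple

# Stand-in tokenizer (an assumption for illustration only): a token is either
# a run of Unicode word characters or one non-space, non-word character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenise(text: str) -> List[str]:
    return TOKEN.findall(text)


# Each case pairs an input string with the tokens we expect to get back.
CASES: List[Tuple[str, List[str]]] = [
    ("Hello, world!", ["Hello", ",", "world", "!"]),
    ("na\u00efve caf\u00e9", ["na\u00efve", "caf\u00e9"]),
    ("\u4e2d\u6587 text", ["\u4e2d\u6587", "text"]),
    ("", []),
]


def run_cases(cases: List[Tuple[str, List[str]]]) -> List[Tuple[str, List[str], List[str]]]:
    """Return (input, expected, actual) for every case that fails."""
    failures = []
    for text, expected in cases:
        actual = tokenise(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    for text, expected, actual in run_cases(CASES):
        print(f"FAIL {text!r}: expected {expected}, got {actual}")
```

Adding a language then means adding rows to the table, so the cases can be written before the tokenizer itself is changed.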

Week 10:

  • Debugging and testing of the tokenization algorithm with different cases and languages.

Week 11:

  • Buffer week for any unforeseen problems.

Deliverable 2

A newly optimized lttoolbox with full Unicode support and a tokenization algorithm that handles characters in the Lm and Lo categories.

Documentation time

Week 11/12:

  • Documenting the implemented changes to lttoolbox and tokenization in Apertium's official documentation and sites.

Deliverable 3

Documentation of the implemented changes to lttoolbox and tokenization, published in Apertium's official documentation and sites.

Project Completed

Plans other than GSoC

I don't have any other commitments during the summer and can commit 40 hours per week to the project. The only thing that could affect the proposed work plan is a pre-scheduled college task, which is unlikely to come up during this period.