Difference between revisions of "User:AyushPradhan/GSoC2020Proposal"
Revision as of 09:46, 30 March 2020
Google Summer of Code Proposal
Contact Information:
Name: Ayush Pradhan
E-mail: [ap.ayush.pradhn@gmail.com]
Location: Gujarat, India
Time zone: UTC+5:30
IRC: ayushPradhan
GitHub: git-ayush-pradhan
LinkedIn: View profile
Self Introduction
I am a second-year computer science undergraduate at Vellore Institute of Technology, India. I have 3 years of experience with C/C++ and 1 year each with Python and Java. I have developed many projects in these languages and uploaded them to GitHub.
I have been doing competitive coding for the last 2 years and have earned 2 stars on CodeChef and 3+ stars on HackerRank in Python, C++, C and Problem Solving.
I am a goal-oriented, determined and ambitious person. I want to contribute more to open source, prove myself, and show my full potential as a developer in the community.
Reason why I am interested in Apertium
Apertium is a free and open-source rule-based machine translation platform that has been under development since the 2000s. While completing the coding challenge for the robust tokenisation task, I learned about the ICU library, which got me interested in Unicode encoding. The task is to update lttoolbox so that it is fully Unicode-aware with regard to alphabetical symbols and to improve its tokenisation, and I would be delighted to work on it.
Which of the published tasks are you interested in?
The task I am interested in is robust tokenization. It involves updating lttoolbox to be fully Unicode-aware and proposing a tokenization that lets the analysis of a long string continue uninterrupted when it contains non-alphabetical characters, which currently interrupt the analysis and produce unexpected output.
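To illustrate why a character check limited to ASCII letters interrupts the analysis, here is a minimal sketch in Python using the standard `unicodedata` module. It is not lttoolbox code: the helper names `is_ascii_alpha` and `is_unicode_letter` are hypothetical and exist only for this example.

```python
import unicodedata


def is_ascii_alpha(ch):
    """Naive check (hypothetical): only the 52 ASCII letters count."""
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


def is_unicode_letter(ch):
    """Unicode-aware check (hypothetical): any general category L* counts,
    which covers Lu, Ll, Lt, Lm and Lo."""
    return unicodedata.category(ch).startswith("L")


# One character from several scripts, plus a punctuation mark.
SAMPLES = ["a", "\u00e9", "\u02b0", "\u3042", "\u0e01", "!"]

for ch in SAMPLES:
    print(
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        "ascii_alpha=%s" % is_ascii_alpha(ch),
        "unicode_letter=%s" % is_unicode_letter(ch),
    )
```

The modifier letter U+02B0 (category Lm) and the Hiragana and Thai letters U+3042 and U+0E01 (category Lo) fail the ASCII check but pass the category-based one, so a tokenizer built on the naive check would split words at those characters.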
What do you plan to do?
Apertium uses an HFST-based analyser, which runs into errors when it deals with non-alphabetical characters in a long string. Once lttoolbox is fully Unicode-aware, encountering such a character (Unicode general categories Lm and Lo) during the left-to-right analysis of a long string should no longer cause the unexpected output or failure seen in the current system. The tokenization should be developed in a test-driven way for the best results, and should proceed in three steps: checking each character, delimiting, and then tokenizing. Different languages will need their own modified or more advanced test cases to give well-optimized results.
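The check, delimit, tokenize sequence could be sketched as follows. This is an illustrative Python sketch under my own assumptions, not the lttoolbox algorithm: it treats every character in a letter (L*) or mark (M*) category as part of a word, treats whitespace as a delimiter, and emits any other character as a token of its own. The names `is_word_char` and `tokenize` are hypothetical.

```python
import unicodedata
from typing import List


def is_word_char(ch: str) -> bool:
    """Check step: letters (incl. Lm, Lo) and combining marks belong to a word."""
    return unicodedata.category(ch)[0] in ("L", "M")


def tokenize(text: str) -> List[str]:
    """Scan left to right, cutting the string at every non-word character."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
            continue
        # Delimit step: a non-word character closes the word being built.
        if current:
            tokens.append("".join(current))
            current = []
        # Whitespace is dropped; punctuation and symbols become tokens.
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # Devanagari word (Lo letters plus Mn marks), then ASCII text.
    print(tokenize("\u0928\u092e\u0938\u094d\u0924\u0947, world!"))
```

Including the mark categories keeps a Devanagari word together even though its virama and vowel sign are not letters. Scripts written without spaces would still come out as one long token here, so a real analyser would also have to match the input against its dictionary.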
Why should Apertium/Google sponsor it?
Apertium has started an exciting and useful project that calls for a fully Unicode-aware system and a better tokenization algorithm for non-alphabetical characters. Completing this project would let Apertium support more languages whose scripts fall under the Lm and Lo categories and reach a wider audience.
Work Plan:
Preparation time
Week 1:
- Understanding how tokenization works.
- Detailed analysis of the methods and workings of the ICU library.
Week 2:
- Reading the Apertium documentation.
- Understanding how lttoolbox works, especially lt-comp, lt-proc and lt-expand.
Week 3:
- Continuing to study lttoolbox and reading its code.
Deliverable 1
Complete study of the existing lttoolbox code, and discussion with the mentors of the changes that will be made over the course of the project.
Working time
Week 4:
- Discussion with the mentor about the knowledge gained in the first 3 weeks.
- Working out a customized timeline with the mentors if needed, and identifying the code that will be modified during the work period.
Week 5 and 6:
- Implementation of a fully Unicode-aware system in lttoolbox.
- Discussion with the mentor during implementation about the changes to be made in the code.
Week 7:
- Debugging and testing of the implemented algorithm for non-alphabetical characters, such as those in the Lm and Lo categories.
Week 8 and 9:
- Optimizing the tokenization algorithm with respect to the changes made to lttoolbox.
- Delivering a test-driven tokenization algorithm.
- Discussion with the mentor about the customizations made.
Week 10:
- Debugging and testing of the tokenization algorithm with different cases and languages.
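Testing with different cases and languages could be organised as a table of inputs and expected tokens. The sketch below, in Python, is only an illustration of that idea: the `tokenize` function is a hypothetical stand-in built on Unicode general categories, not the lttoolbox implementation, and the cases are examples I chose rather than an agreed test suite.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Classify a character as part of a word, whitespace, or anything else."""
    if unicodedata.category(ch)[0] in ("L", "M"):
        return "word"
    if ch.isspace():
        return "space"
    return "other"


def tokenize(text):
    """Hypothetical stand-in tokenizer: group consecutive characters by class."""
    tokens = []
    for cls, group in groupby(text, key=char_class):
        chunk = "".join(group)
        if cls == "word":
            tokens.append(chunk)
        elif cls == "other":
            tokens.extend(chunk)  # each punctuation mark or symbol on its own
    return tokens


# (input, expected tokens): one row per script or edge case.
CASES = [
    ("hello world", ["hello", "world"]),
    ("caf\u00e9!", ["caf\u00e9", "!"]),
    ("\u0928\u092e\u0938\u094d\u0924\u0947 world",
     ["\u0928\u092e\u0938\u094d\u0924\u0947", "world"]),
    ("a\u02b0 b", ["a\u02b0", "b"]),  # U+02B0 is a modifier letter (Lm)
    ("", []),
]


def run_cases(cases):
    """Return a list of (input, expected, actual) for every failing case."""
    failures = []
    for text, expected in cases:
        actual = tokenize(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


if __name__ == "__main__":
    print(run_cases(CASES) or "all cases passed")
```

Adding a language then means adding rows to `CASES`, so the expected behaviour is written down before the tokenizer itself is changed.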
Week 11:
- Buffer week for any unforeseen problems.
Deliverable 2
A newly optimized, fully Unicode-aware lttoolbox and a tokenization algorithm that handles non-alphabetical characters in the Lm and Lo Unicode categories.
Documentation time
Week 11/12:
- Documenting the changes implemented in lttoolbox and the tokenization in the official Apertium documentation and sites.
Deliverable 3
Documentation of the changes implemented in lttoolbox and the tokenization, added to the official Apertium documentation and sites.
Project Completed
Plans other than GSoC
I have no other commitments during the summer and can commit 40 hours per week to the project. The only thing that could affect the proposed work plan is a pre-scheduled college task, which is unlikely to come up during the project.