User:AyushPradhan/GSoC2020Proposal

From Apertium
Revision as of 10:43, 30 March 2020 by AyushPradhan (talk | contribs)

Google Summer of Code Proposal

Robust Tokenisation in lttoolbox

Contact Information:

Name: Ayush Pradhan
E-mail: ap.ayush.pradhn@gmail.com
Location: Gujarat, India
Time zone: UTC+5:30
IRC: ayushPradhan
GitHub: git-ayush-pradhan
LinkedIn: View profile

Self Introduction

I am a second-year computer science undergraduate at Vellore Institute of Technology, India. I have 3 years of experience with C/C++ and 1 year each with Python and Java, and I have developed several projects in these languages and published them on GitHub.

I have been doing competitive programming for the last 2 years and have earned 2 stars on CodeChef and 3 or more stars on HackerRank in Python, C++, C and Problem Solving.

I am a goal-oriented, determined and ambitious person. I want to contribute more to open source, and I am motivated to prove myself and show my full potential as a developer in the community.

Reason why I am interested in Apertium

Apertium is a free/open-source rule-based machine translation platform that has been under development since the 2000s. While completing the coding challenge for the robust tokenisation task, I learned about the ICU library, which sparked my interest in Unicode. The task is to make lttoolbox fully Unicode-compliant with regard to alphabetic symbols and to improve its tokenisation, and I would be delighted to work on it.

Which of the published tasks are you interested in?

I am interested in the robust tokenisation task. It involves updating lttoolbox to be fully Unicode-compliant and proposing a tokenisation approach that can analyse a long string without interruption when it contains characters that lttoolbox does not currently treat as alphabetic. Such characters previously interrupted the analysis and produced unexpected output.

What do you plan to do?

lttoolbox, Apertium's finite-state analyser, currently runs into errors when a long string contains characters that it does not treat as alphabetic, with HFST-based analysers serving as the point of comparison for this task. Once lttoolbox is fully Unicode-compliant, encountering such a character during left-to-right analysis of a long string (for example, a letter in the Unicode general categories Lm or Lo) will no longer cause the unexpected output or failure seen in the current system. The tokenisation algorithm should be developed in a test-driven way: it should check each character, find the delimiters, and then tokenise. Different languages will need their own, more advanced test cases to give good results.
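The check, delimit and tokenise steps can be illustrated with a minimal Python sketch. It is not lttoolbox code: it uses the standard `unicodedata` module where the real implementation would be in C++ (possibly with ICU), and the names `is_alphabetic` and `tokenise` are hypothetical. Treating combining marks as part of a word is also an assumption made for this illustration.

```python
import unicodedata

# Unicode general categories counted as letters in this sketch.
# Lm (modifier letter) and Lo (other letter) are included explicitly.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_alphabetic(ch):
    """Return True if ch is a letter (including Lm/Lo) or a combining mark.

    Counting marks (Mn, Mc, Me) as word-internal is an assumption of this
    sketch, so that vowel signs stay attached to their base letters.
    """
    category = unicodedata.category(ch)
    return category in LETTER_CATEGORIES or category.startswith("M")


def tokenise(text):
    """Scan text left to right and return a list of tokens.

    Runs of alphabetic characters form one token; every other
    non-space character becomes a token of its own.
    """
    tokens = []
    current = []
    for ch in text:
        if is_alphabetic(ch):
            current.append(ch)
            continue
        # A non-alphabetic character delimits the current word.
        if current:
            tokens.append("".join(current))
            current = []
        if not ch.isspace():
            tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenise("kʰa, 日本語!"))
```

Here "ʰ" (U+02B0, category Lm) stays inside the word "kʰa" and the Lo letters of "日本語" form a single token, while the comma and exclamation mark are split off as separate tokens.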

Why should Apertium/Google sponsor it?

Apertium has started an exciting and useful project that requires full Unicode support in lttoolbox and a better tokenisation algorithm for characters that are not currently treated as alphabetic. Completing it would let Apertium better support languages whose scripts use letters in the Lm and Lo categories, and so reach a wider audience.

Work Plan:

Brief work plan

Week Objective
Week 1 - 3 Preparation time for in-depth study of lttoolbox
Week 4 - 10 Implementation of the tokenisation algorithm and full Unicode support in lttoolbox
Week 11 Documentation of the results in the official Apertium documents and sites

Detailed work plan

Preparation time

Week Expected work
Week 1
  • Understanding how tokenisation works.
  • Detailed analysis of the methods in the ICU library and how they work.
Week 2
  • Reading the Apertium documentation.
  • Understanding how lttoolbox works, especially lt-comp, lt-proc and lt-expand.
Week 3
  • Continuing to study lttoolbox and reading its source code.

Deliverable 1

A complete study of the existing lttoolbox code, discussed with the mentors, together with the changes that will be made over the course of the project.


Working time

Week Expected work
Week 4
  • Discussing with the mentor what was learned in the first 3 weeks.
  • Working out a customised timeline with the mentors if needed, and identifying the code that will be modified during the work period.
Week 5 and 6
  • Implementing full Unicode support in lttoolbox.
  • Discussing the implementation, and any changes to be made in the code, with the mentor.
Week 7
  • Debugging and testing the implementation on characters in categories such as Lm and Lo.
Week 8 and 9
  • Optimising the tokenisation algorithm with respect to the changes made to lttoolbox.
  • Delivering a test-driven tokenisation algorithm.
  • Discussing the customisations made with the mentor.
Week 10
  • Debugging and testing the tokenisation algorithm with different cases and languages.
Week 11
  • Buffer week for any unforeseen problems.
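A test-driven approach to the tokeniser could start from a table of inputs and expected tokens, as in the sketch below. Everything here is illustrative: `LEXICON`, `CASES`, `tokenise` and `failing_cases` are hypothetical names, the toy lexicon stands in for a compiled dictionary, and the left-to-right longest-match routine is a simplified stand-in rather than the lttoolbox algorithm itself. It relies on Python's `str.isalpha`, which accepts letters in the Lm and Lo categories.

```python
# Hypothetical toy lexicon; a real dictionary would be compiled with lt-comp.
LEXICON = {"new", "york", "new york", "in", "kʰa"}

# Test table: (input text, expected tokens).
CASES = [
    ("new york in", ["new york", "in"]),  # longest match wins over "new"
    ("new in", ["new", "in"]),
    ("inside", ["inside"]),               # "in" must not match inside a word
    ("kʰa!", ["kʰa", "!"]),               # Lm letter stays inside the word
    ("日本語", ["日本語"]),                # unknown run of Lo letters
]


def tokenise(text, lexicon):
    """Left-to-right longest-match tokenisation against a lexicon."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        end = None
        # Try the longest candidate first; it must end at a word boundary.
        for j in range(n, i, -1):
            at_boundary = j == n or not text[j].isalpha()
            if at_boundary and text[i:j] in lexicon:
                end = j
                break
        if end is None:
            # Unknown: take the whole alphabetic run, or one other character.
            end = i + 1
            if text[i].isalpha():
                while end < n and text[end].isalpha():
                    end += 1
        tokens.append(text[i:end])
        i = end
    return tokens


def failing_cases():
    """Return (text, expected, got) for every case that does not pass."""
    failures = []
    for text, expected in CASES:
        got = tokenise(text, LEXICON)
        if got != expected:
            failures.append((text, expected, got))
    return failures


print(failing_cases())
```

New languages or scripts can then be covered by adding rows to the table before the tokeniser is changed. The nested scan is quadratic in the input length, which is acceptable for a test sketch but not for production use.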

Deliverable 2

A newly optimised lttoolbox with full Unicode support and a tokenisation algorithm that handles letters in the Lm and Lo categories.


Documentation time

Week Expected work
Week 11/12
  • Documenting the changes made to lttoolbox and its tokenisation in the official Apertium documents and sites.

Deliverable 3

Documentation of the changes made to lttoolbox and its tokenisation, published in the official Apertium documents and sites.


Project Completed

The project is expected to be completed by the end of week 11.

Plans other than GSoC

I do not have any other commitments during the summer. I can commit 40 hours per week to the project. The only thing that could disrupt the proposed work plan is an unexpected task from college, which is unlikely to occur during this period.