User:Weizhe/GSOC 2020 proposal
Contact
Name: Weizhe Yang
E-mail: gavinwzmails@gmail.com
IRC nick: Weizhe
Location: Shandong, China
Time zone: UTC/GMT+8
GitHub: https://github.com/GavinWz
Self Introduction
I am a software engineering student. I am familiar with C++ and Python, comfortable working in mainstream Linux distributions, and proficient in Bash scripting. I have a basic understanding of Unicode encoding and the tokenization flow, and I have made some progress in learning ICU.
In my first year at university, I joined the university's high-performance computing center[1] as a research assistant. Through research and study during that period, I gained a deep understanding of software architecture and open source projects. In March 2019, I took part in the "LanQiao" national programming competition, a programming skill and algorithm contest, and won the second prize.
I am committed to improving my skills through practical project work, and I expect to gain more professional knowledge by working on Apertium.
Why I am interested in Apertium
Apertium focuses on open source software development for translation and lexical analysis. I am very interested in encoding formats and in lexical analysis, and this is exactly what Apertium works on: it translates between languages and analyzes words.
Target Project
The project I am interested in is "Robust tokenization in lttoolbox"[2]. I started working on it a month ago and have completed the Coding Challenge. The source code and documentation for the Coding Challenge are kept in my repository[3]. I have also set up the Apertium core and lttoolbox environment on my PC, experimented with them, and gained a general understanding of how they work. This has given me a deeper understanding of the target project and of how analysis and tokenization should be handled.
Proposal
Abstract
Apertium's lttoolbox currently reads and tokenizes characters from input streams, but it still has trouble handling non-alphabetic Unicode characters. While searching for the longest matching string, it may encounter a character that is not specified in the alphabet. This can abort character reading, derail the left-to-right (lr) analysis process, produce unexpected results, and cause later processing to fail.
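To make the failure mode above concrete, here is a minimal Python sketch of a greedy left-to-right longest-match scan. It is a toy model and not lttoolbox code: the function name `tokenize_longest_match` and the use of a plain set as the lexicon are assumptions made only for illustration. The point it shows is that a character outside the known forms is emitted as a single flagged token, so the scan keeps going instead of stopping.

```python
def tokenize_longest_match(text, lexicon):
    """Greedy left-to-right longest-match scan (illustrative toy, not lttoolbox).

    `lexicon` is a hypothetical set of known surface forms. Each result is a
    (token, is_known) pair. A character that starts no known form is emitted
    on its own with is_known=False, so the scan never aborts.
    """
    max_len = max((len(form) for form in lexicon), default=0)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a known form is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: consume exactly one code point and continue.
            tokens.append((text[i], False))
            i += 1
        else:
            tokens.append((match, True))
            i += len(match)
    return tokens


if __name__ == "__main__":
    for token in tokenize_longest_match("abc\u2192d", {"ab", "abc", "d"}):
        print(token)
```

Python strings are sequences of code points, so the one-character fallback never splits a code point. A C++ implementation would have to guarantee the same thing for its own string representation.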
My task will be to write algorithms that let character reading and recognition proceed smoothly regardless of the character type. After reading, the input text will be separated into lexemes according to a given class of delimiters. A new tokenization step will then assign the matching token to each lexeme.
The tokenization process will be divided into four separate stages: read, partition, identify, and tag. The algorithm for each stage will be designed, or adapted from the original code, independently of the others. Before designing each stage, I will prepare a test case, and the first design goal is to pass it. Once the basic test passes, I will write more demanding test cases and revise the algorithm until it passes them as well. Repeating this loop will make the program progressively more robust, and it will produce test documentation and source code for each unit along the way. After all units are complete, they will need to be integrated. A basic integration test case will be written first, and the same iterative approach will be applied until the new tokenization design is complete.
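The four stages can be sketched in Python as below. This is only an illustration under my own assumptions: the function names (`read`, `partition`, `identify`, `tag`, `tokenize`), the three character classes, and the set-based lexicon are all hypothetical, and the real implementation would live in the lttoolbox C++ code and use its dictionaries.

```python
import unicodedata
from itertools import groupby


def read(raw):
    """Stage 1: decode bytes without aborting; malformed bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def char_class(ch):
    """Coarse class of one code point (hypothetical three-way split)."""
    major = unicodedata.category(ch)[0]  # first letter of the general category
    if major in "LMN":                   # letters, combining marks, numbers
        return "word"
    if ch.isspace():
        return "space"
    return "other"


def partition(text):
    """Stage 2: split the text into maximal runs of the same character class."""
    return ["".join(group) for _, group in groupby(text, key=char_class)]


def identify(lexeme):
    """Stage 3: classify a lexeme; every character in a run shares one class."""
    return char_class(lexeme[0])


def tag(lexeme, kind, lexicon):
    """Stage 4: mark word lexemes as known or unknown against a toy lexicon."""
    if kind != "word":
        return (lexeme, kind)
    return (lexeme, "known" if lexeme.lower() in lexicon else "unknown")


def tokenize(raw, lexicon):
    """Run the four stages in order on raw input bytes."""
    return [tag(lx, identify(lx), lexicon) for lx in partition(read(raw))]


if __name__ == "__main__":
    sample = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e!".encode("utf-8")
    for item in tokenize(sample, {"na\u00efve", "caf\u00e9"}):
        print(item)
```

Keeping each stage as a separate function means each one can be tested on its own before the units are joined, which is the test-first loop described above.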
Benefits to society (Deliverable)
Upon completion of this project, lttoolbox will provide smooth lexical analysis, complete test documentation will be available, the compiler will be updated, and the documentation will be upgraded.
After this, Apertium users who use lttoolbox as an analysis tool will no longer be hindered by Unicode restrictions. It will also make Apertium accessible to more people as a useful lexical analysis tool.
Proposed Plan
I will start on April 1. From April 1 to April 28, I have four weeks to prepare: to familiarize myself with the way Apertium works and to begin learning the necessary background, such as Unicode encoding, the tokenization flow, and the ICU library. I will contribute to the project as early as possible. Counting this preparation, I have a total of 20 weeks to work on the project. There may be a buffer period for the semester exams (from June 22 to July 3).
Preparation before start (1 April - 28 April)
Week 1:
- Read the Unicode literature to gain an in-depth understanding of it
- Learn how tokenization works in detail.
- Discuss the specific work with the mentors and follow their advice.
Week 2 and 3:
- Learn ICU, identify the header files and functions that are likely to be needed, and become familiar with navigating the official ICU documentation so that unfamiliar function calls can be looked up quickly later.
- In my own program, try to read Unicode strings from the standard input stream and tokenize them using the longest-match rule.
Week 4:
- Read the Apertium documentation
- Configure the lttoolbox environment
Official start (April 28 - August 18)
Stage 1 (April 28 - May 26): Familiarity with how lttoolbox works
Week 1:
- Discuss specific study plans with the mentors. Identify the specific code that needs to be updated
- Resolve the problems left over from the preparation weeks and study them further
Week 2:
- Become familiar with the three commonly used lttoolbox tools, especially the analysis tool lt-proc
- Select and compile 2-3 dictionaries and test each option on them
Week 3 and 4:
- Read the relevant code in the lttoolbox repository
- Start with the FSTProcessor::analysis() function and read the lttoolbox source code again
- Understand how lttoolbox implements tokenization
- Deliverable: Familiarity with the way Apertium works and with the use of lttoolbox
Stage 2 (May 27 - June 23): Algorithm design
Week 5:
- Discuss with the mentors how to improve lttoolbox's Unicode compatibility by combining ICU with the original code
- Divide the tokenization process into four separate processes: reading, partitioning, identification, and tagging (token assignment)
Week 6:
- Following test-driven development (TDD), complete the reading and partitioning processes
Week 7:
- Complete the reading and identification processes
Week 8:
- Complete the reading and tokenization (tagging) processes
- First Evaluation
- Deliverable: Algorithms that implement these four functions
Stage 3 (June 24 - July 28): Completion of updates to lttoolbox
Week 9 and 10 (exam period):
- Debug the algorithms of Stage 2
Week 11 and 12:
- Integrate the four functions above
- Design a function that makes it easy for users to run tokenization
Week 13:
- Debug
- Optimize the code
- Second Evaluation
- Deliverable: The units are integrated and all project functionality is implemented
Stage 4 (July 29 - August 18): Update the compiler and documentation
Week 14 and 15:
- Update the compiler so that it can compile the new version of lttoolbox
- Update the lttoolbox documentation
Week 16:
- Complete all the unfinished tasks above
- Delivery of the project
- Deliverable: Updated compiler and documentation
Time Commitment
During the school term (April 28 to July 3), I will work six days a week. On weekdays (Tuesday to Friday) I can commit to more than four hours a day, and on weekends to more than six hours a day, for a total of more than 30 hours a week.
During the summer vacation (July 3 - August 18), I will focus on my contribution to Apertium and will have no other work or study commitments. I will work at least 6 days a week (Monday to Saturday), more than 6 hours a day, for a total of more than 36 hours a week.
If everything goes according to plan, I hope to finish the project ahead of schedule.