User:Weizhe/GSOC 2020 proposal
Name: Weizhe Yang
IRC nick: Weizhe
Location: Shandong, China
Time zone: UTC/GMT+8
I am a software engineering student at Weifang University, China. I am familiar with C/C++, Python and Java, and I am comfortable working in mainstream Linux distributions and writing Bash scripts. I have a basic understanding of Unicode encoding and the tokenization flow, and I have made some progress in learning ICU.
In my first year at university, I joined the university's high-performance computing center as a research assistant. Through the research and study I did there, I gained a deep understanding of software architecture and open source projects. In March 2019, I took part in the "LanQiao" national programming competition, a programming skill and algorithm contest, and won second prize.
I am committed to improving my skills through hands-on project work, and I expect to gain more professional knowledge by working with Apertium.
Why I am interested in Apertium
Apertium focuses on open source software for translation and lexical analysis. I am very interested in character encodings and in lexical analysis, and this is exactly what Apertium works on: it translates between languages and analyzes word forms.
The project that interests me is "Robust tokenization in lttoolbox". I started on it a month ago and have completed the Coding Challenge; the source code and documents for the Coding Challenge are kept in my Repository. I have also set up the Apertium core and the lttoolbox environment on my PC, tried them out, and gained a general understanding of how they work. As a result, I now understand the target project well enough to approach lexical analysis and tokenization with a clearer plan.
Apertium's lttoolbox currently reads characters from input streams and tokenizes them, but it still has trouble handling non-alphabetic Unicode characters. While searching for the longest match, it may encounter a character that is not specified in the alphabet. This can abort the read, derail the left-to-right (LR) analysis, produce unexpected results, and cause later processing steps to fail.
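To make the failure mode concrete, here is a minimal Python sketch of left-to-right longest matching that never aborts on an unexpected character. It is not lttoolbox's transducer-based code: the `LEXICON` set, the `longest_match_tokens` function, and the `known`/`unknown` labels are all hypothetical names chosen for illustration, and word boundaries are deliberately ignored to keep the example short.

```python
import unicodedata

# Toy lexicon standing in for a compiled dictionary (hypothetical data).
LEXICON = {"new", "new york", "york", "caf\u00e9"}
MAX_LEN = max(len(entry) for entry in LEXICON)


def longest_match_tokens(text):
    """Scan left to right and emit the longest lexicon entry at each position.

    A character that starts no entry is emitted as a one-character
    'unknown' token, so the scan always advances and never aborts.
    """
    # Normalize so that decomposed and precomposed forms compare equal.
    text = unicodedata.normalize("NFC", text)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for end in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:end].lower() in LEXICON:
                match = text[i:end]
                break
        if match is None:
            tokens.append(("unknown", text[i]))
            i += 1
        else:
            tokens.append(("known", match))
            i += len(match)
    return tokens


if __name__ == "__main__":
    print(longest_match_tokens("new york\u2605caf\u00e9"))
```

The key design choice is the fallback branch: when nothing matches, exactly one character is consumed and labelled, so a symbol outside the alphabet costs one token instead of derailing the rest of the analysis.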
My task will be to write algorithms that let character reading and recognition proceed smoothly, regardless of the character type. After reading, the input text will be separated into lexemes by a given class of delimiters. A new tokenization step will then assign a matching token to each lexeme.
The tokenization process should be divided into four separate processes: reading, recognition, delimiter handling, and tokenization. For each process, the algorithm should be designed or modified separately from the original code. Before the algorithm for a process is designed, a test case will be prepared, and the first design goal is to pass that test. Once the basic test passes, a more demanding test case will be proposed and the algorithm updated until it passes as well. Through this loop, the program will gradually become more robust, and the test documentation and source code for each unit will be produced and progressively improved. After all the units are complete, they will need to be joined together; a basic test case will be written before that step too, and the same iterative approach will be used to complete the integration. At that point the new tokenization design will be complete.
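The four processes and the test-first loop can be sketched as follows. This is only an illustration under stated assumptions: the function names (`read_chars`, `recognize`, `split_lexemes`, `tokenize`, `run_pipeline`), the class names, and the `WORD`/`NUMBER`/`SYMBOL` labels are hypothetical, and Python's `unicodedata` module stands in for the ICU character properties that the real C++ implementation would use.

```python
import unicodedata


def read_chars(raw: bytes) -> str:
    """Reading: decode UTF-8, replacing invalid bytes instead of aborting."""
    return raw.decode("utf-8", errors="replace")


def recognize(ch: str) -> str:
    """Recognition: classify one character by its Unicode general category."""
    major = unicodedata.category(ch)[0]
    if major in ("L", "M"):  # letters and combining marks
        return "letter"
    if major == "N":  # numbers
        return "digit"
    if major == "Z" or ch in "\t\n\r":  # separators and common whitespace
        return "space"
    return "other"


def split_lexemes(text: str) -> list:
    """Delimiter handling: whitespace is dropped, every other
    non-word character becomes a lexeme of its own."""
    lexemes, current = [], ""
    for ch in text:
        kind = recognize(ch)
        if kind in ("letter", "digit"):
            current += ch
            continue
        if current:
            lexemes.append(current)
            current = ""
        if kind == "other":
            lexemes.append(ch)
    if current:
        lexemes.append(current)
    return lexemes


def tokenize(lexemes: list) -> list:
    """Tokenization: attach a label to each lexeme."""
    tokens = []
    for lex in lexemes:
        kinds = {recognize(ch) for ch in lex}
        if kinds == {"digit"}:
            label = "NUMBER"
        elif kinds <= {"letter", "digit"}:
            label = "WORD"
        else:
            label = "SYMBOL"
        tokens.append((label, lex))
    return tokens


def run_pipeline(raw: bytes) -> list:
    """Join the four units in order."""
    return tokenize(split_lexemes(read_chars(raw)))


if __name__ == "__main__":
    # A basic test case written before the units are joined.
    assert run_pipeline("a b".encode("utf-8")) == [("WORD", "a"), ("WORD", "b")]
    # A harder case added afterwards: an invalid byte must not stop the run.
    assert run_pipeline(b"ab\xffcd") == [
        ("WORD", "ab"), ("SYMBOL", "\ufffd"), ("WORD", "cd")
    ]
    print("all pipeline checks passed")
```

Keeping each stage as a separate function means each one can get its own test cases first, and the joined pipeline only needs a small additional test once the units pass individually.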
Benefits to society (Deliverable)
Upon completion of this project, lttoolbox will provide smooth lexical analysis, complete test documentation will be available, the compiler will be updated, and the documentation will be upgraded.
After this, Apertium users who rely on lttoolbox as an analysis tool will no longer be held back by Unicode restrictions. It will also make Apertium accessible to more people as a useful lexical analysis and translation tool.
I will start on April 1. From April 1 to May 5, I have five weeks to prepare: to familiarize myself with how Apertium works and to learn the necessary background, such as Unicode encoding, the tokenization flow, and the ICU API. I will start contributing to the project as early as possible. In total I expect to have about 20 weeks (from April 1 to August 18) to work on the project. There may be a buffer week during the semester exam period (from June 22 to July 3).
Preparation before start (April 1 - May 5)
- Read the Unicode literature to gain an in-depth understanding of it.
- Learn how tokenization works in detail.
- Discuss the specific work with the mentors and follow their advice.
Week 2 and 3:
- Learn ICU, pick out the header files and functions that are likely to be needed, and get familiar with searching the official ICU documentation, so that the official explanation can be found quickly when an unfamiliar function call comes up later.
- In my own program, try to read a Unicode string from the standard input stream and tokenize it using the longest-match rule.
- Read apertium documentation
- Configure the lttoolbox environment
- Become familiar with the three commonly used lttoolbox tools, especially the analysis tool lt-proc.
- Select and compile 2-3 dictionaries and test each option on them.
Official start (May 5 - August 18)
Stage 1 (May 5-May 26): Familiarity with how lttoolbox works
- Discuss specific study plans with the mentors. Identify the specific code that needs to be updated.
- Resolve the problems left over from the previous weeks and study them further.
Week 2 and 3:
- Read the relevant code in the lttoolbox repository
- Start with the FSTProcessor::analysis() function and read the lttoolbox source code again
Deliverable: Understand how apertium implements tokenization.
Stage 2 (May 27-June 23): Algorithm design
- Discuss with the mentors how to improve the compatibility of lttoolbox with Unicode by combining the ICU with the original code
- Divide the tokenization process into four separate processes: reading, recognition, delimiter handling, and tokenization
- Following the TDD pattern, complete the reading and recognizing processes
- Complete the delimiter handling process
- Complete the tokenization process
- First Evaluation
Deliverable: Algorithms that implement these four processes
Stage 3 (June 24-July 28): Completion of updates to lttoolbox
Week 8 and 9 (exam period):
- Debug the algorithms of Stage 2
Week 10 and 11:
- Join the four processes above into a single pipeline
- Design a function that makes it easy for users to run tokenization
- Refactor the code
- Second Evaluation
Deliverable: The units are integrated and all project functions are implemented
Stage 4 (July 29-August 18): Update the compiler and documentation
Week 13 and 14:
- Update the compiler so that it can compile the new version of lttoolbox
- Update the lttoolbox documentation
- Complete all the unfinished tasks above
- Delivery of the project
Deliverable: The compiler and documentation are updated
During the school term (from April 28 to July 3), I will work six days a week. On working days (Tuesday to Friday) I can guarantee more than four hours a day, and on weekends more than six hours a day, for a total of more than 30 hours a week.
During the summer vacation (July 3 - August 18), I will focus on contributing to Apertium and will have no other work or study commitments. I will work at least 6 days a week (Monday to Saturday), more than 6 hours a day, for a total of more than 36 hours a week.
If everything goes according to plan, I hope to finish the project ahead of schedule.