User:Weizhe/GSOC 2020 proposal

Contact

Name: Weizhe Yang

E-mail: gavinwzmails@gmail.com

IRC nick: Weizhe

Location: Shandong, China

Time zone: UTC/GMT+8

Self Introduction

I am a software engineering student. I am familiar with C++ and Python, and comfortable working in mainstream Linux distributions and writing Bash scripts. I have a basic understanding of Unicode encoding and the tokenization flow, and have made some progress in learning ICU.

In my first year at university, I joined the university's high-performance computing center[1] as a research assistant. Through the research and study I did there, I gained a solid understanding of software architecture and open source projects. In March 2019, I took part in the "LanQiao" national programming competition, a programming skill and algorithm contest, and won second prize.

I am committed to improving my skills through hands-on project work, and I expect to gain more professional knowledge by working on Apertium.

Why I am interested in Apertium

Apertium focuses on open source software development for translation and lexical analysis. I am very interested in character encodings and lexical analysis, and this is exactly what Apertium does: it translates between different languages and analyzes words.

Target Project

The project I am interested in is "Robust tokenization in lttoolbox"[2]. I started working on it a month ago and completed the Coding Challenge; the source code and documentation for the challenge are kept in my repository[3]. I have also set up the Apertium core and lttoolbox on my PC, experimented with them, and gained a general understanding of how they work. As a result, I now understand the target project more deeply, which should help me handle analysis and tokenization better.

Proposal

Abstract

Apertium's lttoolbox currently reads and tokenizes characters from input streams, but it still has trouble handling non-alphabetic Unicode characters. While searching for the longest match, it may encounter a character that is not specified in the alphabet. This can abort character reading, derail the left-to-right (lr) analysis, produce unexpected results, and cause later processing to fail.
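The failure mode above can be illustrated with a minimal sketch of left-to-right longest-match tokenization that does not abort on unknown characters. This is not lttoolbox code: lttoolbox works with finite-state transducers, while the plain set `lexicon` and the function name `longest_match_tokens` are hypothetical stand-ins used only for illustration.

```python
def longest_match_tokens(text, lexicon):
    """Tokenize left to right, taking the longest lexicon entry at each position."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it on its own and continue
            # instead of aborting the whole analysis.
            tokens.append(("unknown", text[i]))
            i += 1
        else:
            tokens.append(("known", match))
            i += len(match)
    return tokens


if __name__ == "__main__":
    # The snowman (U+2603) is not in the toy lexicon.
    print(longest_match_tokens("abc\u2603d", {"a", "ab", "abc", "d"}))
```

The point of the sketch is the fallback branch: a character outside the known set becomes a one-character token, so the scan position always advances.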

My task will be to write algorithms so that character reading and recognition proceed smoothly regardless of character type. After reading, the input text is separated into lexemes by a given class of delimiters. A new tokenization step then assigns the matching token to each lexeme.
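As a rough sketch of separating text into lexemes by a class of delimiters, the following uses Unicode general categories from Python's standard `unicodedata` module. Which categories count as delimiters is an assumption made here for illustration (separators, punctuation, and control characters); the actual delimiter class would be decided with the mentors.

```python
import unicodedata


def is_delimiter(ch):
    # Assumption: separators (Z*), punctuation (P*) and control
    # characters (Cc) delimit lexemes; everything else is content.
    category = unicodedata.category(ch)
    return category[0] in ("Z", "P") or category == "Cc"


def partition(text):
    """Split text into lexemes at any delimiter character."""
    lexemes = []
    current = []
    for ch in text:
        if is_delimiter(ch):
            if current:
                lexemes.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        lexemes.append("".join(current))
    return lexemes


if __name__ == "__main__":
    # Mixed scripts, a fullwidth comma (U+FF0C) and a tab.
    print(partition("\u4f60\u597d\uff0cworld! na\u00efve\tcaf\u00e9"))
```

Because the decision is based on the character's category rather than a hard-coded ASCII list, the fullwidth comma is treated the same way as an ASCII comma.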

The tokenization process should be divided into four separate processes: read, partition, identify, and tag. For each process, the algorithm will be designed, or modified from the original code, independently of the others. Before the algorithm for a process is designed, a test case will be prepared, and the design goal is to pass that test. Once the basic test passes, a more in-depth test case will be proposed and the algorithm updated until it passes as well. Through this loop, the program will gradually become more robust. Test documentation and source code for each unit will be produced and progressively improved along the way. After all units are complete, they will need to be spliced together; a basic test case will be prepared before that step too, and the same iterative approach will be used to complete the splicing. This will complete the new tokenization design.
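The four-process split and the test-first loop might look roughly like the sketch below. All function names, the whitespace-only delimiter rule, and the set-based lexicon are placeholders assumed for illustration; they do not correspond to existing lttoolbox functions.

```python
import io


def read(stream):
    # Process 1: read the decoded text from an input stream.
    return stream.read()


def partition(text):
    # Process 2: split into lexemes (placeholder rule: whitespace only).
    return text.split()


def identify(lexeme, lexicon):
    # Process 3: decide whether the lexeme is known.
    return lexeme in lexicon


def tag(lexemes, lexicon):
    # Process 4: attach a token label to every lexeme.
    return [(lx, "known" if identify(lx, lexicon) else "unknown") for lx in lexemes]


def tokenize(stream, lexicon):
    # Splicing step: chain the four units together.
    return tag(partition(read(stream)), lexicon)


def test_tokenize():
    # In the TDD loop this test is written before the units above exist.
    stream = io.StringIO("cat \u2603 dog")
    expected = [("cat", "known"), ("\u2603", "unknown"), ("dog", "known")]
    assert tokenize(stream, {"cat", "dog"}) == expected


if __name__ == "__main__":
    test_tokenize()
    print("ok")
```

Keeping each process as its own function lets each one be tested and replaced separately before the spliced `tokenize` is tested as a whole.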

Benefits to society (Deliverable)

Upon completion of this project, lttoolbox will provide smooth lexical analysis, complete test documentation will be available, and the compiler and documentation will be updated.

After this, Apertium users who use lttoolbox as an analysis tool will no longer be troubled by Unicode restrictions. It will also give more people access to Apertium as useful lexeme analysis software.

Proposed Plan

I will start on April 1. From April 1 to April 28, I have four weeks to prepare: familiarizing myself with how Apertium works and learning the necessary background, such as Unicode encoding, the tokenization flow, and the ICU library. I will contribute to the project as early as possible, so I have a total of 20 weeks to work on it. There may be a buffer week during the semester exam period (June 22 to July 3).

Preparation before start (April 1 - April 28)

Week 1:

Read the Unicode literature to gain an in-depth understanding of it
Learn the working mode and details of tokenization.
Communicate specific work with mentors and follow advice.

Week 2 and 3:

Learn ICU: pick out the header files and functions likely to be needed, and get familiar with searching the official ICU documentation so that unfamiliar function calls can be looked up quickly later.
In my own program, try to read a Unicode string from the standard input stream and tokenize it using the longest-match rule

Week 4:

Read apertium documentation
Configure the lttoolbox environment

Official start (April 28 - August 18)

Stage 1 (April 28 - May 26): Getting familiar with how lttoolbox works

Week 1:

Discuss specific study plans with the mentors and identify the specific code that needs to be updated
Resolve the problems left over from the previous weeks and study them further

Week 2:

Become familiar with the three commonly used lttoolbox tools, especially the analysis tool lt-proc
Select and compile 2-3 dictionaries and test each option on them

Week 3 and 4:

Read the relevant code in the lttoolbox repository
Start with the FSTProcessor::analysis() function and read the lttoolbox source code again
Understand how lttoolbox implements tokenization
Deliverable: Familiarity with how Apertium works and with the use of lttoolbox

Stage 2 (May 27 - June 23): Algorithm design

Week 5:

Discuss with the mentors how to improve the compatibility of lttoolbox with Unicode by combining the ICU with the original code
Divide the tokenization process into four separate processes: read, partition, identify, and tag

Week 6:

Following the TDD pattern, complete the reading and partition processes

Week 7:

Complete the reading and identify processes

Week 8:

Complete the reading and tokenization processes
First Evaluation
Deliverable: Algorithms that implement these four processes

Stage 3 (June 24 - July 28): Completion of updates to lttoolbox

Week 9 and 10 (exam period):

Debug the algorithms of Stage 2

Week 11 and 12:

Splice the above four functions together
Design a function that makes it easy for users to run tokenization

Week 13:

Debug
Optimize the code
Second Evaluation
Deliverable: Unit splicing completed and all project functions implemented

Stage 4 (July 29 - August 18): Update the compiler and documentation

Week 14 and 15:

Update the compiler so that it can compile the new version of lttoolbox
Update the lttoolbox documentation

Week 16:

Complete all the unfinished tasks above
Delivery of the project
Deliverable: Updated compiler and documentation

Time Commitment

During the school term (April 28 to July 3), I will work six days a week. On working days (Tuesday to Friday), I can guarantee more than four hours a day, and on weekends more than six hours a day, for a total of more than 30 hours a week.

During the summer vacation (July 3 - August 18), I will focus on contributing to Apertium, with no other work or study commitments. I will work at least six days a week (Monday to Saturday), more than six hours a day, for more than 36 hours a week.

If everything goes according to plan, I hope to finish the project ahead of schedule.
