Difference between revisions of "User:Weizhe/GSOC 2020 proposal"

From Apertium
Revision as of 11:10, 22 March 2020


Contact

Name: Weizhe Yang

E-mail: gavinwzmails@gmail.com

IRC nick: Weizhe

Location: Shandong, China

Time zone: UTC/GMT+8

Github: https://github.com/GavinWz


Why I am interested in Apertium

I appreciate that Apertium focuses on open source software development for translation and lexical analysis. I'm very interested in character encodings and in the lexical analysis of words, and Apertium works on exactly that. It translates between different languages and tokenizes their words, which is what drew me to the project.


The published task I'm interested in

After a period of reading, I found the project I am interested in: "Robust tokenization in lttoolbox"[1]. I started on it a month ago and completed the Coding Challenge; the source code and documentation for the Coding Challenge are kept in my repository[2]. I also set up the Apertium core and lttoolbox environments, ran some operations, and gained a general understanding of how they work. Next, I will study in depth the specific code that implements analysis and tokenization in lttoolbox.


Proposal

Abstract

Apertium's lttoolbox currently reads and tokenizes characters from input streams, but it still has trouble handling non-alphabetic Unicode characters. While looking for the longest match, it may encounter a character that is not specified in the alphabet. This can abort character reading, derail the left-to-right (LR) analysis, produce unexpected results, and cause later processing steps to fail.

My task will be to write an algorithm that lets character reading and recognition proceed smoothly regardless of the character type. After reading, the text is separated into lexemes by a given class of delimiters. A new tokenization step then assigns the matching token to each lexeme. Finally, after testing, this functionality will be added to lttoolbox, yielding an enhanced lttoolbox.
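The idea in this abstract can be illustrated with a minimal sketch. It is not lttoolbox code: it assumes a plain Python set as a stand-in for a compiled dictionary, uses Unicode general categories from the standard `unicodedata` module to decide what counts as a word character, and the names `is_word_char` and `tokenize` are hypothetical. It only shows that a left-to-right longest-match loop can fall back to an "unknown" or "other" token instead of aborting when it meets a character outside the alphabet.

```python
import unicodedata


def is_word_char(ch):
    """Assumption: letters, combining marks and digits count as word characters."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, lexicon):
    """Left-to-right longest match that never aborts on unknown characters.

    `lexicon` is a set of lowercase entries (a toy stand-in for a dictionary).
    Returns a list of (kind, surface) pairs, kind in {"known", "unknown", "other"}.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        # Try the longest entry starting at i that ends on a word boundary.
        best = None
        for j in range(n, i, -1):
            ends_on_boundary = j == n or not is_word_char(text[j])
            if ends_on_boundary and text[i:j].lower() in lexicon:
                best = j
                break
        if best is not None:
            tokens.append(("known", text[i:best]))
            i = best
        elif is_word_char(text[i]):
            # Unknown word: consume the whole run of word characters.
            j = i
            while j < n and is_word_char(text[j]):
                j += 1
            tokens.append(("unknown", text[i:j]))
            i = j
        else:
            # Punctuation or symbol: emit it alone; whitespace is skipped.
            if not text[i].isspace():
                tokens.append(("other", text[i]))
            i += 1
    return tokens


if __name__ == "__main__":
    for token in tokenize("New York, naïve™ in Zürich", {"new york", "new", "in"}):
        print(token)
```

The inner loop is quadratic in the input length, which is acceptable for a sketch; a transducer-based implementation would instead track live states while reading each character once.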

Self-Introduction

I am a software engineering student. I am familiar with the syntax and data structures of C++ and Python, proficient with the Linux command line, and have written some practical Bash shell scripts.

In my freshman year, I joined the university's high-performance computing center[3] as a research assistant. Through research and study there, I gained a deep understanding of software architecture and open source projects. In March 2019, I participated in the "Lanqiao" national programming competition and won the second prize.

I now understand the basics of Unicode encoding and the tokenization flow, and I have begun to learn the ICU library and made some progress. I am committed to improving my skills through project practice, and I expect to gain more professional knowledge by working on Apertium, which gives me great motivation.

Benefits to society

Upon completion of this project, most Apertium users will benefit from smoother lexical analysis in lttoolbox. It may also attract more users to Apertium by offering them a good lexical analysis and translation tool.

Proposed Plan

Whether or not I am admitted, I will start preparing on April 1. From April 1 to April 28, I have four weeks to prepare: to familiarize myself with the way Apertium works and to begin learning the necessary background, such as Unicode encoding, the tokenization flow, and ICU. If I am lucky enough to be admitted, I will officially contribute to Apertium for 16 weeks, from April 28 to August 18, so I will have a total of 20 weeks to work on the project. I hope to work less, or not at all, during the exam period (June 22 to July 3); I will make up for this part of the work during the vacation.

Preparation before starting (1 April - 28 April)

Week 1:

  • Read the Unicode literature to gain an in-depth understanding of how Unicode is encoded.
  • Learn how tokenization works in detail.
  • Discuss the specific work with my mentor and follow their advice.

Weeks 2 and 3:

  • Learn ICU, pick out the header files and functions that I may use later, and get familiar with searching the official ICU documentation, so that I can quickly find the official explanation when I meet an unfamiliar function call.
  • In my own program, try reading Unicode strings from the standard input stream and tokenizing them with the longest-match rule.
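As a rough illustration of the kind of experiment planned for these weeks, the sketch below classifies characters by Unicode general category and groups a string into runs of the same class. It uses Python's standard `unicodedata` module as a stand-in for ICU's character property functions. The class names and the helpers `char_class` and `runs` are hypothetical. Grouping combining marks with letters is an assumption, made so that decomposed accents stay attached to their base letter.

```python
import itertools
import unicodedata


def char_class(ch):
    """Map a character to a coarse class based on its Unicode general category."""
    major = unicodedata.category(ch)[0]
    if major in ("L", "M"):  # letters and combining marks (assumption: grouped together)
        return "alphabetic"
    if major == "N":  # decimal digits, letter numbers, other numbers
        return "numeric"
    if major == "Z" or ch.isspace():  # separators plus control whitespace such as tab
        return "space"
    return "other"  # punctuation, symbols, remaining control characters


def runs(text):
    """Split text into maximal runs of characters that share the same class."""
    return [
        (cls, "".join(group))
        for cls, group in itertools.groupby(text, key=char_class)
    ]


if __name__ == "__main__":
    for cls, chunk in runs("abc 12€ 中文!"):
        print(cls, repr(chunk))
```

A real tokenizer would need finer distinctions than these four classes, but a table of this kind is a convenient way to check how unfamiliar characters are categorized before deciding how the analyser should treat them.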

Week 4:

  • Read apertium documentation
  • Configure the lttoolbox environment

Official start (April 28 - August 18)


Stage 1 (April 28 - May 26): Familiarity with how lttoolbox works


Week 1:

  • Discuss specific study plans with the mentors
  • Resolve the problems left over from the preparation weeks and study them further

Week 2:

  • Become familiar with the three commonly used lttoolbox tools, especially the analysis tool lt-proc
  • Select and compile 2-3 dictionaries and test each option on them

Weeks 3 and 4:

  • Read the relevant code in the lttoolbox repository. Deliverable: familiarity with the way Apertium works and with the use of lttoolbox
  • Begin to understand how lttoolbox implements tokenization


Stage 2 (May 27 - June 23): Initial implementation of lttoolbox updates


Weeks 5 and 6:

  • Discuss with the mentors how to improve the compatibility of lttoolbox with Unicode by combining the ICU with the original code
  • Learn more about ICU
  • Read the lttoolbox source code, starting with the FSTProcessor::analysis() function
  • Understand how lttoolbox implements tokenization

Weeks 7 and 8:

  • Modify the lttoolbox source code, using ICU, so that it can read a Unicode character stream, recognize lexemes, and classify each morpheme (tokenize)
  • Test and clean up the code
  • First Evaluation

Deliverable: A first version that can read Unicode character streams, recognize lexemes, and tokenize them.
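One part of this deliverable, reading a Unicode character stream robustly, can be sketched as follows. This is only an illustration under stated assumptions, not the planned lttoolbox change: it assumes UTF-8 input, uses Python's standard incremental decoder rather than ICU, and the function name `read_chars` is hypothetical. It shows two properties the reader needs: a multi-byte character split across two reads is still decoded correctly, and an invalid byte is replaced with U+FFFD instead of stopping the stream.

```python
import codecs
import io


def read_chars(stream, chunk_size=4096):
    """Yield decoded characters from a binary stream, one at a time.

    Assumption: the input is UTF-8. Invalid or truncated sequences are
    replaced with U+FFFD so that reading always continues to the end.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of input: flush any incomplete sequence still buffered.
            yield from decoder.decode(b"", final=True)
            break
        # The decoder keeps partial multi-byte sequences until the next chunk.
        yield from decoder.decode(chunk)


if __name__ == "__main__":
    data = "a€b".encode("utf-8")
    # A tiny chunk size forces the three-byte euro sign to span two reads.
    print(list(read_chars(io.BytesIO(data), chunk_size=2)))
```

Replacing bad input rather than raising an error is a design choice made for this sketch; an analyser could equally pass such bytes through unchanged, as long as it does not abort.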


Stage 3 (June 24 - July 28): Completion of updates to lttoolbox


Weeks 9 and 10 (exam period):

  • Building on the first version, merge in the original functionality

Weeks 11 and 12:

  • Complete the project code design
  • Debug the code

Week 13:

  • Optimize the code
  • Second Evaluation

Deliverable: A second version: a new lttoolbox that is Unicode compliant and fully functional


Stage 4 (July 29 - August 18): Update the compiler and documentation


Weeks 14 and 15:

  • Update the compiler so that it can compile the new version of lttoolbox
  • Update the lttoolbox documentation

Week 16:

  • Complete all the unfinished tasks above
  • Deliver the project

Deliverable: The new lttoolbox is implemented

During the school term (April 28 to July 3), I will work six days a week. On weekdays (Tuesday to Friday) I can commit to more than three hours a day, and on weekends to more than six hours a day, for nearly 30 hours a week and no less than 25 hours.

During the summer vacation (July 3 to August 18), I will focus on contributing to Apertium, with no other work or study commitments. I will work at least 6 days a week (Monday to Saturday), more than 6 hours a day, for close to 40 hours a week and no less than 36 hours.

If everything goes according to plan, I may finish the project ahead of time.