Difference between revisions of "User:Weizhe/GSOC 2020 proposal"

From Apertium
(GSOC 2020 proposal)
 
 
(15 intermediate revisions by 2 users not shown)
Latest revision as of 02:22, 12 April 2020


Contact

Name: Weizhe Yang

E-mail: gavinwzmails@gmail.com

IRC nick: Weizhe

Location: Shandong, China

Time zone: UTC/GMT+8

Github: https://github.com/GavinWz


Self Introduction

I am a software engineering student at Weifang University, China. I am familiar with C/C++, Python and Java, and proficient with mainstream Linux distributions and Bash scripting. I have a basic understanding of Unicode encoding and the tokenization flow, and have made some progress in learning ICU.


In my first year at university, I joined the university's high-performance computing center (http://cs.wfu.edu.cn/2014/0603/c1227a33048/page.htm) as a research assistant. Through research and study during that period, I gained a deep understanding of software architecture and open source projects. In March 2019, I participated in the "LanQiao" national programming competition - a programming skill and algorithm contest - and won the second prize.


I am committed to improving my skills through project practice, and I expect to gain more professional knowledge by working on Apertium.

Why I am interested in Apertium

Apertium focuses on open source software development for translation and lexical analysis. I am very interested in character encoding formats and lexical analysis, and this is exactly what Apertium works on: it can translate between different languages and analyze a wide variety of words.


Target Project

I found the project I was interested in: "Robust tokenization in lttoolbox" (http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Robust_tokenisation). I started on this project a month ago and completed the Coding Challenge. The source code and documents for the Coding Challenge are kept in my repository (https://github.com/GavinWz/Apertium). At the same time I configured the Apertium kernel and lttoolbox environment on my PC, ran some operations, and gained a general understanding of how they work. This gives me a better and deeper understanding of the target project, so that I can handle lexical analysis and tokenization more effectively.


Proposal

Abstract

Apertium's lttoolbox currently reads and tokenizes characters from input streams, but it still has trouble handling non-alphabetic Unicode characters. While looking for the longest match, it may encounter a character that is not specified in the alphabet. This can abort the character read, derail the left-to-right (LR) analysis, produce unexpected results, and cause later processing to fail.


My task will be to write algorithms so that character reading and recognition proceed smoothly, regardless of the character type. After reading, the input text is separated into lexemes by a given class of delimiters. A new tokenization step then assigns the matching token to each lexeme.
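The intended behaviour can be illustrated with a minimal Python sketch. It is not lttoolbox code: the function name `tokenize_longest_match` is hypothetical, and a plain Python set stands in for the compiled dictionary. The point is only that a left-to-right longest-match scan can emit an "unknown" token for a character outside its alphabet instead of stopping.

```python
def tokenize_longest_match(text, lexicon):
    """Left-to-right longest-match tokenization (illustrative sketch).

    Spans found in `lexicon` become ("known", span) tokens, whitespace is
    skipped, and any other character becomes an ("unknown", char) token so
    the scan never stalls on input outside the expected alphabet.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Try the longest candidate first, then shrink it.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in lexicon:
                tokens.append(("known", candidate))
                pos += length
                break
        else:
            # No entry matches here: emit one character and move on.
            tokens.append(("unknown", text[pos]))
            pos += 1
    return tokens
```

With the lexicon `{"new", "new york", "york", "city"}`, the input `"new york city \u2603 new"` yields `"new york"` as a single token (the longest match wins over `"new"`), and the snowman character comes out as an unknown token while scanning continues.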


The tokenization process should be divided into four separate processes: reading, recognition, delimiter handling, and tokenization. For each process, the algorithm will be designed, or adapted from the original code, separately. Before each algorithm is designed, a test case will be prepared, and the design goal is to pass that test. Once the basic test passes, a more demanding test case is proposed and the algorithm is updated until it passes as well. Through this loop, the program will gradually become more robust. Test documentation and source code for each unit will be produced and improved along the way. After all units are designed, they may need to be integrated; a basic test case will be prepared before that step too, and the same iterative approach should allow the integration to be completed quickly. This completes the new tokenization design.
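As a rough illustration of this decomposition, the sketch below writes each of the four processes as a small Python function, followed by the kind of basic tests that would be written first. All function names are hypothetical, the character classes are a simplifying assumption based on Unicode general categories, and a dictionary stands in for the real lexicon; none of this is taken from lttoolbox.

```python
import unicodedata

def read_chars(raw: bytes) -> str:
    """Stage 1 (reading): decode UTF-8, replacing malformed bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

def classify(ch: str) -> str:
    """Stage 2 (recognition): coarse class from the Unicode general category."""
    category = unicodedata.category(ch)
    if category[0] in ("L", "M", "N"):      # letters, marks, numbers
        return "word"
    if category[0] == "Z" or ch.isspace():  # separators and whitespace controls
        return "space"
    return "other"                          # punctuation, symbols, the rest

def split_lexemes(text: str) -> list:
    """Stage 3 (delimiter handling): group word characters, isolate the rest."""
    lexemes, current = [], ""
    for ch in text:
        kind = classify(ch)
        if kind == "word":
            current += ch
            continue
        if current:
            lexemes.append(current)
            current = ""
        if kind == "other":
            lexemes.append(ch)
    if current:
        lexemes.append(current)
    return lexemes

def assign_tokens(lexemes, lexicon):
    """Stage 4 (tokenization): label each lexeme, defaulting to 'unknown'."""
    return [(lex, lexicon.get(lex, "unknown")) for lex in lexemes]

# Basic per-stage tests, written before the functions in a test-first workflow.
assert read_chars(b"caf\xc3\xa9 \xff") == "caf\u00e9 \ufffd"
assert classify("\u00e9") == "word" and classify("!") == "other"
assert split_lexemes("caf\u00e9, na\u00efve!") == ["caf\u00e9", ",", "na\u00efve", "!"]
assert assign_tokens(["caf\u00e9", ","], {"caf\u00e9": "n"}) == [("caf\u00e9", "n"), (",", "unknown")]
```

Keeping the stages separate means a harder test case (for example, combining marks or an ideographic space) can be added against one stage without touching the others.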


Benefits to society (Deliverable)

Upon completion of this project, lttoolbox will provide smooth lexical analysis, complete test documentation will be available, the compiler will be updated, and the documentation will be brought up to date.


After this, Apertium users who rely on lttoolbox as an analysis tool will no longer be hindered by Unicode restrictions. It will also make Apertium available to more people as a useful tool for lexical analysis and translation.


Proposed Plan

I will start on April 1. From April 1 to May 5, I have five weeks to prepare, familiarize myself with the way Apertium works, and begin to learn the necessary background, such as Unicode encoding, the tokenization flow, and ICU syntax. I will contribute to the project as early as possible. In total I have about 20 weeks (from April 1 to August 18) to work on the project. There may be a buffer week during the semester exam period (from June 22 to July 3).


Preparation before the start (April 1 - May 5)


Week 1:

  • Read the Unicode literature to gain an in-depth understanding of it
  • Learn how tokenization works in detail.
  • Discuss the specific work with mentors and follow their advice.

Weeks 2 and 3:

  • Learn ICU, pick out the header files and functions that may be needed later, and get familiar with searching the official ICU documentation, so that the official explanation of an unfamiliar function call can be found quickly.
  • In my own program, try to read Unicode strings from the standard input stream and tokenize them using the longest-match rule
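One detail of reading Unicode text from a stream is that a chunk boundary may fall in the middle of a multi-byte UTF-8 sequence. The Python sketch below shows one way to handle this with the standard library's incremental decoder; the function name `iter_code_points` and the chunk size are illustrative choices, and the actual project would use C++ and ICU rather than this code.

```python
import codecs
import io

def iter_code_points(stream, chunk_size=4096):
    """Yield decoded characters from a binary stream, chunk by chunk.

    The incremental decoder buffers an incomplete multi-byte sequence at the
    end of a chunk and completes it when the next chunk arrives. Malformed
    input is replaced with U+FFFD instead of raising an exception.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Flush the decoder: a truncated trailing sequence is replaced.
            yield from decoder.decode(b"", final=True)
            return
        yield from decoder.decode(chunk)

# A tiny chunk size forces splits inside the 2-byte and 3-byte sequences.
data = "a\u00f1\u2603".encode("utf-8")
print(list(iter_code_points(io.BytesIO(data), chunk_size=2)))
```

To read from standard input instead of an in-memory buffer, `sys.stdin.buffer` can be passed as the stream.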

Week 4:

  • Read apertium documentation
  • Configure the lttoolbox environment

Week 5:

  • Become familiar with the three commonly used lttoolbox tools, especially the analysis tool lt-proc
  • Select and compile 2-3 dictionaries and test each option on them


Official start (May 5 - August 18)


Stage 1 (May 5-May 26): Familiarity with how lttoolbox works


Week 1:

  • Discuss specific study plans with mentors. Identify the specific code that needs to be updated
  • Resolve the problems left over from the last few weeks and study them further

Weeks 2 and 3:

  • Read the relevant code in the lttoolbox repository
  • Start with the FSTProcessor::analysis() function and read the lttoolbox source code again

Deliverable: An understanding of how Apertium implements tokenization.


Stage 2 (May 27-June 23): Algorithm design


Week 4:

  • Discuss with the mentors how to improve the compatibility of lttoolbox with Unicode by combining ICU with the original code
  • Divide the tokenization process into four separate processes: reading, recognition, delimiter handling, and tokenization

Week 5:

  • Following the TDD pattern, complete the reading and recognition processes

Week 6:

  • Complete the delimiter handling process

Week 7:

  • Complete the tokenization process
  • First Evaluation

Deliverable: Algorithms implementing these four processes


Stage 3 (June 24-July 28): Completion of updates to lttoolbox


Weeks 8 and 9 (exam period):

  • Debug the algorithms of Stage 2

Weeks 10 and 11:

  • Integrate the four processes above
  • Design a function that makes it easy for users to run tokenization

Week 12:

  • Debug the code
  • Refactor the code
  • Second Evaluation

Deliverable: The units are integrated and all project functionality is implemented


Stage 4 (July 29-August 18): Update the compiler and documentation


Weeks 13 and 14:

  • Update the compiler so that it can compile the new version of lttoolbox
  • Update the lttoolbox documentation

Week 15:

  • Complete any unfinished tasks from above
  • Deliver the project

Deliverable: Updated compiler and documentation


Time Commitment

During the school term (from April 28 to July 3), I will work six days a week. On working days (Tuesday to Friday), I can guarantee more than four hours a day, and more than six hours on weekends, for more than 30 hours a week.


During the summer vacation (July 3 - August 18), I will focus on contributing to Apertium and have no other work or study tasks. I will work at least 6 days a week (Monday to Saturday), more than 6 hours a day, for more than 36 hours a week.


If everything goes according to plan, the project will hopefully be finished ahead of time.