Difference between revisions of "User:Eiji"
Revision as of 11:24, 11 March 2023
Contact Information
Name: Eiji Miyamoto
E-mail address: motopon57@gmail.com
University: University of Manchester
IRC: thelounge72
GitHub: https://github.com/yypy22
Why is it that you are interested in Apertium?
I am intrigued by natural language processing and its applications. Apertium is free, open-source machine translation software, so it matches my interests. The community is also welcoming and supportive.
Which of the published tasks are you interested in?
Tokenization for spaceless orthographies in Japanese
What do you plan to do?
Investigating suitable tokenizers for East/South Asian languages, which usually do not use spaces between words, and implementing one.
Reasons why Google and Apertium should sponsor it
Apertium mainly translates between European languages. My proposal for Google Summer of Code 2023 will open up the possibility of future translation for Asian languages, which usually do not put spaces between the words of a sentence.
Work plan
Week 1:
May 29 - June 4 |
Investigating word segmentation and tokenization in the literature, and summarising useful findings in a report | Looking up NLP conference papers on word segmentation and tokenization, and checking popular Japanese tokenizers such as MeCab
Week 2:
June 5 - June 11 |
Testing possible algorithms for tokenization and comparing their pros and cons | N-gram, left-to-right longest match (LRLM), maximal matching, Viterbi
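As a rough illustration of one of the candidate algorithms, left-to-right longest match can be sketched in a few lines of Python. This is only a minimal sketch: the function name `longest_match_tokenize`, the toy `LEXICON`, and the single-character fallback for unknown text are assumptions made for the example, not part of apertium-jpn or of any existing tokenizer.

```python
def longest_match_tokenize(text, lexicon, max_len=None):
    """Greedy left-to-right longest match over a set of known words.

    Characters that start no lexicon entry are emitted as
    single-character tokens (an assumption made for this sketch).
    """
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the lexicon starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"東京", "東京都", "京都", "都", "に", "住む"}

print(longest_match_tokenize("東京都に住む", LEXICON))
```

Because the strategy is greedy, it always commits to the longest word at the current position, even when a shorter word would lead to a better segmentation of the rest of the sentence. That weakness is one reason to compare it with a lattice-based method such as Viterbi.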
Week 3:
June 12 - June 18 |
Analyzing drawbacks of the current tokenization for apertium-iii and apertium-jpn | Drawing up a report and deciding what the hybrid model needs to cover
Week 4:
June 19 - June 25 |
Producing a hybrid model | Trying to minimize the drawbacks of the current tokenizer and make the model more efficient
Week 5:
June 26 - July 2 |
Analyzing the model and improving it | Checking the memory usage and speed of the model, and improving them where there is room to do so
Week 6:
July 3 - July 9 |
Testing the hybrid model | Testing speed and accuracy with texts |
Mid-term Evaluation
Week 7:
July 10 - July 16 |
Porting the model to a faster language | Rewriting the Python model in C++
Week 8:
July 17 - July 23 |
Integrating the hybrid model into apertium-jpn | Sending a pull request and refactoring the code
Week 9:
July 24 - July 30 |
Testing and fixing bugs | Testing accuracy and speed with texts and refactoring the code |
Week 10:
July 31 - August 6 |
Finalising the GSoC project | Writing the report and completing tests
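The accuracy and speed tests planned for Weeks 6 and 9 could be scored along the following lines. This is a hedged sketch, not the project's actual test harness: it assumes accuracy is measured as precision, recall, and F1 over word-boundary positions against a hand-segmented reference, and the helper names `boundaries`, `boundary_prf`, and `time_tokenizer` are hypothetical.

```python
import time


def boundaries(tokens):
    """Character offsets at which a token ends, excluding the end of the text."""
    ends = set()
    pos = 0
    for tok in tokens[:-1]:
        pos += len(tok)
        ends.add(pos)
    return ends


def boundary_prf(gold, predicted):
    """Precision, recall and F1 of predicted word boundaries against gold ones.

    Returns 0.0 for a measure whose denominator is empty (a simplifying
    assumption for this sketch).
    """
    g, p = boundaries(gold), boundaries(predicted)
    true_pos = len(g & p)
    precision = true_pos / len(p) if p else 0.0
    recall = true_pos / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def time_tokenizer(tokenize, sentences):
    """Wall-clock seconds taken to tokenize every sentence once."""
    start = time.perf_counter()
    for sentence in sentences:
        tokenize(sentence)
    return time.perf_counter() - start


# Hypothetical gold and predicted segmentations of the same sentence.
gold = ["東京都", "に", "住む"]
predicted = ["東京", "都", "に", "住む"]
print(boundary_prf(gold, predicted))
```

Scoring boundaries rather than whole tokens gives partial credit when a tokenizer over-splits or under-splits a single word, which makes it easier to compare candidate algorithms on the same texts.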
Other Summer Plans
I have no other plans for the summer and can work full time.