User:Eiji
Contact Information
Name: Eiji Miyamoto
E-mail address: motopon57@gmail.com
University: University of Manchester
IRC: thelounge72
GitHub: https://github.com/yypy22
Languages: Japanese, English
Why is it that you are interested in Apertium?
I am intrigued by natural language processing and its applications. Apertium is free, open-source software for machine translation, so it matches my interests, and the community here is welcoming and supportive. I have attended an online NLP summer school at the University of Tokyo and have taken lectures related to NLP, such as Data Science, Machine Learning, and Artificial Intelligence.
Which of the published tasks are you interested in?
Tokenization for spaceless orthographies in Japanese https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
What do you plan to do?
I plan to investigate which tokenization approaches suit East and South Asian languages, which are usually written without spaces between words, and to implement a suitable tokenizer for Apertium.
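To illustrate the problem, here is a minimal sketch of forward maximum matching, one of the simple dictionary-based baselines for segmenting text written without spaces. It is not the tokenizer this proposal will deliver: the function name `max_match` and the toy lexicon are assumptions made up for this example, and a real system would take its word list from the apertium-jpn lexicon.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest substring found in the lexicon.
    Characters that start no lexicon entry are emitted one at a time.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matched: emit one character so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon, invented for illustration only (an assumption, not real data).
LEXICON = {"私", "は", "学生", "です", "東京", "東京都", "都", "に", "住む"}

if __name__ == "__main__":
    print(max_match("私は学生です", LEXICON))
    print(max_match("東京都に住む", LEXICON))
```

Because the method is greedy, it always prefers the longest entry (here 東京都 over 東京 + 都), which is exactly the kind of behaviour that has to be weighed against statistical approaches when comparing algorithms.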
Reasons why Google and Apertium should sponsor it
Apertium mainly translates between European languages. My proposal for Google Summer of Code 2023 would open up the possibility of translating Asian languages that are usually written without spaces between words. The Japanese repository in Apertium has not been updated for over two years, and I believe this project will make it more active. I also believe it will make it easier to create language pairs between Asian languages and other languages in the future.
Work plan
Time period | Weekly Goal | Details |
---|---|---|
Week 1: May 29 - June 4 | Investigating word segmentation and tokenization in the literature and summarising useful findings in a report | Looking up NLP conference papers on word segmentation or tokenization, and checking popular Japanese tokenizers such as MeCab and Juman. |
Week 2: June 5 - June 11 | Testing possible tokenization algorithms and identifying their pros and cons | N-gram, maximal matching, Viterbi, SentencePiece, etc. |
Week 3: June 12 - June 18 | Analyzing drawbacks of the current tokenization in apertium-iii and apertium-jpn | Drawing up a report and deciding what the hybrid model needs to cover. |
Week 4: June 19 - June 25 | Producing a hybrid model | Trying to minimize the drawbacks of the current tokenizer and make the model more efficient. |
Week 5: June 26 - July 2 | Analyzing the model and improving it | Checking the memory usage and speed of the model, and improving them where there is room. |
Week 6: July 3 - July 9 | Testing the hybrid model | Testing speed and accuracy on texts, comparing manually tokenized sentences with the output of the hybrid model. |
Mid-term Evaluation | | |
Week 7: July 10 - July 16 | Converting the model to a faster language | Porting the Python model to C++. |
Week 8: July 17 - July 23 | Integrating the hybrid model into apertium-jpn | Sending a pull request and refactoring the code. |
Week 9: July 24 - July 30 | Improving the twol file and lexicon file | Adding more kanji, hiragana, and katakana entries and organizing the files for easier use. |
Week 10: July 31 - Aug 6 | Improving disambiguation for Japanese | Adding grammar rules for lexical selection and determiners, and rules that remove noun readings. |
Week 11: Aug 7 - Aug 13 | Testing and fixing bugs | Testing accuracy and speed on texts and refactoring the code. |
Week 12: Aug 14 - Aug 20 | Finalising the GSoC project | Writing reports and completing tests. |
Final Evaluation | | |
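The accuracy testing in the work plan compares manually tokenized sentences with the tokenizer's output. One common way to score such a comparison is word-level precision, recall, and F1 over character spans. The sketch below is a minimal illustration under that assumption; the function names `spans` and `segmentation_scores` and the example sentence are hypothetical, and the project may end up using a different metric.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    result = set()
    pos = 0
    for token in tokens:
        result.add((pos, pos + len(token)))
        pos += len(token)
    return result


def segmentation_scores(gold, predicted):
    """Return (precision, recall, F1) for one sentence.

    A predicted token counts as correct only if both of its boundaries
    agree with a token in the manual (gold) segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must cover the same text")
    gold_spans = spans(gold)
    pred_spans = spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: the tokenizer wrongly splits 東京都 in two.
    gold = ["東京都", "に", "住む"]
    predicted = ["東京", "都", "に", "住む"]
    print(segmentation_scores(gold, predicted))
```

Scoring on spans rather than on token strings avoids giving credit when the same word appears at a different position in the sentence.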
Coding Challenge
Integration of the apertium-iii transducer into apertium-jpn: https://github.com/yypy22/apertium-jpn
Skills
Python, Java, C++ (intermediate), JavaScript (intermediate), HTML, CSS, Docker, Django, XML, Git, jQuery

I am currently a second-year student at the University of Manchester and have taken lectures on algorithms, data science, AI, programming, and software engineering. I have also attended an online NLP summer school at the University of Tokyo and won a solo silver medal (top 4%) in a Kaggle competition. Last summer I worked as an intern on a data feed system, mainly using Python, JavaScript, Docker, Git, HTML, CSS, and SQL. I believe my basic knowledge of Japanese NLP and my experience equip me to work on this project successfully.
Other Summer Plans
I do not have any other plans and can work on the project 40 hours per week or more. My summer vacation runs from 29 May to 18 September.