Difference between revisions of "User:Eiji"

From Apertium
 
(21 intermediate revisions by the same user not shown)
Line 2: Line 2:
Name:Eiji Miyamoto
Name:Eiji Miyamoto
E-mail address:motopon57@gmail.com
E-mail address:motopon57@gmail.com
University: University of Manchester
University: University of Manchester(BSc.Computer Science)
IRC:thelounge72
IRC:thelounge72
github:https://github.com/yypy22
github:https://github.com/yypy22
Language: Japanese, English
Language: Japanese, English

==Why is it that you are interested in Apertium?==
==Why is it that you are interested in Apertium?==
I am intrigued by natural language processing and its usage. Apertium is open-source and free software for machine translation, so apertium match my interest. The community here is
I am intrigued by natural language processing and its usage. NLP is widely used and it improves human and machine communication better. We can see many applications of NLP in real life. Apertium is open-source and free software for machine translation, so apertium matches my interest. The community here is very welcoming and supportive. I have always wanted to contribute to open-source projects, and I believe apertium is the best open-source software and community to work with.
welcoming and supportive too. I have taken NLP summer school at the University of Tokyo online and have taken some lectures related to NLP such as Data Science, Machine Learning,
and Artificial Intelligence.


==Which of the published tasks are you interested in?==
==Which of the published tasks are you interested in?==
Tokenization for spaceless orthographies in Japanese
Tokenization for spaceless orthographies in Japanese
https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies
==What do you plan to do?==
==What do you plan to do?==
Investigating the suitable tokenizer for east/south Asian languages which usually do not use spaces and implementing it.
Investigating the suitable tokenizer for east/south Asian languages which usually do not use spaces and implementing it. Besides, improving Japanese-related files.


==Reasons why Google and Apertium should sponsor it==
==Reasons why Google and Apertium should sponsor it==
Apertium translates European languages into other European languages mainly and my proposal for Google Summer of Code 2023 will open up the possibility
Apertium translates European languages into other European languages mainly and my proposal for Google Summer of Code 2023 will open up the possibility
for future translation in Asian languages which usually do not have space between words in sentences. The Japanese repository in apertium has not been updated for over two years
for future translation in Asian languages which usually do not have space between words in sentences. The Japanese repository in apertium has not been updated for over two years
and I believe my task for GSoC2023 will make this repository more active. I also believe my proposal will expand the possibility to create Asian Language and Other language pairs in
and I believe my task for GSoC2023 will make this repository more active. I also believe my proposal will expand the possibility of creating Asian Language and Other language pairs in
the future.
the future.


Line 33: Line 33:
May 29 - June 4
May 29 - June 4
|Investigating word segmentation and tokenization from paper, and summarising useful findings into a report
|Investigating word segmentation and tokenization from paper, and summarising useful findings into a report
|Looking up NLP conference papers on word segmentation or tokenization, and checking popular Japanese tokenizers such as MeCab and Juman.
|Looking up NLP conference papers on word segmentation or tokenization, and checking popular Japanese tokenizers such as MeCab and Juman. Summarising the findings in three A4 papers.
|-
|-
|-
|-
|Week 2:
|Week 2:
June 5 - June 11
June 5 - June 11
|Testing possible algorithms for tokenization and becoming aware of pros and cons of them
|Testing possible algorithms for tokenization and becoming aware of the pros and cons of them
|N-gram, Maximal matching, Viterbi, SentensePiece, etc...
|N-gram, Maximal matching, Viterbi, SentensePiece, etc... Discussing the algorithm with mentors.
|-
|-
|-
|-
Line 45: Line 45:
June 12 - June 18
June 12 - June 18
|Analyzing drawbacks of current tokenization for apertium-iii and apertium-jpn
|Analyzing drawbacks of current tokenization for apertium-iii and apertium-jpn
|Drawing up a report and thinking of what hybrid model needs to cover
|Drawing up a report and thinking of what the hybrid model needs to cover. Making a pseudocode for the model. Getting feedback from mentors about the pseudocode.
|-
|-
|-
|-
Line 51: Line 51:
June 19 - June 25
June 19 - June 25
|Producing a hybrid model
|Producing a hybrid model
|Trying to minimize the drawbacks of the current tokenizer and make the model more efficiently
|Trying to minimize the drawbacks of the current tokenizer and make the model more efficient. Coding the model according to the pseudocode in python. Getting feedback from mentors about the code.
|-
|-


Line 58: Line 58:
June 26 - July 2
June 26 - July 2
|Analyzing the model and improving it
|Analyzing the model and improving it
|Checking efficiency of memory usage and speed of the model. If it has room for improvement, improve it
|Checking efficiency of memory usage and speed of the model. If it has room for improvement, improve it. Testing speed and accuracy with texts, comparing manually tokenized sentences and the one tokenized with the hybrid model.
|-
|-


Line 64: Line 64:
| Week 6:
| Week 6:
July 3 - July 9
July 3 - July 9
|Testing the hybrid model
|Converting the model into a faster language
|Python model to C++ or faster language.
|Testing speed and accuracy with texts, comparing manually tokenized sentences and the one tokenized with the hybrid model
|-
|-
|-
|-
Line 73: Line 73:
| Week 7:
| Week 7:
July 10 - July 16
July 10 - July 16
|Continue to work on converting
|Converting the model into a faster language
|Fixing bugs and checking logic and function. Testing the model with simple sentences.
|Python model to C++
|-
|-
|-
|-
Line 80: Line 80:
July 17 - July 23
July 17 - July 23
|Converting the hybrid model into apertium-jpn
|Converting the hybrid model into apertium-jpn
|Sending a pull request and refactoring the code
|Sending a pull request and refactoring the code.
|-
|-


Line 86: Line 86:
| Week 9:
| Week 9:
July 24 - July 30
July 24 - July 30
|Improving twol file and lexicon file
|Adding more kanji, hiragana, and katakana in it and organizing the file for better use. Including roman numeric characters. Evaluation of coverage improvement. Sending pull request on jpn repository.
|-

|-
| Week 10:
July 31 - Aug 6
|Improving disambiguation for Japanese
|Adding grammar to lexical selection, determiner, and remove reading noun. Evaluation of Coverage improvement. Sending pull request on jpn repository.
|-

|-
| Week 11:
Aug7 - Aug13
|Testing and fixing bugs
|Testing and fixing bugs
|Testing accuracy and speed with texts and refactoring the code
|Testing accuracy and speed with texts and refactoring the code. Modifying Japanese-related documents too.
|-
|-
|-
|-
| Week 10:
| Week 12:
July 31 - August 6
Aug 14 - August 20
|Finalise GSOC project
|Finalise GSOC project
|Writing report and completing tests
|Writing reports and completing tests.
|-
|-
|-
|-
Line 100: Line 114:
|}
|}
==Coding Challenge==
==Coding Challenge==
Integration apertium-iii transducer into apertium-jpn
Integration apertium-iii transducer into apertium-jpn. https://github.com/yypy22/apertium-jpn
I modified MakeFile.am and modes.xml to use tokenized.py from apertium-iii. I added tokenizer-related code such as $(LANG1).autotok.hfst into makefile and program tag in modes.xml, dropping tokenized.py file into jpn repository.
https://github.com/yypy22/apertium-jpn
==Skills==
Python, Java, C++(intermediate), JavaScript(intermediate), HTML, CSS, Docker, Django, XML, Git, jQuery
I am currently a second-year student at the University of Manchester and have taken algorithm, data science, AI, programming, and software engineering lectures. I also have taken summer school at Tokyo University on NLP online and got a solo silver medal(top 4%) from the Kaggle competition. I worked as an intern last summer for a data feed system and used python, js, docker, git, HTML, CSS, and SQL mainly.
I believe my basic NLP knowledge of Japanese and my experience equip me to work on this project successfully.
==Other Summer Plans==
==Other Summer Plans==
I do not have any other plan and I can work full time
I do not have any other plan and can work on the project 40hrs/week or more. My summer vacation will start from 29th May to 18th September.

Latest revision as of 11:48, 20 March 2023

Contact Information

   Name: Eiji Miyamoto
   E-mail address: motopon57@gmail.com
   University: University of Manchester (BSc Computer Science)
   IRC: thelounge72
   GitHub: https://github.com/yypy22
   Languages: Japanese, English

Why is it that you are interested in Apertium?

I am intrigued by natural language processing (NLP) and its uses. NLP is widely used, improves communication between humans and machines, and has many real-life applications. Apertium is free and open-source software for machine translation, so it matches my interests. The community here is very welcoming and supportive. I have always wanted to contribute to open-source projects, and I believe Apertium is the best open-source software and community to work with.

Which of the published tasks are you interested in?

Tokenization for spaceless orthographies in Japanese https://wiki.apertium.org/wiki/Task_ideas_for_Google_Code-in/Tokenisation_for_spaceless_orthographies

What do you plan to do?

I plan to investigate a suitable tokenizer for East and South Asian languages, which usually do not use spaces between words, and to implement it. I also plan to improve the Japanese-related files.

Reasons why Google and Apertium should sponsor it

Apertium mainly translates between European languages. My proposal for Google Summer of Code 2023 will open up the possibility of future translation for Asian languages, which usually do not put spaces between words in sentences. The Japanese repository in Apertium has not been updated for over two years, and I believe my GSoC 2023 task will make it more active. I also believe my proposal will make it easier to create language pairs between Asian languages and other languages in the future.

Work plan

{|
! Time period
! Weekly Goal
! Details
|-
| Week 1: May 29 - June 4
| Investigating word segmentation and tokenization in papers, and summarising useful findings in a report
| Looking up NLP conference papers on word segmentation or tokenization, and checking popular Japanese tokenizers such as MeCab and Juman. Summarising the findings in three A4 pages.
|-
| Week 2: June 5 - June 11
| Testing possible algorithms for tokenization and learning their pros and cons
| N-gram, maximal matching, Viterbi, SentencePiece, etc. Discussing the algorithms with mentors.
|-
| Week 3: June 12 - June 18
| Analyzing drawbacks of current tokenization for apertium-iii and apertium-jpn
| Drawing up a report and deciding what the hybrid model needs to cover. Writing pseudocode for the model. Getting feedback from mentors on the pseudocode.
|-
| Week 4: June 19 - June 25
| Producing a hybrid model
| Trying to minimize the drawbacks of the current tokenizer and make the model more efficient. Coding the model in Python according to the pseudocode. Getting feedback from mentors on the code.
|-
| Week 5: June 26 - July 2
| Analyzing the model and improving it
| Checking the memory usage and speed of the model and improving them where there is room. Testing speed and accuracy on texts by comparing manually tokenized sentences with those tokenized by the hybrid model.
|-
| Week 6: July 3 - July 9
| Converting the model into a faster language
| Porting the Python model to C++ or another faster language.
|-
| Mid-term Evaluation
|-
| Week 7: July 10 - July 16
| Continuing the conversion
| Fixing bugs and checking logic and functions. Testing the model with simple sentences.
|-
| Week 8: July 17 - July 23
| Integrating the hybrid model into apertium-jpn
| Sending a pull request and refactoring the code.
|-
| Week 9: July 24 - July 30
| Improving the twol file and lexicon file
| Adding more kanji, hiragana, and katakana to them and organizing the files for easier use. Including roman numeric characters. Evaluating the coverage improvement. Sending a pull request to the jpn repository.
|-
| Week 10: July 31 - Aug 6
| Improving disambiguation for Japanese
| Adding grammar to lexical selection, determiner, and remove reading noun. Evaluating the coverage improvement. Sending a pull request to the jpn repository.
|-
| Week 11: Aug 7 - Aug 13
| Testing and fixing bugs
| Testing accuracy and speed on texts and refactoring the code. Updating Japanese-related documents too.
|-
| Week 12: Aug 14 - Aug 20
| Finalising the GSoC project
| Writing reports and completing tests.
|-
| Final Evaluation
|}
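
As a rough illustration of two items in the work plan, the sketch below shows maximal matching (one of the Week 2 candidate algorithms) and the kind of comparison against manually tokenized sentences described for Week 5. It is only a sketch under stated assumptions: the function names `maximal_match`, `boundaries`, and `boundary_f1`, the tiny lexicon, and the example sentence are all made up for illustration and are not taken from apertium-jpn or apertium-iii. The hybrid model itself is not specified in this proposal and is not what is shown here.

```python
# Illustrative sketch only: the lexicon and sentences are hypothetical examples,
# not data or code from apertium-jpn or apertium-iii.


def maximal_match(text, lexicon, max_len=None):
    """Greedy forward longest-match segmentation.

    At each position, take the longest lexicon entry that starts there;
    fall back to a single character when nothing in the lexicon matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends = set()
    pos = 0
    for token in tokens:
        pos += len(token)
        ends.add(pos)
    return ends


def boundary_f1(gold_tokens, predicted_tokens):
    """Precision, recall, and F1 over token-end offsets.

    The final offset is always shared by both segmentations of the same
    text, so these scores are slightly optimistic.
    """
    gold = boundaries(gold_tokens)
    pred = boundaries(predicted_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    lexicon = {"私", "は", "学生", "です", "東京", "東京都", "に", "住む"}
    gold = ["私", "は", "学生", "です"]  # hypothetical manual tokenization
    predicted = maximal_match("私は学生です", lexicon)
    print(predicted)
    print(boundary_f1(gold, predicted))
```

Greedy longest match is simple and fast, but it always commits to the longest entry at each position, which is one reason the plan also lists Viterbi-style and N-gram approaches for comparison.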

Coding Challenge

Integration of the apertium-iii transducer into apertium-jpn: https://github.com/yypy22/apertium-jpn

I modified Makefile.am and modes.xml to use tokenized.py from apertium-iii. I added tokenizer-related code such as $(LANG1).autotok.hfst to the makefile, added a program tag to modes.xml, and copied the tokenized.py file into the jpn repository.

Skills

Python, Java, C++ (intermediate), JavaScript (intermediate), HTML, CSS, Docker, Django, XML, Git, jQuery

I am currently a second-year student at the University of Manchester and have taken lectures on algorithms, data science, AI, programming, and software engineering. I have also attended an online NLP summer school at the University of Tokyo and won a solo silver medal (top 4%) in a Kaggle competition. Last summer I worked as an intern on a data feed system, mainly using Python, JavaScript, Docker, Git, HTML, CSS, and SQL. I believe my basic knowledge of Japanese NLP and my experience equip me to work on this project successfully.

Other Summer Plans

I do not have any other plans and can work on the project 40 hours per week or more. My summer vacation runs from 29th May to 18th September.