Difference between revisions of "User:Eiji"


Revision as of 18:10, 11 March 2023

== Contact Information ==

* Name: Eiji Miyamoto
* E-mail address: motopon57@gmail.com
* University: University of Manchester
* IRC: thelounge72
* GitHub: https://github.com/yypy22
* Languages: Japanese, English

==Why is it that you are interested in Apertium?==

I am intrigued by natural language processing and its applications. Apertium is open-source, free machine translation software, so it matches my interests. The community here is also welcoming and supportive. I have taken an online NLP summer school at the University of Tokyo and attended lectures related to NLP, such as Data Science, Machine Learning, and Artificial Intelligence.

==Which of the published tasks are you interested in?==

Tokenization for spaceless orthographies in Japanese

==What do you plan to do?==

Investigating a suitable tokenizer for East and South Asian languages that are usually written without spaces between words, and implementing it.
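One of the candidate approaches named in the work plan below is maximal matching (longest-match-first) against a dictionary. As a minimal sketch of the idea, assuming a toy lexicon rather than Apertium's actual dictionaries:

```python
# Longest-match-first (MaxMatch) tokenizer sketch for spaceless text.
# The lexicon here is a toy example, not an actual Apertium lexicon.
def max_match(text, lexicon, max_len=8):
    """At each position, greedily take the longest dictionary entry;
    fall back to a single character for unknown material."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no dictionary entry matched: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

lexicon = {"私", "は", "学生", "です"}  # toy Japanese lexicon
print(max_match("私は学生です", lexicon))  # → ['私', 'は', '学生', 'です']
```

Greedy matching is simple and fast but can segment wrongly when a long dictionary entry overlaps a better short one, which is one reason the plan also considers statistical methods such as Viterbi decoding.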

==Reasons why Google and Apertium should sponsor it==

Apertium mainly translates between European languages, and my proposal for Google Summer of Code 2023 will open up the possibility of future translation of Asian languages, which usually do not put spaces between words in sentences. The Japanese repository in Apertium has not been updated for over two years, and I believe my GSoC 2023 task will make this repository more active. I also believe my proposal will expand the possibility of creating pairs between Asian and other languages in the future.

== Work plan ==

{| class="wikitable" border="1"
|-
!Time period
!Weekly Goal
!Details
|-
|Week 1<br>May 29 - June 4
|Investigating word segmentation and tokenization in the literature, and summarising useful findings into a report
|Looking up NLP conference papers on word segmentation and tokenization, and checking popular Japanese tokenizers such as MeCab and Juman
|-
|Week 2<br>June 5 - June 11
|Testing possible tokenization algorithms and learning their pros and cons
|N-gram, maximal matching, Viterbi, SentencePiece, etc.
|-
|Week 3<br>June 12 - June 18
|Analyzing drawbacks of the current tokenization in apertium-iii and apertium-jpn
|Drawing up a report and deciding what the hybrid model needs to cover
|-
|Week 4<br>June 19 - June 25
|Producing a hybrid model
|Trying to minimize the drawbacks of the current tokenizer and make the model more efficient
|-
|Week 5<br>June 26 - July 2
|Analyzing the model and improving it
|Checking the model's memory usage and speed; improving it if there is room for improvement
|-
|Week 6<br>July 3 - July 9
|Testing the hybrid model
|Testing speed and accuracy on texts, comparing manually tokenized sentences with those tokenized by the hybrid model
|-
| colspan="3" |'''Mid-term Evaluation'''
|-
|Week 7<br>July 10 - July 16
|Converting the model into a faster language
|Python model to C++
|-
|Week 8<br>July 17 - July 23
|Integrating the hybrid model into apertium-jpn
|Sending a pull request and refactoring the code
|-
|Week 9<br>July 24 - July 30
|Testing and fixing bugs
|Testing accuracy and speed on texts and refactoring the code
|-
|Week 10<br>July 31 - August 6
|Finalising the GSoC project
|Writing the report and completing tests
|-
| colspan="3" |'''Final Evaluation'''
|}
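Week 6 involves comparing manually tokenized sentences with the hybrid model's output. A common way to score this comparison is boundary F1 over word-break positions; a minimal sketch of that metric (my own illustration, not a tool from the Apertium codebase):

```python
def boundaries(tokens):
    """Cumulative character offsets of internal word breaks."""
    ends, pos = set(), 0
    for t in tokens[:-1]:
        pos += len(t)
        ends.add(pos)
    return ends

def boundary_f1(gold, predicted):
    """F1 score of predicted word boundaries against gold boundaries."""
    g, p = boundaries(gold), boundaries(predicted)
    if not g and not p:
        return 1.0  # both segmentations are a single token
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["私", "は", "学生", "です"]       # manual segmentation
pred = ["私", "は学", "生", "です"]       # hypothetical model output
print(round(boundary_f1(gold, pred), 2))  # → 0.67
```

Scoring boundaries rather than whole tokens gives partial credit when a segmentation is only slightly off, which makes the metric less harsh during early model development.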

==Coding Challenge==

Integrating the apertium-iii transducer into apertium-jpn: https://github.com/yypy22/apertium-jpn

==Other Summer Plans==

I do not have any other plans, and I can work full time.