Difference between revisions of "User:Eiji"

From Apertium
Revision as of 10:50, 11 March 2023

Contact Information

   Name: Eiji Miyamoto
   E-mail address: motopon57@gmail.com
   University: University of Manchester
   IRC: thelounge72
   GitHub: https://github.com/yypy22

Why is it that you are interested in Apertium?

   I am intrigued by natural language processing and its applications. Apertium is free, open-source software for machine translation, so it matches my interests. The community here is also welcoming and supportive.

Which of the published tasks are you interested in?

   Tokenization for spaceless orthographies in Japanese

What do you plan to do?

   Investigating a suitable tokenizer for East/South Asian languages that are written without spaces, and implementing it.

Reasons why Google and Apertium should sponsor it

   Apertium mainly translates between European languages. My proposal for Google Summer of Code 2023 would open up the possibility
   of future translation for Asian languages, which usually do not put spaces between words in a sentence.

Work plan

  • Project completed
Week 1: May 29 - June 4
   Investigating word segmentation and tokenization in the literature, and summarising useful findings in a report.
   Looking up NLP conference papers on word segmentation and tokenization, and checking popular Japanese tokenizers such as MeCab.

Week 2: June 5 - June 11
   Testing possible algorithms for tokenization and identifying their pros and cons.
   N-gram, left-to-right longest match (LRLM), maximal matching, Viterbi.

Week 3: June 12 - June 18
   Analyzing drawbacks of the current tokenization for apertium-iii and apertium-jpn.
   Testing the speed and accuracy of the tokenizer.

Week 4: June 19 - June 25
   Producing a hybrid model.
   Trying to minimize the drawbacks of the current tokenizer.

Week 5: June 26 - July 2
   Analyzing the model and improving it.
   Checking the memory usage and speed of the model.

Week 6: July 3 - July 9
   Testing the hybrid model.
   Testing speed and accuracy with texts.

Mid-term Evaluation

Week 7: July 10 - July 16
   Converting the model into a faster language.
   Python model to C++.

Week 8: July 17 - July 23
   Integrating the hybrid model into apertium-jpn.
   Sending a pull request and refactoring the code.

Week 9: July 24 - July 30
   Testing and fixing bugs.
   Testing accuracy and speed with texts, and refactoring the code.

Week 10: July 31 - August 6
   Finalising the GSoC project.
   Writing the report and completing tests.
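
To make the Week 2 and Week 3 items more concrete, here is a minimal Python sketch of one of the listed candidates, left-to-right longest match, together with a simple boundary-based accuracy score. It is only an illustration under stated assumptions: the function names (longest_match_tokenize, boundaries, boundary_f1), the toy lexicon and the gold segmentation are all hypothetical, and none of this is the tokenizer used in apertium-jpn or apertium-iii.

```python
def longest_match_tokenize(text, lexicon):
    """Greedy left-to-right longest match against a set of known words.

    Characters that start no lexicon entry are emitted as single-character tokens.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        end = i + 1  # fallback: unknown single character
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                end = j
                break
        tokens.append(text[i:end])
        i = end
    return tokens


def boundaries(tokens):
    """Character offsets at which a token ends (the end of the text is included)."""
    positions = set()
    offset = 0
    for token in tokens:
        offset += len(token)
        positions.add(offset)
    return positions


def boundary_f1(predicted, gold):
    """F1 score over token-end offsets of two segmentations of the same text."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    correct = len(pred_b & gold_b)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_b)
    recall = correct / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Toy lexicon and gold segmentation, assumed purely for illustration.
LEXICON = {"この", "先", "先生", "生き", "きのこ", "のこる", "る"}
TEXT = "この先生きのこる"
GOLD = ["この", "先", "生き", "のこる"]

predicted = longest_match_tokenize(TEXT, LEXICON)
print(predicted)
print(boundary_f1(predicted, GOLD))
```

In this toy case the greedy strategy prefers 先生 and きのこ over the intended reading, which is the kind of drawback a hybrid model would try to reduce. Because the end-of-text offset always matches, the score is slightly optimistic on short inputs.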