User:Davidho/Application
(draft)
Contents
- 1 Contact information
- 2 Why is it you are interested in machine translation?
- 3 Why is it that you are interested in the Apertium project?
- 4 Which of the published tasks are you interested in?
- 5 What do you plan to do?
- 6 Reasons why Google and Apertium should sponsor it
- 7 A description of how and who it will benefit in society
- 8 List your skills and give evidence of your qualifications
- 9 List any non-Summer-of-Code plans you have for the Summer
Contact information
Name: Junhao He
Email: davidho7066@gmail.com
IRC: Davidho
Why is it you are interested in machine translation?
I am Chinese and have studied English for more than 10 years and Spanish for 2 years. However, when I encounter sentences or phrases in English or Spanish that I cannot understand, none of the translation systems available so far satisfies me. The longer I study foreign languages, the better I understand how they differ from Chinese. I have always wanted to create something that can handle Chinese translation properly, though I know it would be a huge project. It was not until I took a course on compilers last year that I understood how a translator works, and that was when I became interested in machine translation.
Why is it that you are interested in the Apertium project?
I first came across Apertium when I was reading the list of accepted projects for GSoC 2013, and it was the Chinese-Spanish Apertium system that caught my attention. Before I knew about Apertium, I had no idea how to get started. I then began reading the Apertium documentation and joined the IRC channel #apertium. After doing some research on Apertium, I found three characteristics of the system that impressed me. The first and most important is that Apertium is an open-source machine translation engine that has been expanded to handle more divergent language pairs. It is well designed and allows everyone to contribute to it. This ensures its continued growth and convinces me that it has a promising future. Second, the linguistic data files are encoded in XML-based formats. XML files are easy to understand, which enables people with little linguistic knowledge to expand the dictionaries. This helps to improve the quality of existing pairs and to add new ones.
Which of the published tasks are you interested in?
Prototype recursive transfer implementation
What do you plan to do?
Before GSoC: Help improve the quality of the zho-spa language pair, especially the transfer rules. I think this will help me understand transfer rules more deeply.
Community bonding period: Go through the Apertium documentation and get more familiar with the system. Get in closer contact with the community. Review finite-state dependency parsing and LALR(1) grammars.
Week 1: Propose a new formalism for transfer rules and discuss it with the mentor.
Week 2: Continue proposing and discussing the new formalism with the mentor, and turn it into formal documentation.
Week 3: Complete the documentation of the new format. Write a number of transfer rules in the new formalism between Chinese and Spanish or English.
Week 4: Continue writing transfer rules and organize them into a well-structured list.
Deliverable 1: Documentation of the new formalism and a list containing a number of transfer rules.
Week 5: Rewrite the rules of the Chinese-Spanish pair using the new formalism.
Week 6: Rewrite the rules of the Chinese-Spanish pair using the new formalism.
Week 7: Rewrite the rules of the Chinese-Spanish pair using the new formalism.
Week 8: Test, debug and write documentation.
Deliverable 2: XML files of the zho-spa pair with rewritten transfer rules.
Week 9: Integrate the new rules into the Chinese-Spanish pair.
Week 10: Integrate the new rules into the Chinese-Spanish pair.
Week 11: Test and debug.
Week 12: Clean-up and dissemination.
Deliverable 3: A full implementation of a prototype recursive transfer.
Reasons why Google and Apertium should sponsor it
Apertium was designed to translate between closely related languages, where translation does not involve much constituent reordering. As the system develops, however, it is both inevitable and highly beneficial to expand it to more divergent language pairs, for which reordering is a key concern. This project aims to develop a prototype of a new module that can handle long-distance reordering. If it succeeds, it will be a big step forward for the whole system.
In addition, the zho-spa (Chinese-Spanish) pair was created in GSoC 2013, but it has not been released because it does not yet meet the required quality. As far as I know, no one is currently working on this pair, which is a real waste. These two languages are among the most widely spoken in the world, so I am convinced that this pair is of great value. However, the differences between them make the pair hard to develop. I think that if I can eventually propose a new formalism for transfer rules, it will help to reduce the difficulty of development.
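As a rough illustration of the kind of long-distance reordering mentioned above, the following minimal Python sketch applies one reordering rule recursively over a small constituent tree. It is not Apertium code and not the proposed formalism: the `Node` class, the `RULES` table, the labels `NP`, `RelClause` and `N`, and the English glosses used as tokens are all hypothetical and chosen only for this example. It assumes a parse tree is already available and shows a single pattern (a relative clause that precedes its noun is moved after it), applied at every depth.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    """A constituent: a label plus child constituents or plain tokens."""
    label: str
    children: List[Union["Node", str]]


# Hypothetical rule table: (parent label, child labels) -> new child order.
# The single rule moves a pre-nominal relative clause after its noun.
RULES = {
    ("NP", ("RelClause", "N")): (1, 0),
}


def child_label(child):
    """Label used for rule matching; plain tokens all match as 'tok'."""
    return child.label if isinstance(child, Node) else "tok"


def reorder(node):
    """Reorder the children bottom-up, then apply any rule matching this node."""
    if isinstance(node, str):
        return node
    children = [reorder(c) for c in node.children]
    key = (node.label, tuple(child_label(c) for c in children))
    order = RULES.get(key, range(len(children)))  # no rule: keep the order
    return Node(node.label, [children[i] for i in order])


def flatten(node):
    """Read the tokens of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [tok for c in node.children for tok in flatten(c)]


# Source order (glossed): [likes [[I bought] book]] person
inner = Node("NP", [Node("RelClause", ["I", "bought"]), Node("N", ["book"])])
tree = Node("NP", [Node("RelClause", ["likes", inner]), Node("N", ["person"])])

print(" ".join(flatten(tree)))           # likes I bought book person
print(" ".join(flatten(reorder(tree))))  # person likes book I bought
```

Because the rule is applied at every level of the tree, the same pattern reorders both the inner and the outer noun phrase, however far apart the moved constituents end up in the flat word sequence.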
A description of how and who it will benefit in society
Chinese is the most spoken language in the world by number of native speakers, and the need to communicate with people from other countries is growing rapidly. I believe that more and more people will become interested in China and may want to learn Chinese. However, learning Chinese is not easy. A system that can translate Chinese into other languages will be a great help both for everyone who wants to learn Chinese and for Chinese speakers who want to communicate with foreigners.
List your skills and give evidence of your qualifications
I am a 3rd-year undergraduate majoring in Software Engineering at South China University of Technology.
I am skilled in C/C++ and have completed several projects in these languages. I can also use Python for small tasks. Last year I took courses in Principles of Compilers and Formal Languages, and it was these courses that made me interested in Natural Language Processing. I believe that my knowledge of parsers, syntax analyzers, finite automata and finite-state transducers will help me understand the Apertium system more deeply.
I can speak three languages: Chinese (mother tongue), English (fluent) and Spanish (refreshing). These three languages come from three different language groups, and I am sure that knowing the differences among them will be of great help in proposing a new formalism for transfer rules.
I am currently working on implementing some of the functions of a columnar database. The work involves parallel programming techniques such as OpenMP, MPI and pthreads. It is a large project, and I have to collaborate with other people over the Internet, so I am confident that I can carry out programming work remotely.
List any non-Summer-of-Code plans you have for the Summer
Before my summer vacation begins, I will have final exams at the end of June, which may last one week. Aside from that, I will focus on nothing but the Apertium project.