User:Linh le/2014Application
Contact information
Name: Linh Le
Email address: linh.ai.le@gmail.com
Nick on IRC#apertium: LinhLe
Sourceforge username: linhaile
Timezone: Before Apr 24th, UTC + 9. After Apr 24th, UTC + 7
Why is it that you are interested in learning about machine translation?
As a Vietnamese born in Vietnam, I have been learning English since I was a child, because it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort and dedication but also a lot of money, so not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers and for enjoyment.
Currently, I am studying Japanese, which is my third language. The more languages I learn (and the more deeply I study them), the more fascinated I am by both the similarities and the differences across languages. Translation is a good tool not only for discovering these similarities and differences but also for showing them. In addition, I like programming, so machine translation is a great combination of these two interests.
Why are you interested in Apertium project?
First of all, since Apertium is an open-source project, everyone can contribute, and I think this model is well suited to language translation. Since languages are created, developed, modified, and used by people, they require human involvement and interaction, and it is therefore valuable for many people to work together on one project. In addition, the platform is very open, in that people of various backgrounds and experience can join and contribute (which is the reason why I want to do so as well).
Second, although Vietnamese and English are not closely related, there are many similarities in their grammatical structures, and I therefore believe that Apertium's rule-based approach to MT can provide a good system for translating between this pair.
I'm interested in GSOC for Apertium in particular because even though I've decided that I would like to contribute no matter what the result will be, participating in GSOC and having a mentor who knows Apertium well will be a greater help for me to get more experience in the field (coding as well as linguistic knowledge). In my opinion, a mentor's advice is always appreciated, and in many cases, even just discussing an issue can lead to good solutions.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in adopting an unreleased pair, specifically the Vietnamese-English pair.
Proposal
Adopting unreleased Vietnamese-English pair and bringing it to release quality
Why Google and Apertium should sponsor it
Although Vietnamese is spoken by around 80 million people as their first language in Vietnam, as well as by millions more in the US, Australia, Canada, Germany and elsewhere, there is not much work in progress to support translation between Vietnamese and other languages. In Apertium, the vi-en pair is still in the incubator and has not been developed much.
As part of the Austroasiatic family, Vietnamese is considered to be closely related to some languages and dialects spoken in Cambodia, Thailand, Laos, India and Bangladesh. In addition, having been under Chinese occupation for centuries, the language also bears similarities to Cantonese in terms of sounds and grammar. Therefore, the rules created in this project can be useful for some other languages and dialects, especially ones that are less often represented.
I'm aware that there are other MT systems available for Vietnamese-English translation (Microsoft, Google, and one by a Vietnamese company that is very slow and only works for short sentences), among which Google Translate seems to be the best. I have only been able to test these systems manually, and found that while Google's coverage is great for monosyllabic words, many of the bi- or polysyllabic words that make up a large part of Vietnamese are not recognized as units and are instead translated syllable by syllable, making little sense. In addition, the grammar and lexical selection of the system also seem to require more work. I understand that it is impossible to create a fluent translation system in three months, but that should be enough time to create a good framework for continued development.
How and who it will benefit in the society
Vietnamese speakers, especially monolingual speakers, will benefit. That is to say, at least 80 million people. As stated above, studying a language is costly in terms of time and money, and an MT system is a feasible alternative, since more and more people these days are getting access to the Internet. With this system, reading news and documents and exchanging ideas will become easier for Vietnamese speakers who hope to interact more with people from other countries, and it will also open a bigger door for non-Vietnamese speakers who hope to learn more about the language and the country.
Work plan
Post-application period
- Continue working on the coding challenge.
- Learn more about XML, CG
Bonding period
- Continue working on the coding challenge (if not yet completed).
- Get familiar:
- Be on IRC often and get used to the community
- Read through the already available files for vi-en translation. Read through other available documents
- Learn more about XML, CG
- Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
- Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since dictionary files for English are already available
- Gather available online and offline dictionary resources
- Continue rough draft of monolingual dix
Week 1 (May 19-25)
- Continue rough versions of Vietnamese morphological dictionary (around 3000 words, including all types)
- Start bilingual dictionary
Week 2 (May 26-June 1)
- 3500 words in dix
- Start transfer rules
- Expand bilingual dictionary
- Document
Week 3 (June 2-8)
- Run testvoc
- Write transfer rules
- 4000 words in dix
- Expand bilingual dix
- (If possible) 200-word corpus test (coverage 60%)
Week 4 (June 9-15)
- 200-word corpus test
- Add transfer rules
- 4800 words in dix
- Expand bilingual dix
- Clean testvoc
- Document
Deliverable: 2 monolingual dix and a bilingual dix, WER 20%, naive bidix-trimmed coverage 60-70%, testvoc clean for <n>, <qnt>, <num>, articles, interjections
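As a rough illustration of what the naive coverage figure in these deliverables measures, the sketch below counts how many lexical units in the morphological analyser's output received at least one analysis. It assumes the Apertium stream format, in which each unit is written as ^surface/analysis$ and an unknown word is marked with a leading *. The function name naive_coverage and the sample analyses are hypothetical, and escaped characters in real analyser output are ignored here for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units that got at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown word is marked by a '*' at the start of its analysis.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known units and one unknown unit.
sample = "^tôi/tôi<prn>$ ^là/là<v>$ ^sinh viên/sinh viên<n>$ ^xyz/*xyz$"
print(f"{naive_coverage(sample):.0%}")
```

For the bidix-trimmed figure, the same count would be taken on output from an analyser that has been trimmed to the entries present in the bilingual dictionary.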
Week 5 (June 16-22)
- Add more transfer rules
- Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.)
- 5300 words in dix
- Expand bilingual dix
Week 6 (June 23-29)
- Finish midterm evaluation
- Add lexical selection rules
- Add transfer rules
- Start CG
- Expand bilingual dix
- 5500 words in dix
- Run testvoc
- Document
Week 7 (June 30-July 6)
- Clean testvoc
- Add transfer rules
- Add CG rules
- 6000 words in dix
Week 8 (July 7-13)
- 500-word corpus test
- 6500 words in dix
- Add lexical selection rules
- Add CG rules
- Document
Deliverable: coverage 70-80%, WER 20-30%, testvoc clean for <v>, <cnjcoo>, <cnjsub>
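The WER targets in these deliverables refer to word error rate: the word-level edit distance between the system's translation and a reference translation, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the evaluation tool that would actually be used in the project. The function name word_error_rate and the example sentences are hypothetical, and tokenisation is assumed to be simple whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one deleted word and one substituted word.
reference = "I am a student at the university"
hypothesis = "I am student in the university"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

A lower WER means that less post-editing is needed to turn the system's output into the reference translation.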
Week 9 (July 14-20)
- Training of Vietnamese tagger
- 6725 words in dix
- 500-word corpus test
- Regression testing
Week 10 (July 21-27)
- Training of English tagger
- Add lexical selection rules
- Clean testvoc
- 7000 words in dix
- Document
Week 11 (July 28-August 3)
- 8000 words in dix
- Add transfer rules
Week 12 (August 4-9)
- 500-word corpus test
- 8500 words in dix
Completion (August 10-22)
- Document, clean up, testing
- (If possible) 9000 words in dix
Deliverable/Goals: ~9000 words in monolingual dix, naive bidix-trimmed coverage 80-90%, WER 40-50%, testvoc clean for (the above and) <adj>, <adv>, <def>, <ind>
List your skills and give evidence of your qualifications
I am a junior at Smith College, USA, but in terms of Computer Science experience I am a sophomore. I have experience coding in Python and Java, and I spent a summer working on CSS and Artificial Intelligence. I studied Japanese during my two years in America and am now continuing during my current year in Japan. Apertium is my first open source project, but it has been nice so far, and I will definitely continue contributing.
I am a native Vietnamese speaker and a student at an American college, and I have been using English as my main language of communication for the last 4 years (I take classes in both Japanese and English in Japan). Before going to America 4 years ago, I had studied English for 5 years.
Non-GSOC schedule
I'm considering taking a CS class, but it will become my second priority if I get accepted, and I will discuss it with my mentor before I make any decision. Other than that, I do not have any plans for the summer, and I am willing to spend 40 hours a week on Apertium because I think that is the time I will need to complete my goals.