Difference between revisions of "User:Linh le/2014Application"

Revision as of 12:15, 21 March 2014

Contact information

Name: Linh Le
Email address: linh.ai.le@gmail.com
Nick on IRC#apertium: LinhLe
Sourceforge username:
Timezone:

Why is it that you are interested in learning about machine translation?

Born and raised in Vietnam, I have been learning English since I was a kid, as it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort and dedication but also a lot of money, and that not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers, entertainment or life experience.

Why are you interested in the Apertium project?

First of all, since Apertium is an open-source project, everyone can contribute, and I think this model is well suited to language translation. Since language is created, developed, influenced and used by people, it requires human involvement and interaction, so it is great to have many people working on one project. In addition, the platform is very open, in that people of various backgrounds and experiences can join.
Second, although the two languages are not directly related, there are many similarities in the grammatical structures of the pair I have in mind (Vietnamese<->English), and therefore I believe that Apertium's rule-based approach can provide a good system for translation between them.

I'm interested in GSOC for Apertium in particular because I really hope to contribute to Apertium from now on. I know that I do not yet have much experience in the field (in coding as well as linguistics), and I'm trying my best to learn; taking part in GSOC and having a mentor who knows Apertium well is a good way to get used to (). Therefore, I view this as a great opportunity to learn more about linguistic issues and frameworks as well as Apertium, and after completing GSOC I would like to continue contributing to the community.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in "Adopting an unreleased language pair", specifically the Vietnamese-English pair.

Proposal

Adopting unreleased Vietnamese-English pair and bringing it to release quality

Why Google and Apertium should sponsor it

Although Vietnamese is spoken as a first language by around 80 million people in Vietnam, as well as by millions more in the US, Australia, Canada, Germany and elsewhere, there is not much work in progress on supporting translation from Vietnamese to other languages.

As part of the Austroasiatic family, Vietnamese is considered to be closely related to some languages and dialects spoken in Cambodia, Thailand, Laos, India and Bangladesh. In addition, after centuries of Chinese rule, the language also shares many similarities with Cantonese in sounds and grammar. Therefore, the rules created in this project may be useful for some other languages and dialects, especially ones that are less well represented.

I'm aware that there are other MT systems available for Vietnamese-English translation (Microsoft, Google, and one by a Vietnamese company that is very slow and only works for short sentences), among which Google Translate seems to be the best. I have only been able to test these systems manually, and found that while Google's coverage is great for monosyllabic words, many of the bi- and polysyllabic words that make up a large part of Vietnamese are not recognized as units and are instead translated syllable by syllable, which makes little sense. In addition, the grammar and lexical selection of that system also seem to require much more work. I understand that it's impossible to create a fluent translation system in three months, but that should be enough time to create a good framework for continued development.
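To make the polysyllabic-word problem concrete, here is a minimal sketch of greedy longest-match segmentation over space-separated syllables. The function name `segment`, the `max_len` limit, the toy lexicon and its rough glosses are all assumptions made for illustration; this is not Apertium code and not the method used by any of the systems mentioned above.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    At each position, try the longest run of syllables first and fall
    back to a single syllable when no multi-syllable entry matches.
    """
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted, even if unknown.
            if candidate in lexicon or n == 1:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon with rough glosses (illustrative only).
full_lexicon = {
    "tôi": "I",
    "là": "to be",
    "sinh": "to be born",
    "viên": "pill",
    "sinh viên": "student",
}
# The same lexicon without the two-syllable entry.
mono_lexicon = {k: v for k, v in full_lexicon.items() if " " not in k}

sentence = "tôi là sinh viên".split()
print(segment(sentence, full_lexicon))  # ['tôi', 'là', 'sinh viên']
print(segment(sentence, mono_lexicon))  # ['tôi', 'là', 'sinh', 'viên']
```

With the two-syllable entry present, "sinh viên" stays together and can be glossed as "student"; without it, the two syllables are looked up separately, which is the kind of output described above.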

Work plan

Post-application period
  • Continue working on the coding challenge.
  • Learn more about XML, CG
Bonding period
  • Continue working on the coding challenge (if it hasn't been completed).
  • Get familiar:
    • Be on IRC often and get used to the community
    • Read through the already available files for vi-en translation. Read through other available documents
    • Learn more about XML, CG
  • Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
  • Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since English resources are already available
  • Gather available online and offline dictionary resources
  • Create rough drafts of monolingual dix
Week 1 (May 19-25)
  • Continue rough versions of two morphological dictionaries (around 3000 words in each, including all types)
  • Start bilingual dictionary
Week 2 (May 26-June 1)
  • 3500 words in dix
  • Start transfer rules
  • Expand bilingual dictionary
  • Document
Week 3 (June 2-8)
  • Run testvoc
  • Write transfer rules
  • 4000 words in dix
  • Expand bilingual dix
  • (If possible) 400-word corpus test (coverage 60%)
Week 4 (June 9-15)
  • 400-word corpus test (WER 30%, coverage 60-70%)
  • Clean testvoc
  • Add transfer rules
  • 4800 words in dix
  • Expand bilingual dix
  • Document

Deliverable: 2 monolingual dix and bilingual dix
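As a rough illustration of what the coverage targets above measure, the sketch below computes naive coverage from analyser output in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word is marked with `*`. The function name `naive_coverage`, the sample sentence and the tags in it are assumptions for illustration only.

```python
import re


def naive_coverage(analysed: str) -> float:
    """Return the fraction of lexical units that received an analysis.

    Assumes Apertium stream format: ^surface/analysis$ for known words
    and ^surface/*surface$ for unknown ones.
    """
    units = re.findall(r"\^(.*?)\$", analysed)
    if not units:
        return 0.0
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


# Hypothetical analyser output; the tags are made up for this example.
sample = "^tôi/tôi<prn>$ ^là/là<vbser>$ ^sinh viên/sinh viên<n>$ ^xyz/*xyz$"
print(f"{naive_coverage(sample):.2f}")  # 0.75
```

Here three of the four units are analysed, so coverage is 75%. Naive coverage only says that a word is in the dictionary, not that its analysis or translation is correct.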

Week 5 (June 16-22)
  • Add more transfer rules
  • Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.)
  • 5300 words in dix
  • Expand bilingual dix
Week 6 (June 23-29)
  • Finish midterm evaluation
  • Add lexical selection rules
  • Add transfer rules
  • Start CG
  • Expand bilingual dix
  • 5500 words in dix
  • Run testvoc
  • Document
Week 7 (June 30-July 6)
  • Clean testvoc
  • Add transfer rules
  • Add CG rules
  • 6000 words in dix
Week 8 (July 7-13)
  • 1000-word corpus test (coverage 70-80%, WER 30-40%)
  • 6500 words in dix
  • Add lexical selection rules
  • Add CG rules
  • Document

Deliverable: More complete dix and rules
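For reference, the WER figures in the plan can be read as word-level edit distance divided by the length of the reference translation. The following is a minimal sketch under that assumption; `word_error_rate` is an illustrative name, and the evaluation scripts actually used for the pair may normalise text or count errors differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("I am a student", "I am student"))  # 0.25
```

In the example, one of the four reference words is missing from the hypothesis, giving a WER of 25%. Lower is better, so progress between corpus tests shows up as a falling WER alongside rising coverage.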

Week 9 (July 14-20)
  • Training of Vietnamese tagger
  • 6725 words in dix
  • 1000-word corpus test (coverage 70-80%, WER 30-40%, improving on the previous result)
  • Regression testing
Week 10 (July 21-27)
  • Training of English tagger
  • Add lexical selection rules
  • Clean testvoc
  • 7000 words in dix
  • Document
Week 11 (July 28-August 3)
  • 8000 words in dix
  • Add transfer rules
Week 12 (August 4-9)
  • 1000-word corpus test (coverage 80-90%)
  • 8500 words in dix

Deliverable: MT coverage 80-90%

Completion (August 10-22)
  • Document, clean up, testing
  • (If possible) 9000 words in dix