Difference between revisions of "User:Linh le/2014Application"
Revision as of 06:07, 21 March 2014
Contact information
Name: Linh Le
Email address: linh.ai.le@gmail.com
Nick on IRC#apertium: LinhLe
Sourceforge username:
Why is it that you are interested in learning about machine translation?
As a Vietnamese born in Vietnam, I have been learning English since I was a kid, as it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort, and dedication but also a lot of money, and that not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers, entertainment or (life exp).
Why are you interested in the Apertium project?
First of all, since Apertium is an open-source project, everyone can contribute, and I think this model is well suited to language translation: language is created, developed, shaped, and used by people, so it requires human involvement and human understanding. *edit* Second, although the two languages are not directly related, there are many similarities in the grammatical structures of the language pair I have in mind (Vietnamese<->English), and therefore I believe that the rule-based MT of Apertium can provide a system ()
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in Adopting an unreleased pair, specifically the pair Vietnamese - English.
Proposal
Adopting unreleased Vietnamese-English pair and bringing it to release quality
Why Google and Apertium should sponsor it
Although Vietnamese is spoken by around 90 million people in Vietnam and is the 7th most spoken language in the US and the 6th in Australia, there is not much ()
Work plan
====== Post-application period ======
- Continue working on the coding challenge.
- Learn more about XML, CG
====== Bonding period ======
- Continue working on the coding challenge (if it hasn't been completed).
- Get familiar:
- Be on IRC often and get used to the community
- Read through the already available files for vi-en translation. Read through other available documents
- Learn more about XML, CG
- Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
- Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since files for English are already available
- Gather available online and offline dictionary resources
- Create rough drafts of monolingual dix
====== Week 1 (May 19-25) ======
- Continue rough versions of two morphological dictionaries
(Around 3000 words in each, including all types)
- Start bilingual dictionary
====== Week 2 (May 26-June 1) ======
- 3500 words in dix
- Start transfer rules
- Expand bilingual dictionary
- Document
====== Week 3 (June 2-8) ======
- Run testvoc
- Write transfer rules
- 4000 words in dix
- Expand bilingual dix
- (If possible) 400-word corpus test (coverage 60%)
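The coverage targets in this plan can be checked with a short script. The following is a minimal sketch, assuming the corpus has already been run through the morphological analyser and the output is in the Apertium stream format, where an unknown word is marked with a leading `*` in its analysis. The function name `naive_coverage` and the tags in the sample sentence are hypothetical and used only for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis starting with '*' means the analyser did not know the word.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^Tôi/tôi<prn>$ ^yêu/yêu<vblex>$ ^phở/*phở$ ^./.<sent>$"
print(f"{naive_coverage(sample):.0%}")  # prints "75%"
```

This counts every token equally, so frequent unknown words lower the figure more than rare ones, which is the behaviour wanted when tracking progress on a fixed test corpus.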
====== Week 4 (June 9-15) ======
- 400-word corpus test (WER 30%, coverage 60-70%)
- Clean testvoc
- Add transfer rules
- 4800 words in dix
- Expand bilingual dix
- Document
Deliverable: 2 monolingual dix and bilingual dix
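The WER (word error rate) figures in this plan compare the system's translation against a reference translation. The following is a minimal sketch, assuming WER is defined as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference, and that tokens are separated by whitespace. The function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("I love Vietnamese food", "I like Vietnamese food"))  # prints 0.25
```

Lower values are better, and the value can exceed 1.0 when the hypothesis is much longer than the reference.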
====== Week 5 (June 16-22) ======
- Add more transfer rules
- Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.)
- 5300 words in dix
- Expand bilingual dix
====== Week 6 (June 23-29) ======
- Finish midterm evaluation
- Add lexical selection rules
- Add transfer rules
- Start CG
- Expand bilingual dix
- 5500 words in dix
- Run testvoc
- Document
====== Week 7 (June 30-July 6) ======
- Clean testvoc
- Add transfer rules
- Add CG rules
- 6000 words in dix
====== Week 8 (July 7-13) ======
- 1000-word corpus test (coverage 70-80%, WER 30-40%)
- 6500 words in dix
- Add lexical selection rules
- Add CG rules
- Document
Deliverable: More complete dix and rules
====== Week 9 (July 14-20) ======
- Training of Vietnamese tagger
- 6725 words in dix
- 1000-word corpus test (coverage 70-80%, WER 30-40%, but better than the previous result)
- Regression testing
====== Week 10 (July 21-27) ======
- Training of English tagger
- Add lexical selection rules
- Clean testvoc
- 7000 words in dix
- Document
====== Week 11 (July 28-August 3) ======
- 8000 words in dix
- Add transfer rules
====== Week 12 (August 4-9) ======
- 1000-word corpus test (coverage 80-90%)
- 8500 words in dix
Deliverable: MT coverage 80-90%
====== Completion (August 10-22) ======
- Document, clean up, testing
- (If possible) 9000 words in dix