Difference between revisions of "User:Linh le/2014Application"
(Blanked the page) |
|||
(29 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | == Contact information == |
||
− | '''Name''': Linh Le<br> |
||
− | '''Email address''': linh.ai.le@gmail.com<br> |
||
− | '''Nick on IRC#apertium''': LinhLe<br> |
||
− | '''Sourceforge username''':<br> |
||
− | |||
− | == Why is it that you are interested in learning about machine translation? == |
||
− | Born and raised in Vietnam, I have been learning English since I was a kid, as it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort and dedication but also a lot of money, and that not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers, entertainment or (life exp). |
||
− | |||
− | == Why are you interested in the Apertium project? == |
||
− | First of all, because Apertium is an open-source project, everyone can contribute, and I think this model is well suited to language translation: language is created, developed, influenced and used by people, so it requires human involvement and human understanding. *edit* |
||
− | Second, although the two languages are not directly related, there are many similarities in the grammatical structures of the specific language pair I have in mind (Vietnamese<->English), and I therefore believe that the rule-based MT of Apertium can provide a system () |
||
− | |||
− | == Which of the published tasks are you interested in? What do you plan to do? == |
||
− | I'm interested in '''Adopting an unreleased pair''', specifically the pair Vietnamese - English. |
||
− | |||
− | == Proposal == |
||
− | ==== Adopting unreleased Vietnamese-English pair and bringing it to release quality ==== |
||
− | ====== Why Google and Apertium should sponsor it ====== |
||
− | Although Vietnamese is spoken by around 90 million people in Vietnam, and is the 7th most spoken language in the US and the 6th in Australia, there is not much () |
||
− | |||
− | ==== Work plan ==== |
||
− | ====== Post-application period ====== |
||
− | * Continue working on the coding challenge. |
||
− | * Learn more about XML, CG |
||
− | |||
− | ====== Bonding period ====== |
||
− | * Continue working on the coding challenge (if it hasn't been completed). |
||
− | * Get familiar with the project: |
||
− | ** Be on IRC often and get used to the community |
||
− | ** Read through the files already available for vi-en translation, as well as other available documentation |
||
− | ** Learn more about XML, CG |
||
− | * Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/). |
||
− | * Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since files for English are already available |
||
− | * Gather available online and offline dictionary resources |
||
− | * Create rough drafts of monolingual dix |
||
− | |||
− | ====== Week 1 (May 19-25) ====== |
||
− | * Continue rough versions of two morphological dictionaries<br> |
||
− | (Around 3000 words in each, including all types)<br> |
||
− | * Start bilingual dictionary |
||
− | |||
− | ====== Week 2 (May 26-June 1) ====== |
||
− | * 3500 words in dix |
||
− | * Start transfer rules |
||
− | * Expand bilingual dictionary |
||
− | * Document |
||
− | |||
− | ====== Week 3 (June 2-8) ====== |
||
− | * Run testvoc |
||
− | * Write transfer rules |
||
− | * 4000 words in dix |
||
− | * Expand bilingual dix |
||
− | * (If possible) 400-word corpus test (coverage 60%) |
||
− | |||
− | ====== Week 4 (June 9-15) ====== |
||
− | * 400-word corpus test (WER 30%, coverage 60-70%) |
||
− | * Clean testvoc |
||
− | * Add transfer rules |
||
− | * 4800 words in dix |
||
− | * Expand bilingual dix |
||
− | * Document |
||
− | |||
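The coverage and WER targets in the corpus tests above could be checked along the following lines. This is only a minimal sketch under stated assumptions: coverage is taken as the share of corpus tokens found in the dictionary, and WER as the word-level edit distance between a reference translation and the system output, divided by the reference length. The function names and the toy data are hypothetical and are not part of Apertium's own tooling.

```python
def coverage(tokens, known_words):
    """Naive coverage: fraction of tokens that appear in the dictionary.

    `known_words` is a hypothetical set of surface forms; a real test would
    take it from the analyser output instead.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_words)
    return known / len(tokens)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy data, for illustration only.
    print(coverage(["tôi", "là", "sinh", "viên"], {"tôi", "là"}))  # 0.5
    print(word_error_rate("i am a student", "i am student"))  # 0.25
```

With these definitions, a target such as "WER 30%, coverage 60-70%" corresponds to `word_error_rate(...)` at or below 0.30 and `coverage(...)` between 0.60 and 0.70 on the test corpus.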
− | '''''Deliverable: 2 monolingual dix and bilingual dix''''' |
||
− | |||
− | ====== Week 5 (June 16-22) ====== |
||
− | * Add more transfer rules |
||
− | * Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.) |
||
− | * 5300 words in dix |
||
− | * Expand bilingual dix |
||
− | |||
− | ====== Week 6 (June 23-29) ====== |
||
− | * Finish midterm evaluation |
||
− | * Add lexical selection rules |
||
− | * Add transfer rules |
||
− | * Start CG |
||
− | * Expand bilingual dix |
||
− | * 5500 words in dix |
||
− | * Run testvoc |
||
− | * Document |
||
− | |||
− | ====== Week 7 (June 30-July 6) ====== |
||
− | * Clean testvoc |
||
− | * Add transfer rules |
||
− | * Add CG rules |
||
− | * 6000 words in dix |
||
− | |||
− | ====== Week 8 (July 7-13) ====== |
||
− | * 1000-word corpus test (coverage 70-80%, WER 30-40%) |
||
− | * 6500 words in dix |
||
− | * Add lexical selection rules |
||
− | * Add CG rules |
||
− | * Document |
||
− | |||
− | '''''Deliverable: More complete dix and rules''''' |
||
− | |||
− | ====== Week 9 (July 14-20) ====== |
||
− | * Training of Vietnamese tagger |
||
− | * 6725 words in dix |
||
− | * 1000-word corpus test (coverage 70-80%, WER 30-40%, but improved over the previous result) |
||
− | * Regression testing |
||
− | |||
− | ====== Week 10 (July 21-27) ====== |
||
− | * Training of English tagger |
||
− | * Add lexical selection rules |
||
− | * Clean testvoc |
||
− | * 7000 words in dix |
||
− | * Document |
||
− | |||
− | ====== Week 11 (July 28-August 3) ====== |
||
− | * 8000 words in dix |
||
− | * Add transfer rules |
||
− | |||
− | ====== Week 12 (August 4-9) ====== |
||
− | * 1000-word corpus test (coverage 80-90%) |
||
− | * 8500 words in dix |
||
− | |||
− | '''''Deliverable: MT coverage 80-90%''''' |
||
− | |||
− | ====== Completion (August 10-22) ====== |
||
− | * Document, clean up, test |
||
− | * (If possible) 9000 words in dix |