Difference between revisions of "User:Linh le/2014Application"

From Apertium
== Contact information ==
 
'''Name''': Linh Le<br>
 
'''Email address''': linh.ai.le@gmail.com<br>
 
'''Nick on IRC#apertium''': LinhLe<br>
 
'''Sourceforge username''':<br>
 
 
== Why is it that you are interested in learning about machine translation? ==
 
Born and raised in Vietnam, I have been learning English since I was a kid, as it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort and dedication but also a lot of money, and that not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers, entertainment or life experience.
 
 
== Why are you interested in the Apertium project? ==
 
First of all, since Apertium is an open-source project, everyone can contribute, and I think that this model is well suited to language translation. Since language is created, developed, affected and used by people, it requires human involvement and human understanding. *edit*
 
Second, although the two languages are not closely related, there are many similarities in the grammatical structures of the language pair I have in mind (Vietnamese<->English), and therefore I believe that Apertium, as a rule-based MT system, can provide a system ()
 
 
== Which of the published tasks are you interested in? What do you plan to do? ==
 
I'm interested in '''Adopting an unreleased pair''', specifically the pair Vietnamese - English.
 
 
== Proposal ==
 
==== Adopting unreleased Vietnamese-English pair and bringing it to release quality ====
 
====== Why Google and Apertium should sponsor it ======
 
Although Vietnamese is spoken by around 90 million people in Vietnam, and is the 7th most spoken language in the US and the 6th in Australia, there is not much ()
 
 
==== Work plan ====
 
====== Post-application period ======
 
* Continue working on the coding challenge.
 
* Learn more about XML, CG
 
 
====== Bonding period ======
 
* Continue working on the coding challenge (if it hasn't been completed).
 
* Get familiar:
 
** Be on IRC often and get used to the community
 
** Read through the already available files for vi-en translation. Read through other available documents
 
** Learn more about XML, CG
 
* Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
 
* Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since English already has file support available
 
* Gather available online and offline dictionary resources
 
* Create rough drafts of monolingual dix
 
 
====== Week 1 (May 19-25) ======
 
* Continue rough versions of the two morphological dictionaries (around 3000 words in each, including all types)
 
* Start bilingual dictionary
 
 
====== Week 2 (May 26 - June 1) ======
 
* 3500 words in dix
 
* Start transfer rules
 
* Expand bilingual dictionary
 
* Document
 
 
====== Week 3 (June 2-8) ======
 
* Run testvoc
 
* Write transfer rules
 
* 4000 words in dix
 
* Expand bilingual dix
 
* (If possible) 400-word corpus test (coverage 60%)
 
 
====== Week 4 (June 9-15) ======
 
* 400-word corpus test (WER 30%, coverage 60-70%)
 
* Clean testvoc
 
* Add transfer rules
 
* 4800 words in dix
 
* Expand bilingual dix
 
* Document
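
The coverage and WER targets in this plan could be checked with a small script. The following is only a minimal sketch in Python, under simplifying assumptions: WER is taken as the word-level edit distance divided by the reference length, and coverage as the share of tokens found in a set of known word forms. The function names and the <code>known_forms</code> set are hypothetical; in practice, coverage would be measured from the analyser output of the pair rather than from a plain word list.

```python
from typing import Iterable, Set


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def coverage(tokens: Iterable[str], known_forms: Set[str]) -> float:
    """Naive coverage: fraction of tokens whose lowercased form is in known_forms (an assumption)."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("token list must not be empty")
    known = sum(1 for token in token_list if token.lower() in known_forms)
    return known / len(token_list)
```

For example, <code>word_error_rate("a b c d", "a c d")</code> counts one deleted word out of four reference words, and <code>coverage(["I", "like", "pho"], {"i", "like"})</code> counts two known tokens out of three.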
 
 
'''''Deliverable: 2 monolingual dix and bilingual dix'''''
 
 
====== Week 5 (June 16-22) ======
 
* Add more transfer rules
 
* Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.)
 
* 5300 words in dix
 
* Expand bilingual dix
 
 
====== Week 6 (June 23-29) ======
 
* Finish midterm evaluation
 
* Add lexical selection rules
 
* Add transfer rules
 
* Start CG
 
* Expand bilingual dix
 
* 5500 words in dix
 
* Run testvoc
 
* Document
 
 
====== Week 7 (June 30 - July 6) ======
 
* Clean testvoc
 
* Add transfer rules
 
* Add CG rules
 
* 6000 words in dix
 
 
====== Week 8 (July 7-13) ======
 
* 1000-word corpus test (coverage 70-80%, WER 30-40%)
 
* 6500 words in dix
 
* Add lexical selection rules
 
* Add CG rules
 
* Document
 
 
'''''Deliverable: More complete dix and rules'''''
 
 
====== Week 9 (July 14-20) ======
 
* Training of Vietnamese tagger
 
* 6725 words in dix
 
* 1000-word corpus test (coverage 70-80%, WER 30-40%, but improved over the previous result)
 
* Regression testing
 
 
====== Week 10 (July 21-27) ======
 
* Training of English tagger
 
* Add lexical selection rules
 
* Clean testvoc
 
* 7000 words in dix
 
* Document
 
 
====== Week 11 (July 28 - August 3) ======
 
* 8000 words in dix
 
* Add transfer rules
 
 
====== Week 12 (August 4-9) ======
 
* 1000-word corpus test (coverage 80-90%)
 
* 8500 words in dix
 
 
'''''Deliverable: MT coverage 80-90%'''''
 
 
====== Completion (August 10-22) ======
 
* Document, clean up, testing
 
* (If possible) 9000 words in dix
 

Latest revision as of 14:25, 28 April 2014