Difference between revisions of "User:Linh le/2014Application"

Revision as of 06:04, 21 March 2014

1 Contact information
2 Why is it that you are interested in learning about machine translation?
3 Why are you interested in Apertium project?
4 Which of the published tasks are you interested in? What do you plan to do?
5 Proposal
- 5.1 Adopting unreleased Vietnamese-English pair and bringing it to release quality
  - 5.1.1 Why Google and Apertium should sponsor it
- 5.2 Work plan

Contact information

Name: Linh Le
Email address: linh.ai.le@gmail.com
Nick on IRC#apertium: LinhLe
Sourceforge username:

Why is it that you are interested in learning about machine translation?

Born and raised in Vietnam, I have been learning English since I was a kid, as it has always been considered an essential tool for people of my generation. However, I also understand that learning a language takes not only time, effort and dedication but also a lot of money, and that not everyone has the opportunity and resources to do so. Therefore, in my opinion, machine translation is a useful alternative for people who wish to access resources in other languages for their careers, entertainment or (life exp).

Why are you interested in Apertium project?

First of all, since Apertium is an open-source project, everyone can contribute, and I think this model is well suited to language translation: language is created, developed, shaped and used by people, so it requires human involvement and human understanding. *edit* Second, although the two languages are not closely related, there are many similarities in the grammatical structures of the language pair I have in mind (Vietnamese<->English), and therefore I believe that Apertium's rule-based MT can provide a system ()

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in adopting an unreleased pair, specifically Vietnamese-English.

Proposal

Adopting unreleased Vietnamese-English pair and bringing it to release quality

Why Google and Apertium should sponsor it

Although Vietnamese is spoken by around 90 million people in Vietnam, and is the 7th most spoken language in the US and the 6th in Australia, there is not much ()

Work plan

Post-application period:

Continue working on the coding challenge.
Learn more about XML, CG

Bonding period:

Continue working on the coding challenge (if it hasn't been completed).
Get familiar:
- Be on IRC often and get used to the community
- Read through the already available files for vi-en translation. Read through other available documents
- Learn more about XML, CG
Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since English already has file support available
Gather available online and offline dictionary resources
Create rough drafts of monolingual dix

Week 1 (May 19-25)

Continue rough versions of two morphological dictionaries

(Around 3000 words in each, including all types)

Start bilingual dictionary

Week 2 (May 26-June 1)

3500 words in dix
Start transfer rules
Expand bilingual dictionary
Document

Week 3 (June 2-8)

Run testvoc
Write transfer rules
4000 words in dix
Expand bilingual dix
(If possible) 400-word corpus test (coverage 60%)
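
To make the coverage target concrete, here is a minimal sketch of how naive coverage could be measured on analyser output. It assumes the text has already been run through the morphological analyser and is in Apertium stream format, where an unknown word carries an analysis starting with `*`. The function name `naive_coverage`, the sample sentence and its tags are hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # The analyser marks an unknown word by prefixing its analysis with '*'.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: two known words and two unknown words.
sample = "^tôi/tôi<prn>$ ^xyz/*xyz$ ^học/học<vblex>$ ^abc/*abc$"
coverage = naive_coverage(sample)
print(f"coverage: {coverage:.0%}")
```

A coverage target of 60% would then mean that this ratio reaches at least 0.6 on the test corpus.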

Week 4 (June 9-15)

400-word corpus test (WER 30%, coverage 60-70%)
Clean testvoc
Add transfer rules
4800 words in dix
Expand bilingual dix
Document

Deliverable: 2 monolingual dix and bilingual dix
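
The WER figures in this plan refer to word error rate. As a rough illustration only, the sketch below computes it as the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical and this is not the project's own evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical post-edited reference and raw MT output.
reference = "I am learning English"
hypothesis = "I learn English"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

Lower is better: a WER of 30% means roughly three word edits are needed for every ten reference words.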

Week 5 (June 16-22)

Add more transfer rules
Start lexical selection rules (pay special attention to pronouns, determiners, indefinites, etc.)
5300 words in dix
Expand bilingual dix

Week 6 (June 23-29)

Finish midterm evaluation
Add lexical selection rules
Add transfer rules
Start CG
Expand bilingual dix
5500 words in dix
Run testvoc
Document

Week 7 (June 30-July 6)

Clean testvoc
Add transfer rules
Add CG rules
6000 words in dix

Week 8 (July 7-13)

1000-word corpus test (coverage 70-80%, WER 30-40%)
6500 words in dix
Add lexical selection rules
Add CG rules
Document

Deliverable: More complete dix and rules

Week 9 (July 14-20)

Training of Vietnamese tagger
6725 words in dix
1000-word corpus test (coverage 70-80%, WER 30-40%, improving on the previous result)
Regression testing

Week 10 (July 21-27)

Training of English tagger
Add lexical selection rules
Clean testvoc
7000 words in dix
Document

Week 11 (July 28-August 3)

8000 words in dix
Add transfer rules

Week 12 (August 4-9)

1000-word corpus test (coverage 80-90%)
8500 words in dix

Deliverable: MT coverage 80-90%

Completion (August 10-22)

Document, clean up, testing
(If possible) 9000 words in dix

@@ Line 20: / Line 20: @@
 Although Vietnamese is spoken by around 90 million people in Vietnam, and the 7th most spoken language in the US and the 6th in Australia, there is not much ()
-====== Work plan ======
+==== Work plan ====
 Post-application period:
 * Continue working on the coding challenge.
@@ Line 31: / Line 31: @@
 ** Read through the already available files for vi-en translation. Read through other available documents
 ** Learn more about XML, CG
-* Prepare rough corpora and/or get one that is already available.
+* Prepare rough corpora and/or get one that is already available (https://code.google.com/p/evbcorpus/).
-* Learn more about both English and Vietnamese, focusing more on Vietnamese, since English has already available file support.
+* Learn more about both English and Vietnamese morphology, focusing more on Vietnamese, since English has already available file support
 * Gather available online and offline dictionary resources
 * Create rough drafts of monolingual dix
@@ Line 40: / Line 40: @@
 (Around 3000 words in each, including all types)<br>
 * Start bilingual dictionary
 Week 2 (May 26- June 1)<br>
@@ Line 53: / Line 52: @@
 * 4000 words in dix
 * Expand bilingual dix
-* (If possible) 400-word evaluation (coverage 60%)
+* (If possible) 400-word corpus test (coverage 60%)
-* Document
 Week 4 (June 8-15)
@@ Line 71: / Line 69: @@
 * 5300 words in dix
 * Expand bilingual dix
-* Document
 Week 6 (June 23-29)
@@ Line 88: / Line 85: @@
 * Add CG rules
 * 6000 words in dix
-* Document
 Week 8 (July 7-13)
 * 1000-word corpus test (coverage 70-80%, WER 30-40%)
 * 6500 words in dix
-* Add transfer rules
+* Add lexical selection rules
 * Add CG rules
+* Document
 Deliverable: More complete dix and rules
@@ Line 106: / Line 103: @@
 Week 10 (July 21-27)
 * Training of English tagger
-* Add lexical rules
+* Add lexical selection rules
 * Clean testvoc
 * 7000 words in dix
+* Document
 Week 11 (July 28-August 3)
@@ Line 120: / Line 118: @@
 Deliverable: MT coverage 80-90%
 Completion (August 10-22)
-Document, clean up, testing
+* Document, clean up, testing
+* (If possible) 9000 words in dix
