User:Sphinx/GSoC 2013 Application: "Chinese(simple)-Chinese(traditional) language pair"
Contents
- 1 Contact information
- 2 Why are you interested in machine translation?
- 3 Why are you interested in the Apertium project?
- 4 Why Google and Apertium should sponsor it?
- 5 How and who it will benefit in society?
- 6 Which of the published tasks are you interested in? What do you plan to do?
- 7 Tasks and proposed ideas
Contact information
Name: Yishan Jiang
Email: yishanj13@gmail.com
IRC: sphinx
GitHub repo: https://github.com/sphinx-jiang
Why are you interested in machine translation?
Before I knew anything about computer science, machine translation seemed like magic, a way to undo the story of "The Tower of Babel". Once I started studying computer science at college, I realized that it is real and within my reach if I work hard at it. I am taking the Machine Translation class at Peking University, and I really want to build translators that actually work.
Why are you interested in the Apertium project?
Why Google and Apertium should sponsor it?
How and who it will benefit in society?
Which of the published tasks are you interested in? What do you plan to do?
Tasks and proposed ideas
Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)
Related tasks:
- Writing linguistic data, including morphological rules and transfer rules — which are specified in a declarative language.
- A Constraint Grammar will be written if necessary.
Proposed idea
To create a translator with
- Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
- Bilingual dictionary: apertium-zh_CN-zh_TW.dix
- Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x
which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
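The coverage target can be checked roughly by counting how many words the morphological analyser recognises. The sketch below assumes the corpus has already been run through the analyser and is in Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word has its analysis prefixed with `*`. The function name, the sample sentence, and its tags are illustrative only, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown word by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: two known words and one unknown word.
sample = "^我/我<prn>$ ^爱/爱<vblex>$ ^饕餮/*饕餮$"
print(f"{naive_coverage(sample):.0%}")
```

A real measurement would run this over the analysed output of each corpus and report the result per corpus.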
Work plan
Task | Coding hours | TDD hours |
---|---|---|
Apertium stream format tool | 30-40 | 20-30 |
Closed-class words | 60-70 | 40-50 |
Open-class words | 40-50 | 20-30 |
Transfer rules (basic) | 40-50 | 20-30 |
Transfer rules (supplementary) | 30-40 | 20-30 |
CG and ambiguity reduction | 50-60 | 30-40 |
There are three periods of four weeks each. The workload for the first period (the 25th to 28th weeks of the year) is 105h, and for each of the other periods it is 140h.
TDD = Test/Debugging/Documentation
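The hour ranges in the work plan table can be totalled with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are shorthand labels invented here, and the numbers are copied from the table as (low, high) ranges.

```python
# (coding hours, TDD hours) per task, each as a (low, high) range from the work plan.
WORK_PLAN = {
    "stream_format_tool": ((30, 40), (20, 30)),
    "closed_class_words": ((60, 70), (40, 50)),
    "open_class_words": ((40, 50), (20, 30)),
    "transfer_basic": ((40, 50), (20, 30)),
    "transfer_supplementary": ((30, 40), (20, 30)),
    "cg_and_ambiguity": ((50, 60), (30, 40)),
}


def total(ranges):
    """Sum a list of (low, high) ranges into one (low, high) range."""
    return sum(low for low, _ in ranges), sum(high for _, high in ranges)


coding = total([c for c, _ in WORK_PLAN.values()])
tdd = total([t for _, t in WORK_PLAN.values()])
print("coding hours:", coding)
print("TDD hours:", tdd)
```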
GSoC week | Week of the year | Tasks |
---|---|---|
1 | 25th week | Basic function: convert lookup table to Apertium stream format 50h, TDD 10h |
2 | 26th week | Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h |
3 | 27th week | Create open-class <sdefs> 30h, add basic words 10h, TDD 20h |
4 | 28th week | Complete monodix 20/30h, verify the tag definition files 15/40h |
First deliverable | | |
5 | 29th week | Complete bilingual dictionary 40h, TDD 20/40h |
6 | 30th week | Basic transfer rules 30/40h, TDD 20/30h |
7 | 31st week | Constraint Grammar of existing 30/40h |
8 | 32nd week | Stabilize the language pair by regression testing (as a new language pair) |
Second deliverable | | |
9 | 33rd week | Add separate character transfer 50h (many papers use this) 50/70h |
10 | 34th week | Focus on reducing ambiguity 20/40h, CG 30/40h |
11 | 35th week | Performance measurement |
12 | 36th week | Performance measurement |
Finalization | | |
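Much of the bilingual dictionary work in the schedule could be bootstrapped from a word list, assuming a Simplified-to-Traditional mapping table is available. The sketch below shows one way such a table might be turned into bilingual dictionary entries in the usual `<e><p><l>…</l><r>…</r></p></e>` form. The helper name, the word pairs, and the `n` tag are illustrative assumptions, not data from the project, and lemmas containing spaces are not handled.

```python
from xml.sax.saxutils import escape


def bidix_entry(simplified: str, traditional: str, tag: str) -> str:
    """Build one bilingual dictionary entry for a word pair sharing one tag."""
    left = f'{escape(simplified)}<s n="{tag}"/>'
    right = f'{escape(traditional)}<s n="{tag}"/>'
    return f"<e><p><l>{left}</l><r>{right}</r></p></e>"


# Hypothetical rows of a lookup table: (Simplified, Traditional, part-of-speech tag).
word_table = [
    ("计算机", "電腦", "n"),
    ("软件", "軟體", "n"),
]

for simplified, traditional, tag in word_table:
    print(bidix_entry(simplified, traditional, tag))
```

The generated lines would still need to be placed inside a `<section>` of the dictionary file, with each tag declared in its `<sdefs>`.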
Coding challenge
- Install Apertium (see Minimal installation from SVN)
- Go through the HOWTO
- Go through the MT course here (or here)
- Write a translator that translates as much of this story as possible, at minimum one sentence. (Other translations of the story are here.)
If there is no translation, translate it into the languages of your language pair first.
- Upload your work to Apertium SVN.
Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair
Here is the automorf.bin output of sentence one.
Here is the translation output of sentence one.