User:Sphinx/GSoC 2013 Application: "Chinese(simple)-Chinese(traditional) language pair"
Contact information
Name: Yishan Jiang
Email: yishanj13@gmail.com
IRC: sphinx
Github repo: https://github.com/sphinx-jiang
Tasks and proposed ideas
Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)
Related tasks:
- Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
- A Constraint Grammar will be written if necessary.
Proposed idea
To create a translator with
- Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
- Bilingual dictionary: apertium-zh_CN-zh_TW.dix
- Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x
which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
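The coverage target above could be checked with a small script along these lines. This is only a minimal sketch: it assumes the morphological analyser output is in Apertium stream format, where an unknown word appears as `^word/*word$`, and it ignores escaped reserved characters. The function name `naive_coverage` and the sample tags are illustrative, not part of the proposal.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped characters such as \/ or \$ are not handled)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one analysis.

    Assumption: an unknown word is output as ^word/*word$, i.e. its first
    analysis starts with an asterisk.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # parts[0] is the surface form; parts[1:] are the analyses.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: three known words and one unknown word.
sample = "^我/我<prn>$ ^爱/爱<vblex>$ ^饕餮/*饕餮$ ^书/书<n>$"
print(f"{naive_coverage(sample):.2f}")
```

Running the same count over several free corpora would give the per-corpus figures that the 80% goal refers to.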
Work plan
Task | Coding hours | TDD hours |
---|---|---|
Apertium stream format tool | 30-40 | 20-30 |
Closed-class words | 60-70 | 40-50 |
Open-class words | 40-50 | 20-30 |
Transfer rules (basic) | 40-50 | 20-30 |
Transfer rules (supplementary) | 30-40 | 20-30 |
CG and ambiguity reduction | 50-60 | 30-40 |
The work is divided into 3 periods of 4 weeks each. The workload for the period covering the 25th to 28th weeks of the year is 105h; the others have 140h.
TDD = Test/Debugging/Documentation
GSoC week | Week of the year | Tasks |
---|---|---|
1 | 25th week | Basic function to convert a lookup table to Apertium stream format 50h, TDD 10h |
2 | 26th week | Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h |
3 | 27th week | Create open-class <sdefs> 30h, add basic words 10h, TDD 20h |
4 | 28th week | Complete the monodix 20/30h, verify the tag definition files 15/40h |
First Deliverable | ||
5 | 29th week | Bilingual dictionary completion 40h, TDD 20/40h |
6 | 30th week | Basic transfer rules 30/40h, TDD 20/30h |
7 | 31st week | Constraint Grammar of existing 30/40h |
8 | 32nd week | Stabilize the language pair by regression testing (as a new language pair) |
Second Deliverable | ||
9 | 33rd week | Add separate character transfer 50h (many papers use this) 50/70h |
10 | 34th week | Focus on reducing ambiguity 20/40h, CG 30/40h |
11 | 35th week | Performance measurement |
12 | 36th week | Performance measurement |
Finalization |
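Week 1 of the schedule mentions converting a lookup table to Apertium stream format. One possible reading of that task is sketched below, assuming the table maps a Simplified Chinese form to a Traditional Chinese form plus a list of tags, and assuming the target is the bilingual lookup shape `^source<tags>/target<tags>$`. The table contents, the tag names, the function names and the escaped character subset are all hypothetical and chosen for illustration only.

```python
# Subset of characters treated as reserved in this sketch (assumption);
# each one is escaped with a backslash.
RESERVED = "\\^$/<>"


def escape(text: str) -> str:
    """Backslash-escape the reserved characters listed in RESERVED."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def to_biltrans_unit(source: str, target: str, tags: list) -> str:
    """Format one table entry as ^source<tags>/target<tags>$."""
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return f"^{escape(source)}{tag_string}/{escape(target)}{tag_string}$"


# Hypothetical lookup table: simplified form -> (traditional form, tags)
LOOKUP = {
    "书": ("書", ["n"]),
    "爱": ("愛", ["vblex"]),
}


def convert(table: dict) -> list:
    """Convert every entry of the lookup table, keeping insertion order."""
    return [to_biltrans_unit(simp, trad, tags)
            for simp, (trad, tags) in table.items()]


for unit in convert(LOOKUP):
    print(unit)
```

Keeping the escaping in its own helper means the same function could be reused if the table is later written out as dictionary entries instead of stream-format units.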
Coding challenge
- Install Apertium (see Minimal installation from SVN)
- Go through the HOWTO
- Go through the MT course here (or here)
- Write a translator that translates as much of this story as possible (minimum one sentence). (Other translations of the story are here.)
If there is no translation, translate it into the languages of your language pair first.
- Upload your work to Apertium SVN.
Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair
Here is the automorf.bin output for sentence one.
Here is the translation output for sentence one.