User:Sphinx/Application for "Adopt a language pair" GSOC 2013
Contact information
Name: Yishan Jiang
Email: yishanj13@gmail.com
IRC: sphinx
Github repo: https://github.com/sphinx-jiang
Why are you interested in machine translation?
When I knew nothing about computer science, machine translation seemed like magic that could undo the story of "The Tower of Babel". When I started studying computer science in college, I realized that it is real and within my reach if I work at it. I am taking the Machine Translation class at Peking University, and I really want to build translators that actually work.
Why are you interested in the Apertium project?
This year I decided to take part in GSoC. While browsing the projects, I searched for the keyword "nlp" and found Apertium. I went through the wiki pages, especially the "HOWTO" page, which showed me that writing a translator is not as hard as I had imagined. When I had questions, I asked on IRC, and people were always patient in answering them. That made me feel that what I was doing was worthwhile, and I would like to contribute to this project. The MT course gave me my first look at rule-based methodology. As a beginner, I prefer it because it is a way to understand the nature of a language better than corpus-based methods, which are more statistical. Adopting a language pair also gives me a chance to contribute at an entry level.
Why should Google and Apertium sponsor it?
Chinese-related languages (Simplified Chinese, Traditional Chinese, Hong Kong Chinese, and so on) are like lonely islands in the sea, cut off from other language systems. Part of the reason is that some papers insist that Chinese has no or only weak morphology, so lexical analysis is of no use in translation and a dictionary alone is enough. I will instead try to analyse Chinese the way existing pairs do, especially those involving English, so that the data is flexible enough to be reused in other pairs such as Chinese-English or Chinese-Spanish. That is the first step towards bringing Chinese into the group of languages that can be translated.
How and whom will it benefit in society?
Tasks and proposed ideas
Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)
Related tasks:
- Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
- A Constraint Grammar will be written if necessary.
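As a rough illustration of what "declarative" linguistic data looks like, the sketch below builds one bilingual dictionary entry in the usual Apertium `<e><p><l>…</l><r>…</r></p></e>` shape. The helper name `bidix_entry`, the word pair (软件 / 軟體, "software") and the `n` (noun) tag are assumptions chosen for illustration; they are not taken from the actual dictionaries of this pair.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, tags):
    """Build one <e> element pairing a zh_CN lemma with a zh_TW lemma.

    `tags` is a list of symbol names (for example ["n"]) attached to both sides.
    This helper is hypothetical; it only mirrors the common bidix entry layout.
    """
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    for side, lemma in (("l", left), ("r", right)):
        node = ET.SubElement(p, side)
        node.text = lemma
        for tag in tags:
            # Each tag becomes an empty <s n="..."/> element after the lemma.
            ET.SubElement(node, "s", {"n": tag})
    return e


# Illustrative pair: "software" in mainland usage vs. Taiwan usage.
entry = bidix_entry("软件", "軟體", ["n"])
print(ET.tostring(entry, encoding="unicode"))
```

Generating entries from a word list in this way could speed up the dictionary work, but every generated entry would still need to be checked by hand.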
Proposed idea
To create a translator with
- Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
- Bilingual dictionary: apertium-zh_CN-zh_TW.dix
- Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.
The translator should be testvoc clean and have a coverage of around 80% or more on a range of free corpora.
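The coverage target can be checked with a small script. The following is a minimal sketch, assuming coverage is measured naively as the share of lexical units in the analyser output that have at least one known analysis. In the Apertium stream format a lexical unit is written `^surface/analysis$`, and an unknown word has an analysis starting with `*`. The function name `coverage` and the sample sentence with its tags are made up for illustration, and escaped characters such as `\/` are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text):
    """Return the fraction of lexical units that the analyser recognised."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # parts[0] is the surface form; an unknown word has "*surface" as analysis.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: two known words and one unknown word.
sample = "^我/我<prn>$^喜欢/喜欢<vblex>$^翻译/*翻译$"
print(coverage(sample))
```

Running this over the analysed form of each corpus would give one figure per corpus to compare against the 80% goal.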
Work plan
Task | Coding hours | TDD hours |
---|---|---|
Apertium stream format tool | 30-40 | 20-30 |
Closed-class words | 60-70 | 40-50 |
Open-class words | 40-50 | 20-30 |
Transfer rules (basic) | 40-50 | 20-30 |
Transfer rules (supplementary) | 30-40 | 20-30 |
CG and ambiguity reduction | 50-60 | 30-40 |
There are 3 periods of 4 weeks each. The workload for the 25th to 28th weeks of the year is 105h, and 140h for each of the others.
TDD = Test/Debugging/Documentation
GSoC week | Week of the year | Tasks |
---|---|---|
1 | 25th week | Basic function to convert a lookup table to the Apertium stream format 50h, TDD 10h |
2 | 26th week | Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h |
3 | 27th week | Create open-class <sdefs> 30h, add basic words 10h, TDD 20h |
4 | 28th week | Complete the monodix 20/30h, verify the tag definition files 15/40h |
First Deliverable | ||
5 | 29th week | Complete the bilingual dictionary 40h, TDD 20/40h |
6 | 30th week | Basic transfer rules 30/40h, TDD 20/30h |
7 | 31st week | Constraint Grammar of existing 30/40h |
8 | 32nd week | Stabilize the language pair by regression testing (as a new language pair) |
Second Deliverable | ||
9 | 33rd week | Add separate character transfer 50h (many papers use this approach), 50/70h |
10 | 34th week | Focus on reducing ambiguity 20/40h, CG 30/40h |
11 | 35th week | Performance measure |
12 | 36th week | Performance measure |
Finalization | ||
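One possible reading of the week 1 task (turning a lookup table into Apertium stream format) is sketched below. It assumes the input is already split into tokens and that the table maps each surface form to a list of analyses. The function name `to_stream`, the table contents and the tags are hypothetical. Unknown tokens are marked with `*`, as in analyser output.

```python
def to_stream(tokens, table):
    """Render tokens as Apertium stream format lexical units.

    `table` maps a surface form to a list of analysis strings such as "好<adj>".
    Tokens missing from the table are emitted as unknown words (^token/*token$).
    """
    units = []
    for token in tokens:
        analyses = table.get(token)
        if analyses:
            units.append("^" + token + "/" + "/".join(analyses) + "$")
        else:
            units.append("^" + token + "/*" + token + "$")
    return "".join(units)


# Hypothetical lookup table with one unambiguous and one ambiguous word.
table = {"软件": ["软件<n>"], "好": ["好<adj>", "好<adv>"]}
print(to_stream(["软件", "好", "吗"], table))
# prints ^软件/软件<n>$^好/好<adj>/好<adv>$^吗/*吗$
```

A real tool would also need to decide how to segment running Chinese text into tokens, which this sketch leaves out.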
Coding challenge
- Install Apertium (see Minimal installation from SVN)
- Go through the HOWTO
- Go through the MT course here (or here)
- Write a translator that translates as much of this story as possible; minimum one sentence. (Other translations of the story are here.)
If there is no translation, translate it into the languages of your language pair first.
- Upload your work to Apertium SVN.
Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair
Here is the automorf.bin output of sentence one.
Here is the translate output of sentence one.