User:Sphinx/Application for "Adopt a language pair" GSOC 2013
Contents
- 1 Contact information
- 2 Why are you interested in machine translation?
- 3 Why are you interested in the Apertium project?
- 4 Why Google and Apertium should sponsor it?
- 5 How and who it will benefit in society?
- 6 Tasks and proposed ideas
- 7 List your skills and give evidence of your qualifications
- 8 My non-Summer-of-Code plans for the Summer
Contact information
Name: Yishan Jiang
Email: yishanj13@gmail.com
IRC: sphinx
GitHub repo: https://github.com/sphinx-jiang/apertium_language-pair
Why are you interested in machine translation?
When I knew nothing about computer science, machine translation seemed like magic, an answer to the story of the Tower of Babel. Once I started studying computer science at college, I realized that it is real and within reach of my own effort. I am now taking the Machine Translation course at Peking University, and I want to build translators that really work.
Why are you interested in the Apertium project?
This year I decided to take part in GSoC. While looking through the projects I searched for the word "nlp" and found Apertium. I went through the wiki pages, especially the "HOWTO" page, which showed me that writing a translator is not as hard as I had imagined. When I had questions I asked on IRC, and people were always patient in answering them; that made me feel my work was worthwhile and that I would like to contribute to this project. The MT course gave me my first view of rule-based methodology. As a beginner I prefer it, because it is a way to understand the nature of a language better than the corpus-based approach, which is more statistical. Adopting a language pair also gives me a chance to contribute at an entry level.
Why Google and Apertium should sponsor it?
Chinese-related languages (Simplified Chinese, Traditional Chinese, Hong Kong Chinese, and so on) are like lonely islands in the sea, cut off from other language systems. Part of the reason is that some papers insist that Chinese has little or no morphology, so lexical analysis is of no use in translation and a dictionary alone is enough. I will nevertheless try to analyse Chinese in the same way as the existing pairs, especially English, so that the data can be reused in other pairs such as Chinese-English and Chinese-Spanish. That is the first step towards bringing Chinese into the group of translated languages.
How and who it will benefit in society?
Chinese is the language with the largest number of speakers. Building the Simplified Chinese-Traditional Chinese pair will help communication between people in Taiwan and mainland China. It will also help people who use Simplified Chinese learn more about ancient Chinese, which is written much like Traditional Chinese; that is why it is called traditional. Analysing Chinese by rules is more than translating a single pair: it brings Chinese into an open-source rule-based translation framework. The potential benefit is wide.
Tasks and proposed ideas
Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)
Related tasks:
- Writing linguistic data, including morphological rules and transfer rules — which are specified in a declarative language.
- A Constraint Grammar will be written if necessary.
Proposed idea
To create a translator with
- Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
- Bilingual dictionary: apertium-zh_CN-zh_TW.dix
- Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.
which is testvoc clean, and has a coverage of around 80% or more on a range of free corpora.
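As a rough illustration of how the coverage target could be checked, the sketch below computes naive coverage from analyser output in Apertium stream format, where a lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis. The function name `naive_coverage`, the sample sentence, and the tags in it are hypothetical, and escaped `^`, `$` and `/` characters are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped reserved characters are not handled.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one analysis.

    Assumes unknown words are marked by an analysis starting with '*',
    as in ^foo/*foo$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # parts[0] is the surface form; the remaining parts are analyses.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: two known words and one unknown word.
sample = "^我/我<prn>$ ^喜欢/喜欢<vblex>$ ^苹果/*苹果$"
print(f"coverage: {naive_coverage(sample):.2%}")
```

In practice the input would be the analyser's output over a whole corpus rather than a single sentence.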
Work plan
Task | Coding hours | TDD hours |
---|---|---|
Apertium stream format tool | 30-40 | 20-30 |
Closed-class words | 60-70 | 40-50 |
Open-class words | 40-50 | 20-30 |
Transfer rules (basic) | 40-50 | 20-30 |
Transfer rules (supplementary) | 30-40 | 20-30 |
CG and ambiguity reduction | 50-60 | 30-40 |
There are 3 periods of 4 weeks each. The workload for the period covering the 25th to 28th weeks of the year is 105h, and 140h for each of the others.
TDD = Test/Debugging/Documentation
GSoC week | Week of the year | Tasks |
---|---|---|
1 | 25th week | Basic function to convert a lookup table to Apertium stream format 50h, TDD 10h |
2 | 26th week | Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h |
3 | 27th week | Create open-class <sdefs> 30h, add basic words 10h, TDD 20h |
4 | 28th week | Complete the monodix 20/30h, verify the tag definition files 15/40h |
First Deliverable | ||
5 | 29th week | Complete the bilingual dictionary 40h, TDD 20/40h |
6 | 30th week | Basic transfer rules 30/40h, TDD 20/30h |
7 | 31st week | Constraint Grammar of existing 30/40h |
8 | 32nd week | Stabilize the language pair by regression testing (as a new language pair) |
Second Deliverable | ||
9 | 33rd week | Add separate character transfer 50h (many papers use this), 50/70h |
10 | 34th week | Focus on reducing ambiguity 20/40h, CG 30/40h |
11 | 35th week | Performance measurement |
12 | 36th week | Performance measurement |
Finalization | ||
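The first task in the schedule above, turning a lookup table into Apertium stream format, might look roughly like the following sketch. It is only an assumption about the shape of such a tool: the function name `to_stream_format`, the layout of the lookup table, and the example entry are all hypothetical, the input is assumed to be already segmented into words, and escaping of reserved characters is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical table shape: surface form -> list of (lemma, [tags]) analyses.
Lookup = Dict[str, List[Tuple[str, List[str]]]]


def to_stream_format(tokens: List[str], lookup: Lookup) -> str:
    """Render pre-segmented tokens as Apertium stream-format lexical units."""
    units = []
    for surface in tokens:
        analyses = lookup.get(surface)
        if analyses:
            # Each analysis is rendered as lemma<tag1><tag2>...
            rendered = "/".join(
                lemma + "".join(f"<{tag}>" for tag in tags)
                for lemma, tags in analyses
            )
        else:
            # Unknown words are marked with a leading '*'.
            rendered = "*" + surface
        units.append(f"^{surface}/{rendered}$")
    # Chinese text has no spaces between words, so units are joined directly.
    return "".join(units)


# Hypothetical one-entry table; the second token is deliberately missing.
lookup = {"苹果": [("苹果", ["n"])]}
print(to_stream_format(["苹果", "好"], lookup))
```

Keeping the table as a plain dictionary makes it easy to load from a tab-separated file later; that file format is not specified here.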
Coding challenge
- Install Apertium (see Minimal installation from SVN)
- Go through the HOWTO
- Go through the MT course here (or here)
- Write a translator that translates as much of this story as possible, minimum one sentence. (Other translations of the story are here.)
If there is no translation, translate it into the languages of your language pair first.
- Upload your work to Apertium SVN.
Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair
Here is the automorf.bin output of sentence one.
Here is the translation output of sentence one.
List your skills and give evidence of your qualifications
- XML
I have submitted the beginnings of a morphological dictionary and a bilingual dictionary, and I have a basic knowledge of the Apertium stream format.
- a scripting language (Python, Perl)
I usually use the bash shell, and my Python is good enough to handle the project.
- good knowledge of the language pair adopted
I went through the HOWTO page and the MT course, and I am rewriting the Simplified Chinese-Traditional Chinese pair in the Incubator, because I think treating so many words as noun or vblex is wrong.
- Other skills
I have read the en-eo language pair, and I have a passion for learning new things. I am quite comfortable programming in C, Java and C++. I have also worked in the natural language processing lab at Lenovo and have experience with language processing algorithms.
My non-Summer-of-Code plans for the Summer
My plan is to focus on improving my coding skills and learning more about NLP.