Difference between revisions of "User:Sphinx/GSoC 2013 Application: "Chinese(simple)-Chinese(traditional) language pair""

Revision as of 07:10, 28 April 2013

1 Contact information
2 Why are you interested in machine translation?
3 Why are you interested in the Apertium project?
4 Why Google and Apertium should sponsor it?
5 How and who it will benefit in society?
6 Tasks and proposed ideas
- 6.1 Adopt a language pair: zh_CN-zh_TW(Chinese-simple to Chinese-traditional)
  - 6.1.1 Proposed idea
7 Work plan
8 Coding challenge
9 List your skills and give evidence of your qualifications
10 My non-Summer-of-Code plans for the Summer

Contact information

Name: Yishan Jiang

Email: yishanj13@gmail.com

IRC: sphinx

Github repo: https://github.com/sphinx-jiang

Why are you interested in machine translation?

When I knew nothing about computer science, machine translation seemed like magic that could undo the story of the Tower of Babel. When I started studying computer science at college, I realized that it is real and within my reach if I work for it. I am taking the Machine Translation course at Peking University, and I really want to build translators that actually work.

Why are you interested in the Apertium project?

This year I decided to take part in GSoC. While looking through the projects I searched for the word "nlp" and found Apertium. I went through the wiki pages, especially the "HOWTO" page, which showed me that writing a translator is not as hard as I had imagined. When I had questions I asked on IRC, and people there were always patient in answering them, so I felt that my effort was worthwhile and that I would like to contribute to this project. The MT course gave me my first view of rule-based methodology. As a beginner I prefer it, because it is a way to understand the nature of a language better than corpus-based methods, which are more statistical. Adopting a language pair also gives me a chance to contribute at an entry level.

Why Google and Apertium should sponsor it?

Chinese-related languages (Chinese-simple, Chinese-traditional, Chinese-hongkong, and so on) are like lonely islands in the sea, away from other language systems. Part of the reason is that some papers insist that Chinese has no or only weak morphology, so lexical analysis is of no use in translation and a dictionary alone is enough. I will instead try to analyse Chinese in the way other existing pairs do, especially those with English, so that the data can be reused flexibly in other pairs such as Chinese-English and Chinese-Spanish. That is the first step towards bringing Chinese into the group of translated languages.

How and who it will benefit in society?

Chinese is the language spoken by the largest population. Building the Chinese(simple)-Chinese(traditional) pair will help communication between people in Taiwan and the mainland. It will also help people who use Chinese(simple) learn more about ancient Chinese, which is written much like Chinese(traditional); that is why it is called traditional. Analysing Chinese by rules is more than translating a single pair: it will bring Chinese into an open-source rule-based translation framework. The potential benefit will be wide.

Tasks and proposed ideas

Adopt a language pair: zh_CN-zh_TW(Chinese-simple to Chinese-traditional)

Related tasks:

Writing linguistic data, including morphological rules and transfer rules — which are specified in a declarative language.
A Constraint Grammar will be written if necessary.

Proposed idea

To create a translator with

Morphological dictionaries for language Chinese-simple, Chinese-traditional: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
Bilingual dictionary: apertium-zh_CN-zh_TW.dix
Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.

which is testvoc clean, and has a coverage of around 80% or more on a range of free corpora.
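As a rough illustration of how the coverage target could be checked, the sketch below counts how many lexical units in analyser output received at least one analysis. It relies on the Apertium stream format convention that an unknown word is emitted as `^word/*word$`. The function name `naive_coverage` and the sample tags are my own assumptions for illustration, and escaped characters inside lexical units are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format looks like ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \$ are not handled here.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # Unknown words are marked with a leading asterisk: ^word/*word$
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


if __name__ == "__main__":
    # Hypothetical analyser output; the <prn> tag is only illustrative.
    sample = "^我/我<prn>$ ^愛/*愛$ ^你/你<prn>$ ^們/*們$"
    print(f"coverage: {naive_coverage(sample):.0%}")
```

In practice the input would be the analyser output for a whole corpus, so the number reflects the share of tokens the monolingual dictionary knows, not translation quality.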

Work plan

Task	Coding hours	TDD hours
Apertium stream format tool	30-40	20-30
Closed-class words	60-70	40-50
Open-class words	40-50	20-30
Transfer rules (basic)	40-50	20-30
Transfer rules (supplementary)	30-40	20-30
CG and ambiguity reduction	50-60	30-40

There are 3 periods of 4 weeks each. The workload for the 25th to 28th weeks of the year is 105h, and for the others 140h.

TDD = Test/Debugging/Documentation

gsoc week	week of the year	tasks
1	25th week	Basic function: convert the lookup table to Apertium stream format 50h, TDD 10h
2	26th week	Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h
3	27th week	Create open-class <sdefs> 30h, add basic words 10h, TDD 20h
4	28th week	Complete the monodix 20/30h, verify the tag definition files 15/40h
First Deliverable
5	29th week	Complete the bilingual dictionary 40h, TDD 20/40h
6	30th week	Basic transfer rules 30/40h, TDD 20/30h
7	31st week	Constraint Grammar of existing 30/40h
8	32nd week	Stabilize the language pair by regression testing (as a new language pair)
Second Deliverable
9	33rd week	Add separate character transfer 50h (many papers use this) 50/70h
10	34th week	Focus on reducing ambiguity 20/40h, CG 30/40h
11	35th week	Performance measurement
12	36th week	Performance measurement
Finalization
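As a rough sketch of the week 1 task, one possible reading is a small tool that turns a simplified-to-traditional character lookup table into lexical units in Apertium stream format (`^surface/analysis1/analysis2$`, with `*` marking unknown input). The table contents, the name `to_stream_format`, and the use of bare traditional characters as "analyses" are assumptions for illustration only; real analyser output would carry lemmas with tags, and reserved characters would need escaping.

```python
# Hypothetical lookup table: simplified character -> traditional candidates.
# Some characters are ambiguous, e.g. 发 can be 發 (to send) or 髮 (hair).
LOOKUP = {
    "国": ["國"],
    "发": ["發", "髮"],
    "后": ["後", "后"],
}


def to_stream_format(text: str, table: dict) -> str:
    """Wrap each character of text as a stream-format lexical unit."""
    units = []
    for ch in text:
        if ch.isspace():
            # Whitespace is passed through unchanged between lexical units.
            units.append(ch)
            continue
        candidates = table.get(ch)
        if candidates:
            # Known character: list every candidate, separated by slashes.
            units.append("^" + ch + "/" + "/".join(candidates) + "$")
        else:
            # Unknown character: follow the ^word/*word$ convention.
            units.append("^" + ch + "/*" + ch + "$")
    return "".join(units)


if __name__ == "__main__":
    print(to_stream_format("国发", LOOKUP))
```

Keeping every candidate in the output leaves the choice between them to later stages such as Constraint Grammar, which is where the ambiguity-reduction work in the plan would apply.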

Coding challenge

Install Apertium (see Minimal installation from SVN)
Go through the HOWTO
Go through the MT course here (or here)
Write a translator that translates as much of this story as possible (minimum one sentence). (Other translations of the story here.)

If there is no translation, translate it into the languages of your language pair first.

Upload your work to Apertium SVN.

Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair

Here is the automorf.bin output of sentence one.

Here is the translation output of sentence one.

List your skills and give evidence of your qualifications

XML

I have submitted a first version of my morphological dictionary and bilingual dictionary, and I have basic knowledge of the Apertium stream format.

a scripting language (Python, Perl)

I usually use the bash shell, and my Python is also good enough to handle the project.

good knowledge of the language pair adopted

I went through the HOWTO page and the MT course, and I am rewriting the Chinese(simple)-Chinese(traditional) pair in the Incubator because I think treating many words as noun or vblex is wrong.

Other skills

I have read the en-eo language pair and have a passion for learning new things. I am quite comfortable programming in C, Java and C++. I also worked in the natural language processing lab at Lenovo, where I gained experience with language processing algorithms.

My non-Summer-of-Code plans for the Summer

My plan is to focus on my coding skills and learn more about NLP.


