User:Sphinx/Application for "Adopt a language pair" GSOC 2013


Contact information

Name: Yishan Jiang

Email: yishanj13@gmail.com

IRC: sphinx

Github repo: https://github.com/sphinx-jiang

Why are you interested in machine translation?

Before I knew anything about computer science, machine translation seemed like magic that could undo the story of the Tower of Babel. After I started studying computer science in college, I realized that it is real and within my reach if I work at it. I am taking the Machine Translation course at Peking University, and I really want to build translators that actually work.

Why are you interested in the Apertium project?

This year I decided to take part in GSoC. While browsing the projects I searched for "nlp" and found Apertium. I went through the wiki pages, especially the "HOWTO" page, which showed me that writing a translator is not as hard as I had imagined. When I had questions and asked them on IRC, people were always patient in answering, so I felt that what I was doing was worthwhile, and I would like to contribute to this project. The MT course gave me my first view of rule-based methodology. As a beginner I prefer it, because it is a way to understand the nature of a language better than corpus-based methods, which are more statistical. Adopting a language pair also gives me a chance to contribute at an entry level.

Why should Google and Apertium sponsor it?

The Chinese language varieties (Simplified Chinese, Traditional Chinese, Hong Kong Chinese, and so on) are like lonely islands in the sea, cut off from other language systems. Part of the reason is that some papers insist that Chinese has no or only weak morphology, so lexical analysis is of no use in translation and a dictionary alone is enough. I will instead try to analyse it in the way existing pairs do, especially those involving English, so that the data can be reused more flexibly in other pairs such as Chinese-English and Chinese-Spanish. This is the first step towards bringing Chinese into the group of translated languages.

How and whom will it benefit in society?

Tasks and proposed ideas

Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)

Related tasks:

  • Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
  • A Constraint Grammar will be written if necessary.

Proposed idea

To create a translator with

  • Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
  • Bilingual dictionary: apertium-zh_CN-zh_TW.dix
  • Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.

The translator should be testvoc clean and have a coverage of around 80% or more on a range of free corpora.
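
The coverage target above can be checked with a naive count over analyser output. The sketch below assumes the text has already been run through the morphological analyser and that unknown words appear in Apertium stream format with an analysis that starts with `*`, as in `^foo/*foo$`. The function name `naive_coverage` and the sample string are made up for illustration and are not taken from the project; escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis beginning with "*" marks a word the analyser did not know.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output (hypothetical, not produced by the real dictionaries).
sample = "^我/我<prn>$ ^喜欢/喜欢<vblex>$ ^火星文/*火星文$"
print(f"{naive_coverage(sample):.2%}")
```

Counting per lexical unit rather than per character keeps the figure comparable with how the analyser segments the text, which matters for Chinese because words are not separated by spaces.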

Work plan

Task                          Coding hours   TDD hours
Apertium stream format tool   30-40          20-30
Closed-class words            60-70          40-50
Open-class words              40-50          20-30
Transfer rules (basic)        40-50          20-30
Transfer rules (supply)       30-40          20-30
CG and reducing ambiguity     50-60          30-40

The work is split into 3 periods of 4 weeks each. The workload for the period covering the 25th to 28th weeks of the year is 105h, and the others have 140h.

TDD = Test/Debugging/Documentation

GSoC week   Week of the year   Tasks
1           25th               Basic function to convert the lookup table to Apertium stream format 50h, TDD 10h
2           26th               Create closed-class <sdefs> 20h, frequent words added 20h, TDD 20h
3           27th               Create open-class <sdefs> 30h, basic words added 10h, TDD 20h
4           28th               Completion of monodix 20/30h, verify the tag definition files 15/40h
First Deliverable
5           29th               Bilingual dictionary completion 40h, TDD 20/40h
6           30th               Basic transfer rules 30/40h, TDD 20/30h
7           31st               Constraint Grammar of existing 30/40h
8           32nd               Stabilize the language pair by regression testing (as a new language pair)
Second Deliverable
9           33rd               Separate character transfer added 50h (many papers use this) 50/70h
10          34th               Focus on reducing ambiguity 20/40h, CG 30/40h
11          35th               Performance measurement
12          36th               Performance measurement
Finalization
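
The first task in the schedule, turning a lookup table into Apertium stream format, might look roughly like the following. This is a minimal sketch and not the tool in the repository: the two-entry word list, the `<n>` tag, and the function name `to_stream` are assumptions for illustration. It segments by greedy longest match, which is one common choice for unsegmented Chinese text, and marks unmatched characters as unknown with `*`.

```python
# Hypothetical word list mapping Simplified Chinese words to a tag (illustrative only).
LOOKUP = {
    "计算机": "n",
    "语言": "n",
}


def to_stream(text: str, table: dict[str, str]) -> str:
    """Segment by greedy longest match and emit ^surface/lemma<tag>$ units."""
    max_len = max((len(key) for key in table), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                units.append(f"^{chunk}/{chunk}<{table[chunk]}>$")
                break
        else:
            # No entry matched: emit a single character marked as unknown.
            size = 1
            units.append(f"^{text[i]}/*{text[i]}$")
        i += size
    return " ".join(units)


print(to_stream("计算机语言学", LOOKUP))
```

Greedy longest match is simple but can pick the wrong segmentation when a shorter word followed by another word is the intended reading, which is one reason the later weeks set aside time for reducing ambiguity.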

Coding challenge

  • Install Apertium (see Minimal installation from SVN)
  • Go through the HOWTO
  • Go through the MT course here (or here)
  • Write a translator that translates as much of this story as possible. Minimum one sentence. (Other translations of the story are here.)

If there is no translation, translate it into the languages of your language pair first.

  • Upload your work to Apertium SVN.

Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair

Here is the automorf.bin output for sentence one.

Testm1.png

Here is the translation output for sentence one.

Tests1.png