Difference between revisions of "User:Sphinx/Application for "Adopt a language pair" GSOC 2013"

From Apertium

Revision as of 02:30, 28 April 2013

Contact information

Name: Yishan Jiang

Email: yishanj13@gmail.com

IRC: sphinx

Github repo: https://github.com/sphinx-jiang

Why are you interested in machine translation?

Why are you interested in the Apertium project?

Why should Google and Apertium sponsor it?

How will it benefit society, and whom?

Which of the published tasks are you interested in? What do you plan to do?

Tasks and proposed ideas

Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)

Related tasks:

  • Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
  • A Constraint Grammar will be written if necessary.

Proposed idea

To create a translator with

  • Monolingual morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
  • Bilingual dictionary: apertium-zh_CN-zh_TW.dix
  • Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x

which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
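
To make the 80% coverage target concrete, here is a minimal sketch of a naive coverage count over analyser output in Apertium stream format, where an unknown word is marked with a leading `*` in its analysis. The function name, the sample sentence and its tags are illustrative assumptions, and escaped reserved characters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped reserved characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown word by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output; the tags <prn> and <vblex> are only illustrative.
sample = "^我/我<prn>$ ^爱/爱<vblex>$ ^苹果/*苹果$"
print(f"{naive_coverage(sample):.2%}")  # two of three units are known: 66.67%
```

Run over the analysed form of a whole corpus, the same count gives the figure to compare against the 80% goal.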

Work plan

Task                             Coding hours   TDD hours
Apertium stream format tool      30-40          20-30
Closed-class words               60-70          40-50
Open-class words                 40-50          20-30
Transfer rules (basic)           40-50          20-30
Transfer rules (supplementary)   30-40          20-30
CG and ambiguity reduction       50-60          30-40

The work is split into three periods of four weeks each. The workload for the period covering weeks 25 to 28 of the year is 105 h, and for each of the others 140 h.

TDD = Test/Debugging/Documentation
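
As a quick check of the estimates, the per-task ranges in the work plan can be summed. This is only an arithmetic sketch: the numbers are copied from the table, the variable names are hypothetical, and reading 105 h and 140 h as per-period totals is an assumption.

```python
# (task, coding hours (low, high), TDD hours (low, high)), copied from the work plan
WORK_PLAN = [
    ("Apertium stream format tool", (30, 40), (20, 30)),
    ("Closed-class words", (60, 70), (40, 50)),
    ("Open-class words", (40, 50), (20, 30)),
    ("Transfer rules (basic)", (40, 50), (20, 30)),
    ("Transfer rules (supplementary)", (30, 40), (20, 30)),
    ("CG and ambiguity reduction", (50, 60), (30, 40)),
]


def total_range(rows, column):
    """Sum the low and high ends of the hour range stored at the given column."""
    low = sum(row[column][0] for row in rows)
    high = sum(row[column][1] for row in rows)
    return low, high


coding = total_range(WORK_PLAN, 1)  # (250, 310)
tdd = total_range(WORK_PLAN, 2)     # (150, 210)
overall = (coding[0] + tdd[0], coding[1] + tdd[1])  # (400, 520)

# Assumption: 105 h and 140 h are totals for each four-week period.
PERIOD_HOURS = (105, 140, 140)
print(coding, tdd, overall, sum(PERIOD_HOURS))
```

Under that assumption the three periods add up to 385 h, slightly below the 400 h lower bound of the per-task estimates.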

GSoC week   Week of the year   Tasks
1           25th week          Basic function to convert the lookup table to Apertium stream format 50h, TDD 10h
2           26th week          Create closed-class <sdefs> 20h, add frequent words 20h, TDD 20h
3           27th week          Create open-class <sdefs> 30h, add basic words 10h, TDD 20h
4           28th week          Completion of monodix 20/30h, verify the tag definition files 15/40h
First Deliverable
5           29th week          Completion of the bilingual dictionary 40h, TDD 20/40h
6           30th week          Basic transfer rules 30/40h, TDD 20/30h
7           31st week          Constraint Grammar of existing 30/40h
8           32nd week          Stabilize the language pair by regression testing (as a new language pair)
Second Deliverable
9           33rd week          Add separate-character transfer 50h (many papers use this) 50/70h
10          34th week          Focus on reducing ambiguity 20/40h, CG 30/40h
11          35th week          Performance measurement
12          36th week          Performance measurement
Finalization
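
The week 1 task (lookup table to Apertium stream format) could look roughly like the sketch below. It is one possible reading of that task, not the actual tool: the table contents and function name are hypothetical, the input is assumed to be already segmented into words, and reserved characters are not escaped.

```python
# Hypothetical lookup table: Simplified Chinese word -> Traditional Chinese word.
LOOKUP = {"我": "我", "爱": "愛", "苹果": "蘋果"}


def to_stream(tokens, table):
    """Render pre-segmented tokens as lexical units in Apertium stream format."""
    units = []
    for token in tokens:
        if token in table:
            # Known word: ^surface/target$
            units.append(f"^{token}/{table[token]}$")
        else:
            # Unknown word: marked with '*', as the Apertium analyser does.
            units.append(f"^{token}/*{token}$")
    return " ".join(units)


print(to_stream(["我", "爱", "苹果", "们"], LOOKUP))
# ^我/我$ ^爱/愛$ ^苹果/蘋果$ ^们/*们$
```

Keeping unknown words in the stream with the `*` mark lets later stages, and coverage counts, see exactly which entries are still missing from the table.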

Coding challenge

  • Install Apertium (see Minimal installation from SVN)
  • Go through the HOWTO
  • Go through the MT course here (or here)
  • Write a translator that translates as much of this story as possible, at least one sentence. (Other translations of the story are here.)

If there is no translation, translate it into the languages of your language pair first.

  • Upload your work to Apertium SVN.

Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair

Here is the automorf.bin output for sentence one.

Testm1.png

Here is the translation output for sentence one.

Tests1.png