User:Sphinx/GSoC 2013 Application: "Chinese(simple)-Chinese(traditional) language pair"

From Apertium
Revision as of 02:31, 28 April 2013 by Sphinx

Contact information

Name: Yishan Jiang

Email: yishanj13@gmail.com

IRC: sphinx

Github repo: https://github.com/sphinx-jiang

Tasks and proposed ideas

Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)

Related tasks:

  • Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
  • A Constraint Grammar will be written if necessary.

Proposed idea

To create a translator with

  • Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
  • Bilingual dictionary: apertium-zh_CN-zh_TW.dix
  • Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.

which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
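
As an illustration of how the coverage target could be checked, here is a minimal Python sketch. It assumes the analyser output is in Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word is written as ^word/*word$. The function name coverage, the sample sentence, and its tags are made up for the example, and escaped reserved characters are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (assumption: no escaped '^' or '$' inside a unit)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        parts = unit.split("/")
        # An unknown word comes out as ^word/*word$, so its first
        # analysis starts with '*'.
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: two known words and one unknown word.
sample = "^我/我<prn>$^喜欢/喜欢<vblex>$^苹果/*苹果$"
print(f"{coverage(sample):.2%}")  # prints 66.67%
```

Running the same function over the analysed text of each free corpus would give one coverage figure per corpus to compare against the 80% goal.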

Work plan

Task                             Coding hours   TDD hours
Apertium stream format tool      30-40          20-30
Closed-class words               60-70          40-50
Open-class words                 40-50          20-30
Transfer rules (basic)           40-50          20-30
Transfer rules (supplementary)   30-40          20-30
CG and ambiguity reduction       50-60          30-40

There are 3 periods of 4 weeks each. The workload for the period covering the 25th to 28th weeks of the year is 105h, and for the others 140h.

TDD = Testing/Debugging/Documentation

GSoC week   Week of the year   Tasks
1           25th week          Basic function to convert the lookup table to Apertium stream format 50h, TDD 10h
2           26th week          Create closed-class <sdefs> 20h, frequent words added 20h, TDD 20h
3           27th week          Create open-class <sdefs> 30h, basic words added 10h, TDD 20h
4           28th week          Completion of monodix 20/30h, verify the tag definition files 15/40h
First Deliverable
5           29th week          Bilingual dictionary completion 40h, TDD 20/40h
6           30th week          Basic transfer rules 30/40h, TDD 20/30h
7           31st week          Constraint Grammar of existing 30/40h
8           32nd week          Stabilize the language pair by regression testing (as a new language pair)
Second Deliverable
9           33rd week          Separate character transfer added 50h (many papers use this) 50/70h
10          34th week          Focus on reducing ambiguity 20/40h, CG 30/40h
11          35th week          Performance measurement
12          36th week          Performance measurement
Finalization
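
The week 1 task, converting a lookup table to Apertium stream format, could look roughly like the Python sketch below. This is only an illustration under assumptions: the lookup table is taken to be a plain mapping from Simplified to Traditional strings, the function name to_stream and the three table entries are hypothetical, segmentation is a simple greedy longest match, and escaping of reserved stream characters is left out.

```python
from typing import Mapping


def to_stream(text: str, table: Mapping[str, str]) -> str:
    """Greedy longest-match lookup that emits ^source/target$ units."""
    longest = max((len(key) for key in table), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                units.append(f"^{chunk}/{table[chunk]}$")
                i += size
                break
        else:
            # No entry found: mark the character as unknown with '*'.
            units.append(f"^{text[i]}/*{text[i]}$")
            i += 1
    return "".join(units)


# Hypothetical table entries; note that 发 maps to 發 or 髮 depending on the word.
table = {"头发": "頭髮", "发": "發", "现": "現"}
print(to_stream("发现头发", table))  # prints ^发/發$^现/現$^头发/頭髮$
```

The multi-character entry 头发 wins over the single-character entry 发 at the end of the sentence, which is the kind of ambiguity the later weeks of the plan are meant to address with proper rules.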

Coding challenge

  • Install Apertium (see Minimal installation from SVN)
  • Go through the HOWTO
  • Go through the MT course here (or here)
  • Write a translator that translates as much of this story as possible, at minimum one sentence. (Other translations of the story are here.)

If there is no translation, translate it into the languages of your language pair first.

  • Upload your work to Apertium SVN.

Uploaded here: https://github.com/sphinx-jiang/apertium_language-pair

Here is the automorf.bin output for sentence one.

Testm1.png

Here is the translation output for sentence one.

Tests1.png