Difference between revisions of "User:Sphinx/Application for "Adopt a language pair" GSOC 2013"

From Apertium
Line 9: Line 9:
 
'''Github repo: ''' https://github.com/sphinx-jiang
 
 
== Tasks and proposed ideas ==
 
  +
=== Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese) ===
=== Interface for creating tagged corpora ===
 
 
Related tasks:
 
  +
*Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
Interface for creating tagged corpora
 
  +
*A Constraint Grammar will be written if necessary.
* Making an interface where you can load a raw text of, say, 30,000 words or (optionally) create a corpus of X size for a given language from Wikipedia
 
* The interface should be able to take a non-disambiguated tagged corpus and let the user disambiguate it manually
 
* A user-friendly interface to train a supervised tagger
 
 
==== Proposed idea ====
 
  +
To create a translator with
To create a tagged corpus there are 2 interfaces: the first one to load a corpus (and modify it if you want), and the second one to actually tag the corpus (for this task I have designed 2 versions).
 
  +
*Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
 
  +
*Bilingual dictionary: apertium-zh_CN-zh_TW.dix
The first interface is very simple. It is the '''Input corpus file UI''':
 
  +
*Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.
[[Image:Apertium input file UI.png|center|Input corpus file UI]]
 
  +
which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
 
==== How it works: ====
 
* Select a corpus file and load its content into the text view

* Load a corpus that is already tagged

* Load a compressed Wikipedia dump file

* Modify the current corpus (or enter a new corpus directly from scratch)

* Once you finish editing, click the apply button and the corpus tagger UI appears
 
 
 
After you insert a file and click apply, the corpus is tagged (if it is not tagged already) and the '''supervised corpus tagger UI''' is shown:
 

Revision as of 03:02, 26 April 2013

Contact information

Name: Yishan Jiang

Email: yishanj13@gmail.com

IRC: sphinx

Github repo: https://github.com/sphinx-jiang

Tasks and proposed ideas

Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)

Related tasks:

  • Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
  • A Constraint Grammar will be written if necessary.

Proposed idea

The goal is to create a translator with

  • Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
  • Bilingual dictionary: apertium-zh_CN-zh_TW.dix
  • Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x.

which is testvoc clean and has a coverage of around 80% or more on a range of free corpora.
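
As a rough illustration of how the coverage target could be checked, the sketch below counts how many lexical units in an analysed corpus received at least one analysis. It assumes the corpus has already been run through the morphological analyser and is in the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word has an analysis that starts with *. The function name coverage and the sample string are made up for illustration, and escaped characters in the stream are ignored for brevity.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: escaped characters such as \^ or \/ do not occur in the input.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown word by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: three known units and one unknown unit.
sample = "^我/我<prn>$^喜欢/喜欢<vblex>$^蘋菓/*蘋菓$^。/。<sent>$"
print(f"{coverage(sample):.0%}")  # prints 75%
```

This naive measure counts every lexical unit equally, including punctuation, so the figure it gives is only an approximation of the coverage that would be reported for the pair.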