Difference between revisions of "User:Sphinx/Application for "Adopt a language pair" GSOC 2013"
Revision as of 03:02, 26 April 2013
Contact information
Name: Yishan Jiang
Email: yishanj13@gmail.com
IRC: sphinx
Github repo: https://github.com/sphinx-jiang
Tasks and proposed ideas
Adopt a language pair: zh_CN-zh_TW (Simplified Chinese to Traditional Chinese)
Related tasks:
- Writing linguistic data, including morphological rules and transfer rules, which are specified in a declarative language.
- A Constraint Grammar will be written if necessary.
Proposed idea
To create a translator consisting of:
- Morphological dictionaries for Simplified Chinese and Traditional Chinese: apertium-zh_CN-zh_TW.zh_CN.dix, apertium-zh_CN-zh_TW.zh_TW.dix
- Bilingual dictionary: apertium-zh_CN-zh_TW.dix
- Transfer rules: apertium-zh_CN-zh_TW.zh_CN-zh_TW.t1x, apertium-zh_CN-zh_TW.zh_TW-zh_CN.t1x
The finished translator should be testvoc clean and reach a coverage of around 80% or more on a range of free corpora.
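The coverage target could be checked with a small script along the following lines. This is only a rough sketch and not part of the Apertium tooling: it assumes the corpus has already been run through the morphological analyser, that the output uses the Apertium stream format (^surface/analysis$), and that unknown words are marked with a leading asterisk in the analysis. The function name naive_coverage, the sample text, and the tags in it are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# Matches one lexical unit in the Apertium stream format: ^surface/analyses$
# (simplification: escaped "/" or "$" inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have a known analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        analyses = match.group(2)
        # Assumption: an unknown word is output as ^word/*word$
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical analyser output: three known words and one unknown word
    sample = "^我/我<prn>$ ^软件/*软件$ ^你/你<prn>$ ^好/好<adj>$"
    print(f"coverage: {naive_coverage(sample):.2f}")
```

Running the same count over several free corpora would give the "around 80% or more" figure to compare against.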