Difference between revisions of "User:Oğuz/GSoC 2019"
Revision as of 20:23, 1 April 2019
GSoC 2019 proposal draft to create and develop a Uyghur-Turkish translation pair.
Personal Information
Name: Oğuzhan Kuyrukçu
E-mail: kuyrukcuoguz@gmail.com
Phone number: +905414785653
IRC: oguz
Time zone: UTC+3
Why is it that you are interested in Apertium?
I'm a student of linguistics and I recently took an interest in computational linguistics. I worked with Apertium last year on a machine translation project and had a great experience. I'd like to do that again.
Proposal: Bringing 4 language pairs up to release quality
Which of the published tasks are you interested in? What do you plan to do?
My plan is to adopt 4 unreleased language pairs: uig-tur, kyr-tur, tat-tur and uzb-tur. I'll be working to bring them up to release quality, which will involve writing and refining transfer and lexical selection rules so that the output is valid text in the target language.
Why should Google and Apertium sponsor it?
Apertium hosts numerous machine translation pairs for Turkic languages, but some of them have not been worked to completion. Refining these pairs would bring them into Apertium's released repertoire and cover important ground in Turkic computational linguistics.
Resources
Uyghur:
E. N. Necip, Uyghurche-Turkche Lughet
Rıdvan Öztürk, Yeni Uygur Türkçesi Grameri
Wikipedia
Uyghur-English-Mandarin dictionary [http://dict.yulghun.com]
Work Plan
-Post-application period:
Studying the grammars of languages where necessary (Tatar and Kyrgyz). Tagging exemplary sentences and writing rules.
-Community-bonding period:
Working on coverage. As of now all pairs have around 80% coverage. Starting annotation and rule writing.
-Month 1:
Writing scripts
Adding words to the bidix to raise coverage above 80%
Chunking
Transfer rules
Begin CG rules
-Month 2:
POS tagging/constraint grammar
Transfer rules
Get CG rules up to 100, ~50% disambiguation
>90% coverage
-Month 3:
Creation of an Annotated Corpus
Plan by Weeks
1. 80% coverage
2. Basic CG
3. 84% coverage
4. Transfer
5. 86% coverage
6. Transfer, lexical selection, 65% coverage
7. CG, 80% coverage
8. Transfer, lexsel, 84% coverage
9. Transfer
10. CG, Transfer
11. Transfer, lexsel, 86% coverage
12. Transfer, 88% coverage
13. Preparing text for annotation
14-16. Annotating the Uyghur corpus, 90% coverage
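The coverage targets above can be tracked with a small script. The following is a minimal sketch, assuming coverage means the share of tokens that receive at least one analysis, and assuming the input is analyser output in the Apertium stream format, where an unknown word appears as ^surface/*surface$. The function name and the sample analyses are hypothetical and only illustrate the format; escaped characters in the stream are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters such as \^ or \/ are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the share of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis that starts with '*' marks a word unknown to the analyser.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: three tokens, one of them unknown.
sample = (
    "^Men/men<prn><pers><p1><sg><nom>$ "
    "^kitab/*kitab$ "
    "^oqudum/oqu<v><tv><past><p1><sg>$"
)
print(f"{coverage(sample):.1%}")  # prints 66.7%
```

Running this over the same corpus each week would give comparable numbers for the weekly targets.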
Coding Challenge
I will tag 20 sentences for each language pair and write 5 disambiguation rules.
Deliverables
??
Word error rate (WER) comparable to other inter-Turkic/Romance pairs. Data for machine-learned disambiguation.
??
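For reference, the WER deliverable can be checked along these lines. This is a minimal sketch, assuming WER is the word-level edit distance (substitutions, insertions and deletions) between the system output and a reference translation, divided by the number of words in the reference. The function name and the placeholder sentences are hypothetical; in practice Apertium's own evaluation tooling would be used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Placeholder tokens: one substitution and one deletion against 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # prints 0.5
```

A lower value is better, and a value of 0.0 means the output matches the reference word for word.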
Summer Obligations and Commitments
I have no scheduled commitments.
Qualification
I'm a 3rd year student of linguistics at Boğaziçi University. I've taken several computational linguistics classes as part of my studies. I worked with Apertium last year on an MT project between Uyghur and Turkish, which I believe helped me improve my understanding of computational linguistics.