Difference between revisions of "User:Oğuz/GSoC 2019"

From Apertium
(Created page with "GSoC 2019 proposal draft to create and develop Uyghur-Turkish translation pair. == Personal Information == Name: Oğuzhan Kuyrukçu E-mail: kuyrukcuoguz@gmail.com Phone n...")
 
 
(12 intermediate revisions by 2 users not shown)
Line 1: Line 1:
GSoC 2019 proposal draft to create and develop Uyghur-Turkish translation pair.
GSoC 2019 proposal draft to improve four Turkic translation pairs.


== Personal Information ==
== Personal Information ==
Line 21: Line 21:




== Proposal: 4 language pairs up to release quality ==
== Proposal: Bringing 4 language pairs up to release quality ==


'''Which of the published tasks are you interested in? What do you plan to do?'''
'''Which of the published tasks are you interested in? What do you plan to do?'''


My plan is to adopt 4 unreleased language pairs, uig-tur, kyr-tur, tat-tur and uzb-tur. I'll be working to bring them up to release quality, which will involve writing and refining rules for transfer and lexical selection that will result in a valid text in the target language.
My plan is to adopt 4 unreleased language pairs: tur->uig, kir->tur, tat->tur and uzb->tur (in these directions). I'll be working to bring them up to release quality, which will involve writing and refining transfer and lexical selection rules, expanding the dictionaries, and running testvoc, so that the output is valid text in the target language.

{{comment|It's kir, not kyr. Also, a big part of what's involved is [[testvoc]], not just writing and refining rules. Probably expanding dictionaries is fairly important as well. —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 04:55, 5 April 2019 (CEST)}}




'''Why should Google and Apertium sponsor it?'''
'''Why should Google and Apertium sponsor it?'''
An extensive Uyghur-Turkish machine translator is yet to be done and most of the research in Turkology is done through Turkish, compared to other Turkic languages such as Uyghur. As such, a machine translator would enable those working in Turkology and related fields to study Uyghur texts through Turkish. Furthermore, cultural contact between Turkish and Uyghur populations are increasing with migration and these populations can use this tool to familiarize themselves with each other's culture.


Apertium hosts numerous MT pairs for Turkic languages, but some of them haven't been worked to completion. By refining these pairs we'd be adding them to Apertium's released repertoire and covering important ground in Turkic computational linguistics.
'''Resources'''

---- uyghur ----
==Resources==
E.N Necip, Uyghurche-Turkche Lughet

A. B. Ercilasun, Türk Lehçeleri Grameri

A. F. Sjoberg, Uzbek Structural Grammar

J. Hebert and N. Poppe, Kirghiz Manual

N. Poppe, Tatar Manual


Rıdvan Öztürk, Yeni Uygur Türkçesi Grameri
Rıdvan Öztürk, Yeni Uygur Türkçesi Grameri


E. N. Necip, Uyghurche-Turkche Lughet
Wikipedia

R. Ehmetyanov et al, Türkçe-Tatarca Sözlük


Uyghur-English-Mandarin dictionary[http://dict.yulghun.com]
Uyghur-English-Mandarin dictionary[http://dict.yulghun.com]

--- uyghur ----
Pamukkale University's Turkish-Kyrgyz[http://ctle.pau.edu.tr/kgtr/], Turkish-Tatar[http://ctle.pau.edu.tr/tttr/] and Turkish-Uzbek[http://ctle.pau.edu.tr/uztr/] dictionaries

Indiana University's Uzbek-English dictionary[https://www.indiana.edu/~ctild/Main/Uzbek-EnglishDictionary/]

cevirce.com[http://cevirce.com/] for Turkish-Kyrgyz, Turkish-Uzbek and Turkish-Tatar translations

{{comment|How do you plan to use these? Looking up words manually? Scraping them? We have pages of resources for some of these languages with sources already listed—so why list them here? —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 04:56, 5 April 2019 (CEST)}}


== Work Plan ==
== Work Plan ==


-Post-application period:

Studying the grammars of languages where necessary (Tatar and Kyrgyz).
Tagging example sentences and writing rules.


-Community-bonding period:

Working on coverage. As of now all pairs have around 80% coverage.
Starting annotation and rule writing.


-Month 1:

Writing scripts

Adding words to the bidix to get coverage up from 80%

Chunking

Transfer rules

Begin CG rules


-Month 2:

POS tagging/constraint grammar

Transfer rules

Get CG rules up to 100, ~50% disambiguation

>90% coverage


-Month 3:

Creation of an Annotated Corpus


'''Plan by Weeks'''

1. All pairs up to 80% coverage

2. Basic CG, 82% coverage

3. 83% coverage

4. Transfer, 84% coverage

5. 85% coverage

6. Transfer, lexical selection

7. CG, 86% coverage

8. Transfer, lexsel, 87% coverage

9. 88% coverage

10. CG, Transfer

11. Transfer, lexsel, 89% coverage

12. Transfer, lexsel

13. Preparing texts for annotation

14-16. Annotating the corpora, 90% coverage on each pair




== Coding Challenge ==
== Coding Challenge ==


I will tag 20 sentences for each language pair and write 5 ambiguity rules.
I will tag 20 sentences for each language pair and write 5 disambiguation rules. I'll put these on GitHub.

{{comment|where are you putting it? —[[User:Firespeaker|Firespeaker]] ([[User talk:Firespeaker|talk]]) 05:00, 5 April 2019 (CEST)}}


== Deliverables ==
== Deliverables ==


WER comparable to other inter-Turkic/Romance pairs.
WER comparable to other inter-Turkic/Romance pairs.
Implementation of disambiguation rules.
Data for machine-learned disambiguation.
Data for machine-learned disambiguation.






Line 70: Line 156:


I'm a 3rd year student of linguistics at Boğaziçi University. I've taken several computational linguistics classes as part of my studies. I've already worked with Apertium last year on an MT project between Uyghur and Turkish, which I believe helped me improve my understanding of computational linguistics.
I'm a 3rd year student of linguistics at Boğaziçi University. I've taken several computational linguistics classes as part of my studies. I've already worked with Apertium last year on an MT project between Uyghur and Turkish, which I believe helped me improve my understanding of computational linguistics.


[[Category:GSoC 2019 student proposals]]

Latest revision as of 13:00, 7 April 2019

GSoC 2019 proposal draft to improve four Turkic translation pairs.

Personal Information

Name: Oğuzhan Kuyrukçu

E-mail: kuyrukcuoguz@gmail.com

Phone number: +905414785653

IRC: oguz

Time zone: UTC+3


Why is it that you are interested in Apertium?

I'm a student of linguistics and I recently took up an interest in computational linguistics. I worked with Apertium last year on a Machine Translation project and had a great experience. I'd like to do that again.


Proposal: Bringing 4 language pairs up to release quality

Which of the published tasks are you interested in? What do you plan to do?

My plan is to adopt 4 unreleased language pairs: tur->uig, kir->tur, tat->tur and uzb->tur (in these directions). I'll be working to bring them up to release quality, which will involve writing and refining transfer and lexical selection rules, expanding the dictionaries, and running testvoc, so that the output is valid text in the target language.

It's kir, not kyr. Also, a big part of what's involved is testvoc, not just writing and refining rules. Probably expanding dictionaries is fairly important as well. —Firespeaker (talk) 04:55, 5 April 2019 (CEST)


Why should Google and Apertium sponsor it?

Apertium hosts numerous MT pairs for Turkic languages, but some of them haven't been worked to completion. By refining these pairs we'd be adding them to Apertium's released repertoire and covering important ground in Turkic computational linguistics.

Resources

A. B. Ercilasun, Türk Lehçeleri Grameri

A. F. Sjoberg, Uzbek Structural Grammar

J. Hebert and N. Poppe, Kirghiz Manual

N. Poppe, Tatar Manual

Rıdvan Öztürk, Yeni Uygur Türkçesi Grameri

E. N. Necip, Uyghurche-Turkche Lughet

R. Ehmetyanov et al, Türkçe-Tatarca Sözlük

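Uyghur-English-Mandarin dictionary [http://dict.yulghun.com]

Pamukkale University's Turkish-Kyrgyz [http://ctle.pau.edu.tr/kgtr/], Turkish-Tatar [http://ctle.pau.edu.tr/tttr/] and Turkish-Uzbek [http://ctle.pau.edu.tr/uztr/] dictionaries

Indiana University's Uzbek-English dictionary [https://www.indiana.edu/~ctild/Main/Uzbek-EnglishDictionary/]

cevirce.com [http://cevirce.com/] for Turkish-Kyrgyz, Turkish-Uzbek and Turkish-Tatar translations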

How do you plan to use these? Looking up words manually? Scraping them? We have pages of resources for some of these languages with sources already listed—so why list them here? —Firespeaker (talk) 04:56, 5 April 2019 (CEST)

Work Plan

-Post-application period:

Studying the grammars of languages where necessary (Tatar and Kyrgyz). Tagging example sentences and writing rules.


-Community-bonding period:

Working on coverage. As of now all pairs have around 80% coverage. Starting annotation and rule writing.


-Month 1:

Writing scripts

Adding words to the bidix to get coverage up from 80%

Chunking

Transfer rules

Begin CG rules


-Month 2:

POS tagging/constraint grammar

Transfer rules

Get CG rules up to 100, ~50% disambiguation

>90% coverage


-Month 3:

Creation of an Annotated Corpus


Plan by Weeks

1. All pairs up to 80% coverage

2. Basic CG, 82% coverage

3. 83% coverage

4. Transfer, 84% coverage

5. 85% coverage

6. Transfer, lexical selection

7. CG, 86% coverage

8. Transfer, lexsel, 87% coverage

9. 88% coverage

10. CG, Transfer

11. Transfer, lexsel, 89% coverage

12. Transfer, lexsel

13. Preparing texts for annotation

14-16. Annotating the corpora, 90% coverage on each pair
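
The coverage targets in the plan can be tracked with a small script. What follows is only a minimal sketch, assuming the corpus has already been run through a morphological analyser that emits Apertium stream format, where each token looks like ^surface/analysis1/analysis2$ and an unknown word is marked with a leading asterisk (^foo/*foo$). The function names naive_coverage and top_unknown are hypothetical, and the regular expression ignores escaped characters, so it is an approximation rather than a full stream parser.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \$ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown token has a single "analysis" that starts with an asterisk.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def top_unknown(analysed_text: str, n: int = 10):
    """Return the n most frequent unknown surface forms with their counts."""
    units = LEXICAL_UNIT.findall(analysed_text)
    unknown = Counter(
        surface for surface, analyses in units if analyses.startswith("*")
    )
    return unknown.most_common(n)
```

For example, naive_coverage("^ev/ev<n><nom>$ ^foo/*foo$") gives 0.5, since one of the two tokens is unknown. The frequency list from top_unknown is one way to decide which words to add to the dictionaries first.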


Coding Challenge

I will tag 20 sentences for each language pair and write 5 disambiguation rules. I'll put these on GitHub.

where are you putting it? —Firespeaker (talk) 05:00, 5 April 2019 (CEST)

Deliverables

WER comparable to other inter-Turkic/Romance pairs. Implementation of disambiguation rules. Data for machine-learned disambiguation.
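
For reference, WER (word error rate) is usually computed as the word-level edit distance between the machine translation and a reference translation (here, presumably a post-edited one), divided by the number of words in the reference. The sketch below is a generic illustration of that definition, not the evaluation tool that would actually be used for these pairs; the function name word_error_rate is hypothetical and tokenisation is assumed to be plain whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With the reference "ben eve gidiyorum" and the hypothesis "ben okula gidiyorum", one substitution over three reference words gives a WER of about 0.33.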


Summer Obligations and Commitments

I have no scheduled commitments.


Qualification

I'm a 3rd year student of linguistics at Boğaziçi University. I've taken several computational linguistics classes as part of my studies. I've already worked with Apertium last year on an MT project between Uyghur and Turkish, which I believe helped me improve my understanding of computational linguistics.