User:Memduh/GSoC 2017

From Apertium
Revision as of 10:03, 16 April 2017 by Unhammer (talk | contribs) (→‎Qualification)

Draft for GSoC 2017 project proposal.

Proposal to create and develop a Crimean Tatar-Turkish translation pair.

Personal Information

Name: Memduh Gökırmak

Email address: memduhg@gmail.com

UTC+3 Time Zone

IRC: fotonzade

Why is it you are interested in machine translation?

The study of natural language processing is fascinating to me, and machine translation is a remarkably practical application of the field, readily usable by and appealing to most of the world.

Why is it that you are interested in Apertium?

Rule-based machine translation facilitates the automatic translation of languages that suffer from a scarcity of resources, and so makes it possible to work with interesting languages from Kalmyk to Zazaki. The open-source nature of Apertium and the energy and communication of the community are also particularly appealing to me.

Proposal: Crimean Tatar-Turkish MT

Why should Google and Apertium sponsor this proposal?

Which of the published tasks are you interested in? What do you plan to do?

I will develop a translation pair between Crimean Tatar and Turkish. This involves writing and revising transfer and lexical selection rules to the point that the output of the system becomes intelligible, valid text in the target language.

Who will it benefit in society, and how?

It will facilitate communication between Turkish and Crimean Tatar speakers, who can already understand each other's written standards to some extent. If Qırımtatar and Turkish text can be translated into each other, this will also support the efforts of speakers of each language to publish and understand material in the other. It may also be new ground to explore for Turkish Turkology, which seems to have focused more on the other language called Tatar, the one spoken in the Republic of Tatarstan in the Russian Federation.

Major Goals

  • Around 95% coverage
  • Word error rate (WER) comparable to other inter-Turkic/Romance pairs.
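
To make the WER goal concrete, here is a minimal sketch of how a word error rate could be computed: the word-level edit distance between a reference translation and the system output, divided by the number of words in the reference. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not part of the proposal, and the proposal does not specify which evaluation script would actually be used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; tokenisation is plain
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from output
                curr[j - 1] + 1,     # extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substitution and one missing word.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, a hypothesis much longer than the reference can score above 1.0; the value is an error rate, not a bounded accuracy.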

Obstacles

  • Lack of (Turkish) resources. There are no Turkish-Qırımtatar dictionaries available. This may actually be to our advantage in that we could publish our bidix as a printed dictionary.

Resources

  • Wikipedia
  • Wiktionary
  • Turkish Language Institute's Comparative Syntax
  • Russian-Qırımtatar dictionary here: http://medeniye.org/lugat
  • Various other resources in Russian
  • Darya Kavitskaya's grammar

Work Plan

  • Post-application period:
    • Facilitating MT of a children's story from Crimean Tatar to Turkish.

  • Community-bonding period:
    • bidix words, up to 50%
  • Month 1:
    • Writing scripts
    • Adding words to bidix, getting coverage to around 80%
    • Chunking
    • Transfer rules
    • Begin CG for CRH
  • Month 2:
    • POS tagging/constraint grammar
    • Transfer rules
    • Get CG rules up to 100, ~50% disambiguation
    • >90% coverage
  • Month 3:
    • Creation of an Annotated Corpus
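
The coverage targets in the plan above can be tracked with a small script. The sketch below assumes that coverage means the share of tokens that receive at least one analysis, and that the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis starts with `*`. The function name `naive_coverage` and the sample line are hypothetical, and the regular expression ignores escaped characters, so this is a rough estimate rather than a full parser.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
# Simplifying assumption: no escaped "/", "^" or "$" inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Fraction of lexical units with at least one analysis.

    A unit counts as unknown when its analysis part starts with '*'.
    """
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found in input")
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


if __name__ == "__main__":
    # Hand-written sample, not real analyser output.
    sample = "^ev/ev<n><nom>$ ^foo/*foo$ ^ve/ve<cnjcoo>$ ^bar/*bar$"
    print(f"{naive_coverage(sample):.0%}")
```

Running this over the analysed form of a fixed corpus after each batch of dictionary additions would show whether the 50%, 80% and 90% milestones are being reached.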

Plan by Weeks

1. Coverage
2. Basic CG
3. Coverage
4. Transfer

5. Coverage
6. Transfer, lexical selection
7. CG
8. Transfer, lexsel

9. Transfer
10. CG, Transfer
11. Transfer, lexsel
12. Transfer

13. Preparing text for annotation
14-16. Annotating the Crimean Tatar corpus

Deliverables

  • Word error rate (WER) comparable to other inter-Turkic/Romance pairs.
  • Data for machine-learned disambiguation.

Summer Obligations and Commitments

I will work as an intern at a tech startup for 20 days, and will also take summer classes one day a week for two months.

Qualification

I am a fourth-year computer engineering student at Istanbul Technical University, and part of the ITU NLP team. I worked on the conversion of the ITU Turkish Treebanks to Universal Dependencies format (UD Turkish) (Sulubacak et al., 2016), and have co-written a paper on MWEs in Turkish (Adalı et al., 2016). I have been accepted into the Erasmus Mundus LCT (Language and Communication Technology) Master's program.