User:Memduh/Proposal

From Apertium
Revision as of 23:44, 23 March 2016 by Memduh (talk | contribs)

Personal Information

Name: Memduh Gökırmak

Email address: memduhg@gmail.com

Why is it you are interested in machine translation?

The study of natural language processing fascinates me, and machine translation is a remarkably practical application of the field, one that is readily usable by and appealing to most of the world.

Why is it that you are interested in Apertium?

Rule-based machine translation makes automatic translation feasible for languages that suffer from a scarcity of resources, and so makes it possible to work with interesting languages from Kalmyk to Zazaki. The open-source nature of Apertium and the energy and communication of its community are also particularly appealing to me.

Proposal

Which of the published tasks are you interested in? What do you plan to do?

I plan to bring the Kurmanji-English language pair to state-of-the-art quality.

Who will it benefit in society, and how?

Due to the history and politics of the region, many descendants of Kurmanji-speaking families have not been able to learn Kurmanji to any degree of fluency, and the development of translation resources can be an immense help to the many people trying to learn the language. As efforts to learn Kurmanji become more widespread, Kurmanji content is also being produced at an increasingly rapid rate.

Little work has been done on machine translation of Kurmanji, and developing this language pair will be a useful step toward bringing the language into the scope of NLP research.

Major Goals

  • Around 95% coverage
  • Reduction of word error rate (WER) by about 40%
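Both goals are measurable, so a minimal sketch of how they could be computed may help. It assumes WER is the word-level edit distance divided by the reference length, and it approximates coverage with a plain set lookup rather than a real morphological analyser. The function names, the example sentences, and the reading of "about 40%" as a relative reduction are all assumptions made for illustration; none of them come from the proposal or from Apertium's own tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def coverage(tokens, known_forms) -> float:
    """Fraction of tokens found in known_forms.

    known_forms is a hypothetical stand-in for the analyser: a real
    measurement would count tokens that receive at least one analysis.
    """
    tokens = list(tokens)
    if not tokens:
        raise ValueError("tokens must not be empty")
    return sum(1 for token in tokens if token in known_forms) / len(tokens)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative drop from baseline to improved, e.g. 0.4 for a 40% reduction."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Invented example data, not taken from any evaluation.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(coverage(["a", "b", "c", "d"], {"a", "b", "c"}))
    print(relative_reduction(0.5, 0.3))
```

Under the relative reading, a hypothetical baseline WER of 0.50 would need to fall to about 0.30 to meet the second goal.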

Obstacles

  • Standardization: Kurmanji varies considerably from region to region, in both its spoken and written forms. The variety used in a given region tends to reflect the influence of other prominent languages, e.g. Turkish in Turkey, or Sorani in the area administered by the Kurdistan Regional Government. As a result, dictionary entries may not match some usages, spellings, or meanings.
  • Readily available corpora: Obtaining corpora for examination and reference during development of the language pair will require more effort than it would for a "bigger" language. I have scraped two years' worth of articles from Rudaw (rudaw.net/kurmanci), totalling around 1.2 million words, but additional corpora will be helpful for covering a wider variety of topics and possible dialect influences.
  • Parallel corpora: Direct translations from Kurmanji to English are sparse enough that it may not be worthwhile to attempt to gather a corpus from them. Various distributions of Ubuntu have files translated into Kurmanji, but these would require considerable processing to use and would probably not offer much insight into the language.

However, many novels and other texts have been translated between Turkish and Kurdish. Since I am a native speaker of Turkish and the Apertium community has a strong Turkic background, these translations will provide a useful reference.

Resources

  • Thackston's Kurmanji Grammar
  • Celadet Bedirxan and Roger Lescot's Grammar
  • A variety of dictionaries are available, the most easily accessible and possibly the most useful being Kurmanji-Turkish dictionaries.

Work Plan

Qualification

I am a third-year computer engineering student at Istanbul Technical University and a member of the ITU NLP team. I have been working on converting the ITU Turkish Treebanks to the Universal Dependencies format, and have co-written a paper on multiword expressions (MWEs) in Turkish.

References