User:Memduh/Proposal

Revision as of 06:57, 3 May 2016 by Unhammer (talk | contribs)

Proposal to bring the Kurmanji-English pair to state-of-the-art quality.

Accepted for GSoC 2016; see Kurmanji_and_English/Work_plan

Personal Information

Name: Memduh Gökırmak

Email address: memduhg@gmail.com

Time zone: UTC+2

IRC: memduh

Why is it you are interested in machine translation?

The study of natural language processing is fascinating to me, and machine translation is a remarkably practical application of this field, readily usable by and appealing to most of the world.

Why is it that you are interested in Apertium?

Rule-based machine translation facilitates the automatic translation of languages that suffer from a scarcity of resources, and so makes it possible to work with interesting languages from Kalmyk to Zazaki. The open-source nature of Apertium and the energy and communication of the community are also particularly appealing to me.

Proposal: Kurmanji-English MT

There are around 20 million people descended from Kurmanji-speaking families in the world. This proposal will facilitate the integration of this language into the international community, through aiding translation of news and information. It will also be helpful in the teaching of the language, which had been hindered for various reasons until recently.

Why should Google and Apertium sponsor this proposal?

Which of the published tasks are you interested in? What do you plan to do?

I plan to bring the Kurmanji-English language pair to state-of-the-art quality.

Who will it benefit in society, and how?

Due to the history and politics of the region, many descendants of Kurmanji-speaking families have not been able to learn Kurmanji to any degree of fluency, and the development of translation resources can be an immense help for the many people trying to learn this language. As efforts to learn Kurmanji become more widespread, Kurmanji content is also produced at an increasingly rapid rate.

Little work has been done on the machine translation of Kurmanji, and developing this language pair will be a useful step toward bringing the study of this language into the field of NLP.

Major Goals

  • Naive coverage of around 95%
  • Reduction of word error rate (WER) by about 40%
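
As a rough illustration of how the WER goal could be tracked, the sketch below computes word error rate as word-level edit distance divided by reference length. The function names are hypothetical, the example sentences are invented, and the proposal does not say whether the 40% reduction is relative or absolute; a relative reduction is assumed here purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction (assumed interpretation of the 40% goal)."""
    return (wer_before - wer_after) / wer_before


# Invented example: a post-edited reference and a raw MT output.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(word_error_rate(reference, hypothesis))
```

Under this reading, a system whose WER drops from 0.50 to 0.30 would meet the goal, since `relative_reduction(0.50, 0.30)` is about 0.4.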

Obstacles

  • Standardization: Kurmanji varies a lot from region to region, in both spoken and written language. The variety of Kurmanji spoken or written in a certain region tends to reflect the influence of other prominent languages, e.g. Turkish in Turkey, Sorani in the area of the Kurdistan Regional Government. As a result, entries in a dictionary may not match some usages, spellings or meanings.
  • Readily available corpora: Obtaining corpora for examination and reference during the development of the language pair will require more effort than for a "bigger" language. I have scraped two years' worth of articles from Rudaw, totalling around 1.2 million words, but for the sake of variety of topic and possible dialect influence, obtaining other corpora will be helpful.
  • Parallel corpora: Direct translations from Kurmanji to English are sparse enough that it may not be meaningful to attempt to gather a corpus from such resources. Various distributions of Ubuntu have files translated into Kurmanji, but these would require a good bit of processing to use and would probably not offer much insight into the language.

However, many novels and other types of texts have been translated between Turkish and Kurdish. Since I am a native speaker of Turkish and the Apertium community has a strong Turkic background, these translations will provide a useful reference.

Resources

  • Thackston's Kurmanji Grammar
  • Celadet Bedirxan and Roger Lescot's Grammar
  • A variety of dictionaries are available, the most easily accessible and possibly the most useful being Kurmanji-Turkish dictionaries.
  • Corpus scraped from Rudaw.
  • Wikipedia

Work Plan

  • Post-application period:
    • work on the Newroz speech
      • measure WER on it
    • gather more corpora: monolingual, Kurmanji-Turkish and if at all possible Kurmanji-English
    • write scripts for adding words/translations
    • familiarize myself with chunking, constraint grammar and lexical selection through documentation and articles.
  • Community-bonding period:
    • testvoc; it goes without saying that I will keep the testvoc clean throughout the development process.
    • continue writing scripts
  • Month 1:
    • Writing scripts
    • Adding words to monodix/bidix, get naive coverage to around 95%
    • Chunking
    • Transfer rules
  • Month 2:
    • POS tagging/constraint grammar
    • Transfer rules
  • Month 3:
    • Transfer rules
    • Lexical selection
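
The naive coverage target in the plan above could be checked with a small script along these lines. This is a minimal sketch, assuming the corpus has already been run through the morphological analyser and that its output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis. The function name and the tags in the sample are illustrative only, and escaped special characters in the stream are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
# Simplification: backslash-escaped ^, / and $ inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found in input")
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output; the tags are made up for this example.
sample = (
    "^mal/mal<n><f>$ ^mezin/mezin<adj>$ "
    "^xyzzy/*xyzzy$ ^e/bûn<vbser><pres>$"
)
print(f"{naive_coverage(sample):.2%}")
```

Run over a whole corpus, the same count gives the figure to compare against the 95% goal; the unknown tokens it skips are also a natural source for the list of words still to be added to the dictionaries.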

Plan by Weeks

  • Week 1: Adding words, paradigms, translations. Writing scripts for practical use in developing the pair.
  • Week 2: Continue adding to dictionaries, begin work on chunking.
  • Weeks 3-4: Continue adding words, paradigms, translations; add transfer rules, chunking.
  • Weeks 5-8: POS tagging, constraint grammar, transfer rules.
  • Weeks 9-12: Lexical selection, transfer rules.

Deliverables

  • 1: Coverage of 95%
  • 2: Kurmanji constraint grammar
  • 3: WER reduction of 40%

Summer Obligations and Commitments

I have no plans for the summer other than working on this proposal.

Qualification

I am a third-year computer engineering student at Istanbul Technical University, and part of the ITU NLP team. I have been working on the conversion of the ITU Turkish Treebanks to Universal Dependencies format, and have co-written a paper on MWEs in Turkish (Adalı et al., 2016).