
User:Memduh/Proposal

From Apertium

Revision as of 01:17, 24 March 2016


Personal Information

Name: Memduh Gökırmak

Email address: memduhg@gmail.com

Why is it you are interested in machine translation?

The study of natural language processing fascinates me, and machine translation is a remarkably practical application of the field, one that most of the world can readily use and appreciate.

Why is it that you are interested in Apertium?

Rule-based machine translation makes automatic translation feasible for languages that suffer from a scarcity of resources, and so makes it possible to work with interesting languages from Kalmyk to Zazaki. The open-source nature of Apertium and the energy and communication of its community are also particularly appealing to me.

Proposal

Which of the published tasks are you interested in? What do you plan to do?

I plan to bring the Kurmanji-English language pair to state-of-the-art quality.

Who will it benefit in society, and how?

Due to the history and politics of the region, many descendants of Kurmanji-speaking families have not been able to learn Kurmanji to any degree of fluency, and the development of translation resources can be an immense help for the many people trying to learn this language. As efforts to learn Kurmanji become more widespread, Kurmanji content is also produced at an increasingly rapid rate.

Little work has been done on the machine translation of Kurmanji, and developing this language pair would be a useful step toward bringing the language into the field of NLP.

Major Goals

  • Around 95% coverage
  • Reduction of WER by about 40%
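
To make the WER goal concrete, here is a minimal sketch of how word error rate could be computed: the word-level edit distance between a reference translation and the system output, divided by the length of the reference. The function names are illustrative and not part of any Apertium tool, and reading "about 40%" as a relative reduction is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed reading)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

Under the relative reading, moving from a baseline WER of 0.50 to 0.30 would count as a 40% reduction, since (0.50 - 0.30) / 0.50 = 0.40.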

Obstacles

  • Standardization: Kurmanji varies a lot from region to region, in both the spoken and the written language. The variety used in a given region tends to reflect the influence of other prominent languages, e.g. Turkish in Turkey, or Sorani in the areas administered by the Kurdistan Regional Government. As a result, entries in a dictionary may not match some usages, spellings, or meanings.
  • Readily available corpora: Obtaining corpora for examination and reference during the development of the language pair will require more effort than for a "bigger" language. I have scraped two years' worth of articles from Rudaw, totalling around 1.2 million words, but obtaining other corpora will be helpful for variety of topic and to cover possible dialect influence.
  • Parallel corpora: Direct translations between Kurmanji and English are scarce enough that it may not be worthwhile to attempt to gather a corpus from such resources. Various distributions of Ubuntu have files translated into Kurmanji, but these would require a good deal of processing to use and would probably not offer much insight into the language.

Many novels and other types of texts have, however, been translated between Turkish and Kurdish. Since I am a native speaker of Turkish and the Apertium community has a strong Turkic background, these translations will provide a useful reference.

Resources

  • Thackston's Kurmanji Grammar
  • Celadet Bedirxan and Roger Lescot's Grammar
  • A variety of dictionaries are available, the most easily accessible and possibly the most useful being Kurmanji-Turkish dictionaries.

Work Plan

  • Post-application period:
    • work on Newroz speech
      • get WER
    • gather more corpora: monolingual, Kurmanji-Turkish and if at all possible Kurmanji-English
    • write scripts for adding words/translations
    • familiarize myself with constraint grammar and lexical selection through documentation and articles.
  • Community-bonding period:
    • testvoc; it goes without saying that I will clean up testvoc errors fairly regularly throughout the development process.
    • continue writing scripts
  • Month 1:
    • Writing scripts
    • Adding words to monodix/bidix, get naive coverage to around 95%
    • Transfer rules
  • Month 2:
    • POS tagging/constraint grammar
    • Transfer rules
  • Month 3:
    • Transfer rules
    • Lexical selection
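
As an illustration of the kind of script meant by "scripts for adding words/translations" in the plan above, here is a minimal sketch that turns a tab-separated word list into bilingual dictionary (bidix) entries. The input layout (Kurmanji lemma, English lemma, part-of-speech tag), the function names, and the sample words are assumptions made for this example; monodix entries are left out because they also require a paradigm.

```python
from xml.sax.saxutils import escape, quoteattr


def bidix_entry(left: str, right: str, pos: str) -> str:
    """Build one bidix <e> element for a pair of lemmas sharing a POS tag."""

    def lemma(text: str) -> str:
        # Escape XML special characters; spaces inside a lemma become <b/>.
        return escape(text.strip()).replace(" ", "<b/>")

    tag = f"<s n={quoteattr(pos.strip())}/>"
    return f"<e><p><l>{lemma(left)}{tag}</l><r>{lemma(right)}{tag}</r></p></e>"


def entries_from_tsv(lines) -> list:
    """Convert 'kurmanji<TAB>english<TAB>pos' lines into bidix entries."""
    entries = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        entries.append(bidix_entry(*fields))
    return entries


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = ["# kurmanji\tenglish\tpos", "mal\thouse\tn", "pirtûk\tbook\tn"]
    print("\n".join(entries_from_tsv(sample)))
```

The generated lines would still need to be reviewed by hand and placed in the appropriate section of the bidix file before compiling.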


Work Plan (weekly schedule)

  • Week 1: Adding words, paradigms, translations. Writing scripts for practical use in developing the pair.
  • Weeks 2-4: Continue adding words, paradigms, translations; adding transfer rules
  • Weeks 5-8: POS tagging, constraint grammar, more transfer rules
  • Weeks 9-12: Lexical selection, transfer rules.

Deliverables

  • Naive coverage of 95%
  • WER reduction of 40%
  • Kurmanji constraint grammar
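
The naive coverage deliverable can be measured roughly as follows: run a corpus through the morphological analyser and count the share of tokens that receive at least one analysis. The sketch below assumes analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis begins with `*`. The regular expression ignores escaped characters, and the sample tags are made up for illustration.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$ (simplified, no escapes).
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Share of lexical units whose analysis is not marked unknown with '*'."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        analyses = match.group(2)
        if not analyses.startswith("*"):
            known += 1
    if total == 0:
        raise ValueError("no lexical units found in input")
    return known / total


if __name__ == "__main__":
    # Hypothetical analyser output: two known tokens, two unknown ones.
    sample = "^ez/ez<prn>$ ^xyz/*xyz$ ^mal/mal<n>$ ^qwerty/*qwerty$"
    print(f"naive coverage: {naive_coverage(sample):.2%}")
```

This is "naive" in the sense that a token counts as covered if it receives any analysis at all, whether or not that analysis is the correct one in context.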


Qualification

I am a third-year computer engineering student at Istanbul Technical University and a member of the ITU NLP team. I have been working on converting the ITU Turkish Treebanks to the Universal Dependencies format, and have co-written a paper on multiword expressions (MWEs) in Turkish.
