Difference between revisions of "User:Memduh/Proposal"

Revision as of 00:17, 24 March 2016

Personal Information

Name: Memduh Gökırmak

Email address: memduhg@gmail.com

Why is it you are interested in machine translation?

Natural language processing fascinates me, and machine translation is a remarkably practical application of the field, one that is readily usable by and appealing to most of the world.

Why is it that you are interested in Apertium?

Rule-based machine translation makes it possible to translate languages that suffer from a scarcity of resources automatically, and so to work with interesting languages from Kalmyk to Zazaki. The open-source nature of Apertium and the energy and communication of its community are also particularly appealing to me.

Proposal

Which of the published tasks are you interested in? What do you plan to do?

I plan to bring the Kurmanji-English language pair to state-of-the-art quality.

Who will it benefit in society, and how?

Due to the history and politics of the region, many descendants of Kurmanji-speaking families have not been able to learn Kurmanji to any degree of fluency, and the development of translation resources can be an immense help for the many people trying to learn this language. As efforts to learn Kurmanji become more widespread, Kurmanji content is also produced at an increasingly rapid rate.

Little work has been done on machine translation of Kurmanji, and developing this language pair will be a useful step toward bringing the language into the field of NLP.

Major Goals

Around 95% coverage
Reduction of WER by about 40%
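
To make the second goal concrete, here is a minimal sketch of how WER could be measured: word-level edit distance between a reference translation and the system output, divided by the length of the reference. The function names are hypothetical, and reading "reduction by about 40%" as a relative reduction against a baseline is an assumption, not something stated above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(baseline_wer: float, current_wer: float) -> float:
    """Share of the baseline WER that has been removed (assumed reading of the goal)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - current_wer) / baseline_wer
```

Under this reading, the goal would be met once `relative_reduction(baseline, current)` reaches about 0.4 on a fixed test text such as the one named in the work plan.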

Obstacles

Standardization: Kurmanji varies a lot from region to region, in both the spoken and the written language. The variety spoken or written in a given region tends to reflect the influence of other prominent languages, e.g. Turkish in Turkey, Sorani in the Kurdistan Regional Government. As a result, dictionary entries may not match some usages, spellings or meanings.
Readily available corpora: Obtaining corpora for examination and reference during the development of the language pair will require more effort than for a "bigger" language. I have scraped two years' worth of articles from Rudaw, totalling around 1.2 million words, but other corpora will be needed to cover a wider range of topics and possible dialect influences.
Parallel corpora: Direct translations from Kurmanji to English are sparse enough that it may not be meaningful to attempt to gather a corpus from such resources. Various distributions of Ubuntu have files translated into Kurmanji, but these would require a good bit of processing to use and would probably not offer much insight into the language.

However, many novels and other types of texts have been translated between Turkish and Kurdish. Since I am a native speaker of Turkish and the Apertium community has a strong Turkic background, these translations will provide a useful reference.

Resources

Thackston's Kurmanji Grammar
Celadet Bedirxan and Roger Lescot's Grammar
A variety of dictionaries are available, the most easily accessible and possibly the most useful being Kurmanji-Turkish dictionaries.

Work Plan

Post-application period:
- work on Newroz speech
  - get WER
- gather more corpora: monolingual, Kurmanji-Turkish and if at all possible Kurmanji-English
- write scripts for adding words/translations
- familiarize myself with constraint grammar and lexical selection through documentation and articles.
Community-bonding period:
- testvoc; it goes without saying that I will clean the testvoc fairly regularly throughout the development process.
- continue writing scripts
Month 1:
- Writing scripts
- Adding words to monodix/bidix, get naive coverage to around 95%
- Transfer rules
Month 2:
- POS tagging/constraint grammar
- Transfer rules
Month 3:
- Transfer rules
- Lexical selection
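
The scripts for adding words and translations mentioned in the plan could look roughly like the sketch below. It turns tab-separated lines (Kurmanji lemma, English lemma, part-of-speech tag) into bilingual dictionary entries. The input layout, the function names, and the choice of Kurmanji on the left side are assumptions made for illustration; real entries often need more than one tag per side and would have to be checked against the pair's existing bidix conventions.

```python
from xml.sax.saxutils import escape


def _side(lemma: str, pos: str) -> str:
    # Escape XML special characters, then mark spaces in multiword
    # lemmas with <b/>, as .dix files do.
    text = escape(lemma).replace(" ", "<b/>")
    tag = escape(pos, {'"': "&quot;"})
    return f'{text}<s n="{tag}"/>'


def bidix_entry(left: str, right: str, pos: str) -> str:
    """Build one <e> element with the same tag on both sides (a simplification)."""
    return f"<e><p><l>{_side(left, pos)}</l><r>{_side(right, pos)}</r></p></e>"


def entries_from_tsv(lines):
    """Yield one entry per well-formed 'left<TAB>right<TAB>pos' line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or not all(fields):
            continue  # skip blank or malformed lines instead of guessing
        yield bidix_entry(*fields)


if __name__ == "__main__":
    sample = ["mal\thouse\tn\n", "not a valid line\n"]
    for entry in entries_from_tsv(sample):
        print(entry)
```

Skipping malformed lines rather than repairing them keeps the script from silently adding wrong entries; a real version would probably log them for manual review.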

Work Plan

Week 1: Adding words, paradigms, translations. Writing scripts for practical use in developing the pair.
Week 2-4: Continue adding words, paradigms, translations; adding transfer rules
Week 5-8: POS tagging, constraint grammar, more transfer rules
Week 9-12: Lexical selection, transfer rules.

Deliverables

Naive coverage of 95%
WER reduction of 40%
Kurmanji constraint grammar
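
As a rough illustration of the first deliverable, naive coverage can be estimated from morphological analyser output as the share of tokens that receive at least one analysis. The sketch assumes Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis begins with `*`. The function name is hypothetical, the sample string is hand-written rather than real analyser output, and escaped `^` or `$` characters are not handled.

```python
import re

# One lexical unit in stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Fraction of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        raise ValueError("no lexical units found in input")
    # Unknown words are marked with '*' directly after the first slash.
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


if __name__ == "__main__":
    # Hand-written sample: two tokens with an analysis, one unknown token.
    sample = "^mal/mal<n>$ ^xyz/*xyz$ ^./.<sent>$"
    print(f"{naive_coverage(sample):.2%}")
```

The figure is "naive" because it only checks that some analysis exists, not that the analysis is the correct one.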

Qualification

I am a third-year computer engineering student at Istanbul Technical University and a member of the ITU NLP team. I have been working on converting the ITU Turkish Treebanks to the Universal Dependencies format, and have co-written a paper on multiword expressions (MWEs) in Turkish.
