User:Darthxaher/Application

Google Summer of Code Application 2009 Abu Zaher Md. Faridee

Name:

Abu Zaher Md. Faridee

Email Address:

zaher14@gmail.com

Contact Information:

IRC: darthxaher@irc.freenode.net

Jabber: zaher14@gmail.com

Yahoo: darthxaher

Cell Phone: +880 1714070147

Why is it you are interested in machine translation?

During my years as a Computer Science undergraduate, I have come across many interesting subjects such as Automata Theory and Compilers. Next semester I'll be taking courses in Pattern Recognition and Machine Learning. Machine Translation and Natural Language Processing are among the best applications in which I could hone the skills gained from these subjects. I'm also planning to write my undergraduate thesis next year on Machine Translation, and it could benefit a lot from this project.

Why is it that you are interested in the Apertium project?

I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on the Bengali localization of open source software.

A few days spent with the Apertium community have made me realise that the people here are both knowledgeable and helpful. That is another reason for me to be here.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in “Conversion of Anubadok”.

Anubadok is an experimental English-to-Bengali machine translation system developed by G M Hossain. While it works fairly well, a lot of work remains to be done. My idea is to port the existing Anubadok system to the Apertium framework. I plan to implement a Bengali morphological generator. The tagging system also needs to be standardized, after which a new transfer system can be written. I think Apertium's flexible framework will allow me to implement a language pair as good as Anubadok. Further enhancements could then be carried out after GSoC to make it a full-fledged system.

Why should Google and Apertium sponsor it?

From what I have seen, most of the language pairs currently available are for European languages. Very little work has been done for Indic languages. Indic languages tend to differ from European languages, and many things need to be standardized. My focus would be on building an English-Bengali language pair. Building this pair will eventually contribute to creating other European-Indic and Indic-Indic language pairs.

How and who will it benefit in society?

I'm from Bangladesh, where Bengali is spoken as the first language. Many people don't know English and are therefore cut off from the benefits of Information Technology. Localization of open source software is a great way to bring them closer to those benefits. I believe my idea, if successfully implemented, will go a long way towards the socio-economic development of my people.

I have come to know that Wikipedia uses Apertium to translate articles from English[4], especially for minor languages. There is no option to translate to Bengali, but many Bengali Wikipedians currently use Anubadok to translate articles and then revise them manually. I think an English-Bengali translation system will ease the work of a lot of Wikipedians.

Work plan

I have talked with mentor Francis Tyers a couple of times regarding this idea. The three major components are a morphological generator, a full bilingual dictionary and a new transfer system. My first priority would be to create the morphological generator. This could be accomplished by re-coding Anubadok's[3] generator, or by using the one from CRBLP[2], where a substantial amount of work has already been done. Both options are viable, as both are licensed under the GPL. After creating a full bilingual dictionary, I'd gradually move on to creating the transfer system.

Week 1

Thorough check of the Apertium and Anubadok code bases as well as CRBLP's analyzer/generator[7] (if we can get our hands on it).

Gathering thorough knowledge of how both systems work and how they differ.

Requirement analysis on how to adopt Anubadok's / CRBLP's analyzer into Apertium (CRBLP's existing analyzer would be the first priority; otherwise we'll fall back to Anubadok).

Deliverable: A report on Anubadok's/CRBLP's morphological generation system and how to adopt it into Apertium.

Week 2

Identification of tags.

Creation of a Bengali monodix by selecting closed-category words (noun genders and multiwords) and open-category words.

Note: Choosing the most frequent words will require analyzing a language corpus. It can be done by either

a. Running a script on Bengali Wikipedia articles or

b. Reusing the existing corpus from CRBLP's research[6] on the Prothom Alo newspaper (if possible)

Our goal is 70% word coverage.

Deliverable: A preliminary version of the Bengali monodix.
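The corpus analysis described in the note for this week could be scripted roughly as follows. This is only a minimal sketch under my own assumptions: the corpus is available as plain Unicode text, a word is taken to be any run of characters from the Bengali Unicode block, and the function names `word_frequencies` and `words_for_coverage` are hypothetical, not part of Apertium, Anubadok or CRBLP's tools.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of characters in the Bengali Unicode
# block (U+0980 to U+09FF). Zero-width joiners and other refinements
# of Bengali tokenization are deliberately ignored in this sketch.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")


def word_frequencies(text):
    """Count how often each Bengali word form occurs in the text."""
    return Counter(BENGALI_WORD.findall(text))


def words_for_coverage(freqs, target=0.70):
    """Return the most frequent words whose combined count reaches
    the target share of all word tokens in the corpus."""
    total = sum(freqs.values())
    chosen = []
    covered = 0
    for word, count in freqs.most_common():
        if total == 0 or covered / total >= target:
            break
        chosen.append(word)
        covered += count
    return chosen


if __name__ == "__main__":
    sample = "আমি ভাত খাই আমি ভাত আমি"
    freqs = word_frequencies(sample)
    print(words_for_coverage(freqs, target=0.70))
```

Note that this counts surface forms, not lemmata, so the resulting list would still need to be grouped by lemma before entries are added to the monodix.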

Week 3

Initial work on the Bengali morphological generator.

Continue working on the Bengali monodix.

Deliverable: Updated monodix.

Week 4

More work on the Bengali morphological generator.

Finish working on the Bengali monodix.

Start working on the English to Bengali bidix.

Note: Start with about 1500 lemmata for the bidix, then gradually extend it to full coverage of the monodix.

Deliverable: Finished Bengali monodix.

Week 5

Continue working on the English to Bengali bidix.

Finish the Bengali morphological generator.

Deliverable: Finished morphological generator.

Week 6

Continue working on the English to Bengali bidix.

Week 7

Manual checking of the English to Bengali bidix.

Preliminary analysis for the transfer system (including chunking to rearrange the word order).

Deliverable: English to Bengali bidix.

Week 8

Refine the transfer system.

Week 9

Refine the transfer system.

Week 10

Finish working on the transfer system.

Start testvoc.

Deliverable: A working transfer system.

Week 11

Peer review. Testvoc done.

Comparison of quality with the original Anubadok program.

Deliverable: Preliminary Release

Week 12

Evaluation/ Cleanup

Deliverable: Finished Product

List your skills and give evidence of your qualifications

Right now I'm in the 2nd semester of my 3rd year as an undergraduate in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have attended both theoretical and practical courses in Algorithms, Automata Theory and Compilers, and believe that I have adequate theoretical knowledge for this project.

I have been an open source advocate in my country since my college years. Since then I have been working with Ankur[1], a non-profit organization, with whom I have conducted numerous open source camps. I participated in creating and beta testing several Ankur products, notably the Firefox spell-checking dictionary and the off-line add-on CD for several localized Ubuntu versions.

I maintain a public blog[5]. Although it's not a day-to-day blog, I try to update it, when I get free time, with the interesting things I come across.

I have been working as a part-time software developer for a local company for the last 2 years. There I mainly worked in high-level languages such as Python, PHP and Java EE, and had to do a lot of system administration on Linux, Solaris and Mac OS X.

I believe I am highly competent to accomplish this project within the proposed timeline.

List any non-Summer-of-Code plans you have for the Summer

I don't have any other plans besides Google Summer of Code this summer. My class schedule does overlap with GSoC's schedule, but I don't think it will conflict with the work plan. I can always do extra work on weekends to compensate for the overlap.

References

[1] http://www.ankur.org.bd/wiki/People

[2] http://www.bracuniversity.net/research/crblp/index.php

[3] http://anubadok.sf.net/

[4] http://meta.wikimedia.org/wiki/Wikipedia_Machine_Translation_Project

[5] http://zaher14.blogspot.com/

[6] http://www.bracuniversity.net/research/crblp/nlpcourse/report/corpus_report.pdf

[7] http://www.bracuniversity.net/research/crblp/nlpcourse/report/MORPHOLOGY_report.pdf