User:Darthxaher/Application2009

Google Summer of Code Application 2009
Abu Zaher Md. Faridee
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology


1 Name

Abu Zaher Md. Faridee

2 Email Address

zaher14@gmail.com

3 Contact Information

IRC: darthxaher@irc.freenode.net

Cell Phone: +880 1714070147

4 Why is it you are interested in machine translation?

During my years as a Computer Science undergraduate, I have come across many interesting subjects such as Automata Theory and Compilers. Next semester I’ll be taking courses in Pattern Recognition and Machine Learning. Machine translation and natural language processing are among the best applications in which I could hone the skills gained from these subjects. I’m also planning to write my undergraduate thesis on machine translation next year, and it could benefit a lot from this project.

5 Why is it you are interested in the Apertium Project?

I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on the adoption and Bengali localization of open source software.

The few days I have spent with the Apertium community have made me realize that people here are both knowledgeable and helpful. That is another reason for me to be here.

6 Which of the published tasks are you interested in? What do you plan to do?

I’m interested in the Conversion of Anubadok task, which essentially means building a new English-Bengali language pair.

Anubadok is an experimental English-to-Bengali machine translation system developed by G M Hossain. While it works fairly well, a lot of work still needs to be done. My idea is to port the existing Anubadok system to the Apertium framework. I plan to implement a Bengali morphological generator. The tagging system also needs to be standardized, after which a new transfer system can be written. I think Apertium’s flexible framework will allow me to implement a language pair as good as Anubadok. Further enhancements could then be carried out after GSoC to make it a full-fledged system.

It should be noted that this project focuses only on translation from English to Bengali. Bengali to English is out of scope for now.

7 Why should Google and Apertium Sponsor it?

From what I have seen, most of the language pairs currently available in Apertium are for European languages. Very little work has been done on Indic languages. Indic languages tend to differ from European languages, and many things need to be standardized. My focus would be on building an English-Bengali language pair. Building this pair will eventually contribute to the creation of other European-Indic and Indic-Indic language pairs.

8 How and who will it benefit in society?

I’m from Bangladesh, where Bengali is spoken as the first language. Many people don’t know English and are therefore cut off from the benefits of information technology. Localization of open source software is a great way to bring those benefits closer to them. I believe that my idea, if successfully implemented, will go a long way towards the socio-economic development of my people.

I have learned that Wikipedia uses Apertium to translate articles from English[1], especially for minor languages. There is no option to translate into Bengali. However, many Bengali Wikipedians currently use Anubadok to translate articles and then revise them manually. I think an English-Bengali translation system will ease the work of many Wikipedians.

9 Work Plan

I have been keeping in contact with mentor Francis Tyers regarding this idea. The three major components are:

-A morphological generator for Bengali

-A full bilingual dictionary

-A transfer system

I have been in contact with several experts in this field who have worked on English-to-Bengali translation. Dr. G M Hossain, the author of Anubadok[2], has given me a lot of insight into Anubadok. I also personally met Dr. Mumit Khan, head of CRBLP[3]. CRBLP has been working on several Bengali language research issues and has done some work on machine translation too. All of their work has been released under the GPL.

Francis and I were hoping that they could provide us with a morphological generator, but they do not seem to have implemented one of their own. I was, however, able to get the results of their corpus analysis of the Prothom Alo newspaper[4], which include a list of the most frequently used words. We can take the first several thousand as a start. They have also given me links to some other people who might have worked on a Bengali morphological analyzer/generator, but so far I haven’t heard from them.

Keeping all these findings and GSoC’s tight schedule in mind, I have decided that my first priority will be to create the morphological generator. This could be accomplished by re-implementing Anubadok’s rule-sets. I’d then move on to the transfer system after creating a full bilingual dictionary.

I intend to follow this schedule.

Community Bonding Period (April 20 - May 22)

-Familiarizing myself with Apertium’s tool-chain and its community

-Thorough review of the Apertium and Anubadok code-bases.

-Gathering thorough knowledge of how both systems work and how they differ.

-Requirement analysis on how to adapt Anubadok’s generator to Apertium.

-Try to come up with:

-A set of tags that will be needed for the morphological generator, apart from the default ones in Apertium (Anubadok’s tag-set is incompatible with Apertium)

-The rule-sets that Anubadok uses for morphological generation

-A set of transfer rules that will be used for the transfer system

Deliverable: A preliminary skeleton of the apertium-en-bn package

Note: If we hear something from CRBLP about any analyzer/generator, we’ll take that into consideration too.

Week 1 (May 23 - May 29)

-Start creating the Bengali monodix by selecting closed-category words (conjunctions, auxiliary verbs, determiners, etc.)


Note: Choosing the most frequent words will require analyzing a language corpus. This can be done by:

-Reusing the existing corpus from CRBLP research[5] on the Prothom Alo newspaper and, if needed,

-Running a script on Bengali Wikipedia articles

Our goal is to reach 70% word coverage.
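As a rough illustration of the coverage goal, the following minimal Python sketch counts word frequencies and picks the most frequent words until a target share of the corpus tokens is covered. It is only a sketch under stated assumptions: the function name `words_needed_for_coverage` is hypothetical, the corpus is assumed to be already tokenized into a list of words, and the toy token list stands in for real Prothom Alo or Bengali Wikipedia text.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target=0.70):
    """Pick the most frequent words until they cover `target` of all tokens.

    `tokens` is assumed to be an already tokenized corpus (a list of words).
    Returns the selected words and the token coverage they actually reach.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0

    selected = []
    covered = 0
    # most_common() yields (word, frequency) pairs in descending frequency.
    for word, freq in counts.most_common():
        if covered / total >= target:
            break
        selected.append(word)
        covered += freq
    return selected, covered / total


# Toy corpus standing in for real Bengali text (hypothetical data).
tokens = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
words, coverage = words_needed_for_coverage(tokens, target=0.70)
print(words, coverage)
```

On a real corpus, the returned word list would be a candidate set of entries for the monodix, and the returned coverage can be compared with the 70% goal.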

Deliverable: A preliminary version of the Bengali monodix.

Week 2 (May 30 - June 5)

-Initial work on the Bengali morphological generator.

-Continue working on the Bengali monodix, now for open-category words (nouns, verbs, adjectives, etc.)

Deliverable: Updated monodix.

Week 3 (June 6 - June 12)

-More work on the Bengali morphological generation rules

-Finish working on the Bengali monodix.

Deliverable: Finished Bengali monodix.

Week 4 (June 13 - June 19)

-Start working on the English to Bengali bidix

-Update the morphological generator

Note: Start with about 1500 lemmata for the bidix and gradually work up to full coverage of the monodix.

Week 5 (June 20 - June 26)

-Continue working on the English to Bengali bidix

-Finish the Bengali morphological generator.

Deliverable: Finished morphological generator

Week 6 (June 27 - July 3)

-Continue working on the English to Bengali bidix.

Week 7 (July 4 - July 10)

-Manual checking of the English to Bengali bidix.

-Preliminary analysis for the transfer system (including chunking to rearrange the word order)

Deliverable: English to Bengali bidix.

Mid-term evaluation of the morphological generator and the English to Bengali bidix

Week 8 (July 11 - July 17)

-Refine the transfer system

Week 9 (July 18 - July 24)

-Refine the transfer system, add adequate lexical selection rules

Week 10 (July 25 - July 31)

-Finish working on the transfer system

-Start running testvoc

Deliverable: A working transfer system.

Week 11 (August 1 - August 7)

-Peer review (testvoc done).

-Comparison of quality with the original Anubadok program

Deliverable: Preliminary Release

Week 12 (August 8 - August 14)

-Evaluation/ Cleanup

Deliverable: Finished Product

Week 13 (August 15 - August 16)

-Reserved only for emergencies

10 List your skills and give evidence of your qualifications

Right now I’m in the 2nd semester of the 3rd year of my undergraduate degree in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have taken both theoretical and practical courses in Algorithms, Automata Theory and Compilers, and I believe that I have the basic theoretical knowledge for this project.

I have been an open source advocate in my country since my college years. Since then I have been working with Ankur[6], a non-profit organization, with whom I have conducted numerous open source camps. I participated in creating and beta testing several Ankur products, notably the Firefox spell-checking dictionary and the off-line add-on CD for several localized Ubuntu versions.

I’m the developer of several open source applications. Netaccess-squid[7] was created as an open source alternative to the Cyberoam System[8]. Aural Aurora[9] is a Spring[10]/Oracle based music discovery and social collaboration system.

I worked for 2 years as a part-time software developer at AfriGISBD, an offshore development house of AfriGIS[11]. There I mainly worked in high-level languages such as Python, PHP and Java EE, and had to do a lot of system administration on Linux, Solaris and Mac OS X. I’m currently employed (part-time) at MuktoSoft[12], where I’m working on iPhone-based software.

I maintain a public blog here[13]. Although it’s not a day-to-day blog, I try to update it, when I get free time, with the interesting technical things I come across.

My resume can be viewed here.

I have already been doing some experiments for the Apertium en-bn language pair and keeping in contact with the developers. Maintaining good communication among developers is a must-have quality in any type of open source software development. I therefore believe I am highly competent to accomplish this project within the proposed time-line.

11 List any non-Summer-of-Code plans you have for the Summer

I don’t have any other plans besides Google Summer of Code this summer. My class schedule does overlap with GSoC’s schedule, but I don’t think it will conflict with the work plan. I can always do extra work at the weekend to minimize the effect of the overlap.

12 Conclusion

I’d like to thank all the developers of Apertium for putting in such a great effort. I’d also like to thank Google for organizing Google Summer of Code and helping the open source community flourish.