User:Darthxaher/Application2010

Google Summer of Code Application 2010
Abu Zaher Md. Faridee
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology


Name

Abu Zaher Md. Faridee

Email Address

zaher14@gmail.com

Contact Information

IRC: darthxaher@irc.freenode.net

Cell Phone: +880 1714070147

Why is it you are interested in machine translation?

As a student of Computer Science, I'm personally very interested in the fields of Artificial Intelligence, Machine Learning and Pattern Recognition. I think machine translation is one of the most exciting applications in these fields. The most interesting thing about machine translation is how fundamentally different the various MT techniques are. Whereas rule-based machine translation relies extensively on automata theory and pattern matching, the statistical machine translation approach harnesses statistics and information theory. There has been extensive work in this field in the last decade, and there is much still to be done.

Working on machine translation also involves the unique bonus of getting to know a lot of different languages and cultures, which is its own reward.

Why is it you are interested in the Apertium Project?

I successfully completed my last Google Summer of Code project (2009), titled 'Conversion of Anubadok: Creating an English Bengali Language Pair', under Apertium. The project was a great experience: I got to work with some of the experts in rule-based machine translation. Although I was very interested in the field, my knowledge of machine translation was limited at the start. During the course of the project I got the chance to understand the intricacies of RBMT through my mentor and Apertium's helpful community. It goes without saying that Apertium's community is one of the most active open source communities out there, and I really feel at home here.

I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on the adoption of the Bengali language locally and on the localization of open source software.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the 'VM for the transfer module' idea, that is, creating a virtual machine for the transfer stage in Apertium's pipeline.

As already mentioned on the idea's page, Apertium currently uses XML tree walking in the transfer stage, the stage in which Apertium applies structural changes to sentences. This is quite inefficient, as XML parsing is time consuming. The idea is to define a set of pseudo-assembly-level mini instructions that embody the rules stated in the XML files (t1x, t2x, t3x), then compile them to an easy-to-use byte-code format. A tiny and highly optimized virtual machine would need to be written to run the byte-code. Even a VM without JIT optimization could perform several orders of magnitude better than the existing XML-based solution.
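To make the idea concrete, here is a minimal sketch of what such a VM loop could look like in Python. Everything in it is an assumption made for illustration: the opcode names (PUSH, CLIP, STORE, EQ, JMPF, OUT, HALT), the (opcode, operand) instruction layout and the dictionaries standing in for lexical units are hypothetical and are not part of Apertium or of the final design.

```python
# Hypothetical opcodes; the real instruction set has not been designed yet.
PUSH, CLIP, STORE, EQ, JMPF, OUT, HALT = range(7)


def run(code, words):
    """Run `code`, a list of (opcode, operand) pairs, on `words`.

    `words` is a list of dicts standing in for the lexical units
    matched by one transfer rule. Returns the emitted units in order.
    """
    stack, output, pc = [], [], 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == PUSH:            # push a constant
            stack.append(arg)
        elif op == CLIP:          # read an attribute: arg = (position, attribute)
            pos, attr = arg
            stack.append(words[pos].get(attr, ""))
        elif op == STORE:         # write the top of the stack into an attribute
            pos, attr = arg
            words[pos][attr] = stack.pop()
        elif op == EQ:            # compare the two topmost values
            right, left = stack.pop(), stack.pop()
            stack.append(left == right)
        elif op == JMPF:          # jump to index `arg` if the top value is false
            if not stack.pop():
                pc = arg
        elif op == OUT:           # emit a copy of the unit at position `arg`
            output.append(dict(words[arg]))
        elif op == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return output


# Hand-compiled stand-in for one rule: if the first unit is feminine,
# make the second unit agree with it, then emit the two units swapped.
program = [
    (CLIP, (0, "gen")),   # 0
    (PUSH, "f"),          # 1
    (EQ, None),           # 2
    (JMPF, 6),            # 3: skip the agreement step
    (PUSH, "f"),          # 4
    (STORE, (1, "gen")),  # 5
    (OUT, 1),             # 6
    (OUT, 0),             # 7
    (HALT, None),         # 8
]

result = run(program, [{"lem": "noun", "gen": "f"}, {"lem": "adj", "gen": "m"}])
print(result)
```

The point of the sketch is that once a rule has been compiled, applying it is a flat loop over small instructions, with no XML tree to walk at run time.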

Why should Google and Apertium Sponsor it?

The existing architecture of Apertium is very robust and fast, but it should be faster.

How and who will it benefit in society?

Work Plan [Messed up, need heavy fix]

I have been keeping in touch with Sergio, Francis and Jim regarding the details and plan for this project. So far I've noted the following things will need to be done:

  • Create a Python prototype for the VM:

This will be the testbed for brainstorming. Initially, I think it would be wise to stick to the pre-transfer stage only.

  • Port the VM into C++:

After implementing the Python prototype, we'll have a clear view of the needed data structures, instructions and byte-code format.
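As a starting point for the byte-code format discussion, the sketch below packs instructions into fixed-width binary records with Python's standard `struct` module. The layout (a 1-byte opcode followed by a signed 16-bit little-endian operand) and the names `INSTR`, `encode` and `decode` are assumptions chosen only for illustration; operands are plain integers here, for example indexes into a constant table.

```python
import struct

# Hypothetical record layout: unsigned 1-byte opcode + signed 16-bit operand,
# little-endian with no padding, so every instruction takes 3 bytes.
INSTR = struct.Struct("<Bh")


def encode(program):
    """Serialize a list of (opcode, operand) integer pairs to bytes."""
    return b"".join(INSTR.pack(op, arg) for op, arg in program)


def decode(blob):
    """Turn bytes produced by encode() back into (opcode, operand) pairs."""
    return [INSTR.unpack_from(blob, offset)
            for offset in range(0, len(blob), INSTR.size)]


sample = [(0, 5), (1, -2), (6, 0)]
blob = encode(sample)
print(len(blob), decode(blob) == sample)
```

A fixed-width record keeps the C++ port simple, since the VM can index straight into the buffer; a variable-width encoding would be more compact but needs a decoding step. Which trade-off is right is exactly what the prototype should tell us.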

Community Bonding Period

Week 1: April 27 - May 2

Week 2: May 3 - May 9

Week 3: May 10 - May 16

Week 4: May 17 - May 23

Coding Period

May 24 - May 30

May 31 - June 6

June 7 - June 13

June 14 - June 20

June 21 - June 27

June 28 - July 4

July 5 - July 11

July 12 - July 18

July 19 - July 25

July 26 - August 1

August 2 - August 8

August 9 - August 16

List your skills and give evidence of your qualifications [copy-paste last year]

As I've already mentioned, I successfully completed my Google Summer of Code project titled 'Conversion of Anubadok: Creating an English Bengali Language Pair' under Apertium last year. It was a really ambitious project, given that there was little linguistic data available for Bengali other than another open source machine translation project called Anubadok. The project had three stages: building a Bengali morphological generator, creating an English to Bengali bilingual dictionary and then creating a transfer system. Building the morphological analyzer/generator proved to be tougher than we originally anticipated, as Apertium needs more information for each lexical category than was included in Anubadok's data. As a result, by the end of the project we had a morphological analyzer with 68% coverage of the 20 thousand most used words. The post-GSoC report can be viewed from here.

The project was followed up by a successful paper submission at freeRBMT09 by me and my mentor Francis Tyers. The paper can be accessed from here.

Right now I’m in the 2nd semester of the 4th year of my undergraduate degree in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have attended both theoretical and practical courses in Algorithms, Automata Theory and Compilers, and I believe that I have the basic theoretical knowledge for this project.

I have been an open source advocate in my country since my college years. I have been working with Ankur[1], a non-profit organization, since then. With them I have conducted numerous open source camps. I participated in creating and beta testing several products from Ankur, notably a Firefox spell-checking dictionary and an off-line add-on CD for several localized Ubuntu versions.

I’m the developer of several open source applications. Netaccess-squid[2] was created as an open source alternative to Cyberoam System[3]. Aural Aurora[4] is a Spring[5]/Oracle based music discovery and social collaboration system.

I worked for 2 years as a part-time software developer for AfriGISBD, an off-shore development house of AfriGIS[6]. There I mainly worked in high-level languages like Python, PHP and Java EE, and had to do a lot of system administration on Linux, Solaris and Mac OS X. I’m currently employed (part-time) at MuktoSoft[7], where I’m working on iPhone based software.

I maintain a public blog here[8]. Although it's not a day-to-day blog, I try to keep it updated, when I get free time, with the interesting technical things I come across.

My resume can be viewed from here.

List any non-Summer-of-Code plans you have for the Summer

I don’t have any other plans besides Google Summer of Code this summer. My class schedule does overlap with GSoC’s schedule, but I don’t think it will conflict with the work plan. I can always do extra work on weekends to make up for the overlap.

Conclusion

I’d like to thank all the developers of Apertium for putting in such a great effort. I’d also like to thank Google for organizing Google Summer of Code and helping the open source community flourish.