Difference between revisions of "User:Darthxaher/Application2010"

From Apertium
(Work Plan: Weeks)
 
(120 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Name ==
<center>Google Summer of Code Application 2009</center>


<center>Abu Zaher Md. Faridee</center>
Abu Zaher Md. Faridee

<center>Department of Computer Science and Engineering</center>


== Affiliation ==
<center>Bangladesh University of Engineering and Technology</center>


Final year undergraduate student, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology


== Email Address ==
'''1 Name'''

Abu Zaher Md. Faridee

'''2 Email Address'''


[mailto:zaher14@gmail.com zaher14@gmail.com]


'''3 Contact Information'''
== Contact Information ==


IRC: [mailto:darthxaher@irc.freenode.net darthxaher@irc.freenode.net]
Line 22: Line 17:
Cell Phone: <tt>+880 1714070147</tt>


'''4 Why is it you are interested in machine translation? '''
== Why is it you are interested in machine translation? ==


As a student of Computer Science, I'm personally very interested in fields of Artificial Intelligence, Machine Learning and Pattern Recognition. I think machine translation is one of the most exiting applications in this field. The most interesting thing about Machine Translation is how fundamentally different the various MT techniques are. Whereas rule bases machine translation relies upon extensively on automata theory and pattern matching, Statistical machine translation approach harnesses the essence of statistics and information theory. There have been extensive work in this field in the recent decade and there is much to be done.
As a student of Computer Science, I'm personally very interested in the fields of artificial intelligence, machine learning and pattern recognition. I think machine translation is one of the most exciting applications in these fields. The most interesting thing about machine translation is how fundamentally different the various techniques are. Whereas rule-based machine translation relies extensively on automata theory and pattern matching, the statistical approach harnesses statistics and information theory. There has been extensive work in this field in recent decades, and there is still much to be done, posing new challenges to developers every day.


Working on machine translation also involves the unique bonus of getting to know a lot of different languages and cultures, which is its own reward.
Working on machine translation also carries the unique bonus of getting to know many different languages, histories and cultures, which is its own reward.


'''5 Why is it you are interested in the Apertium Project? '''
== Why is it you are interested in the Apertium Project? ==


I successfully completed my last Google Summer of Code project (2009) titled 'Conversion of Anubadok: Creating an English Bengali Language Pair' under Apertium. The project was a great experience for me. I had the wonderful experience of working with some of the experts in rule based machine translation technique. Though quite interested in working in this field, my knowledge on machine translation was not that much great. But during the course of the project I got the chance to understand the intricate things of RBMT through my mentor and Apertium's helpful community. It goes without saying that Apertium's community is one of the most active open source communities out there and here I really feel at home.
I successfully completed my last Google Summer of Code project (2009), titled '''''Conversion of Anubadok: Creating an English Bengali Language Pair''''', under Apertium. The project was a great experience for me. I had the wonderful opportunity to work with some of the experts in rule-based machine translation. Though I was quite interested in this field, my knowledge of machine translation was limited. During the course of the project, however, I got the chance to understand the intricate details of rule-based machine translation through my mentor and Apertium's helpful community. Apertium's community is one of the most active open source communities out there, and I really feel at home here.

One of the most exciting features of Apertium's development philosophy is its focus on endangered and less-resourced languages. In a globalized world, it is hard to maintain the cultural identity of minority groups, and many cultures are on the verge of extinction. The machine translation work Apertium offers for less-resourced languages makes it not only a challenging software project but also a great humanitarian effort in its own right.


I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on local Bengali language adoption and the localization of open source software.


'''6 Which of the published tasks are you interested in? What do you plan to do? '''
== Which of the published tasks are you interested in? What do you plan to do? ==


I'm interested in 'VM for the transfer module' idea, that is creating a virtual machine for the transfer stage in Apertium's pipeline.
I'm interested in the '''''VM for the transfer module''''' idea, that is, creating a virtual machine for the structural transfer stage in Apertium's pipeline.


As already mentioned in the idea's page, Apertium currently uses XML tree walking in the transfer stage, the stage in which Apertium brings forth the structural changes in the sentences. This is quite inefficient as XML parsing is quite time consuming. The idea is to create a pseudo-assembly level mini instructions that embodies the rules stated in the XML files (t1x, t2x. T3x), then compile them to a easy to use byte-code format. A tiny and highly optimized Virtual Machine would need to be written to run the byte-code. Even a non JIT optimized VM could achieve several magnitude of performance over existing XML based solution.
As already mentioned on the idea's page, Apertium currently uses XML tree walking in the structural transfer stage. This is quite inefficient, as XML parsing consumes considerable CPU time (especially for languages involving three-stage transfer). The idea is to create a pseudo-assembly-level instruction set that embodies the rules stated in the XML files (t1x, t2x, t3x), then compile them to an easy-to-use byte-code format. A tiny, optimized virtual machine would need to be written to run the byte-code and do the actual structural transfer work. Even a byte-code-driven VM without JIT optimization could perform several orders of magnitude faster than the existing XML-based solution.


'''7 Why should Google and Apertium Sponsor it? '''
== Why should Google and Apertium Sponsor it? ==


The existing architecture of Apertium is very robust and already quite fast, but there are rooms to make it faster. The demand of faster and efficient machine translation system now is more than ever. Different machine translation systems are used everyday in the web and each day there is an ever growing load on the web servers running them. Apertium already has a highly scalable web architecture, but its performance will always be dependent on the core of Apertium's engine. The current project idea promises several magnitude of performance increase over the current implementation of the core engine by removing the bottleneck, thus freeing up more CPU time for the servers and allowing more service for the end users.
The existing architecture of Apertium is very robust and fast, but it should be faster.


'''8 How and who will it benefit in society? '''
== How and who will it benefit in society? ==


Every year the European Union spends approximately one billion euros translating documents between the languages of the European nations<ref>http://www.independent.co.uk/news/world/europe/cost-in-translation-eu-spends-83641bn-on-language-services-407991.html</ref>. While recent advancements in machine translation have somewhat eased manual translation work, the field has not reached the maturity to work without human intervention. On the other hand, most state-of-the-art translation systems are still closed source and costly. This makes them unsuitable for large-scale deployment in this kind of government endeavour. Open source solutions like Apertium are better options, as they are cost-effective and flexible, and they can be readily modified to cope with ongoing demand. Every single step that increases the performance, efficiency or correctness of Apertium will therefore have a long-term effect on its worldwide adoption and potentially save millions of dollars.
'''9 Work Plan''' [Messed up, need heavy fix]


== Work Plan ==
I have been keeping in touch with Sergio, Francis and Jim regarding the details and plan for this project. So far I've noted the following things will need to be done:


I have been keeping in touch with mentors Sergio Ortiz Rojas, Francis Tyers and Jimmy O'Regan regarding the details and plan for this project. So far I've noted the following things will need to be done:
* Create a python prototype for the VM:


* Separate the lexical transfer module from the structural transfer module without breaking the existing system. This part is necessary because the VM will take word pairs [source language, target language] as input, so we need something that can feed the VM that data. '''''(Note: I have been notified by mentor Ortiz Rojas that he will take care of this part)'''''
This will be the testbed for brainstorming. Primarily I think only sticking with the pre-transfer stage will be a wise decision.

* Define the instruction-set and data structure for the pseudo assembly

* Define the byte-code format

* Create a python prototype for the stack based virtual machine:

: This will be the test-bed for brainstorming and R&D. Initially, I think it would be wise to stick with the first-level transfer stage only.

: By the end of this stage we should be able to generate -

::* Text pseudo-assembly from the transfer XML file (first only t1x, later extended to t2x and t3x)
::* Byte-code from the text pseudo-assembly i.e. compile the text file


* Port the VM into C++:


After implementing the python prototype, we'll have a clear view of the needed data structures, instructions and byte-code format.
: After implementing the Python prototype, we'll have a clear view of the needed data structures, instructions and byte-code format. If desired, JIT optimization techniques could be explored in the later part of this phase.
=== Community Bonding Period ===

As I'm already quite familiar with both the code and community of Apertium, I think I could start coding right after the community bonding period begins. That gives me a total of 16 weeks to get the project done. I'd like to use only the first week for reviewing the code structure of Apertium and Lt-toolbox and reserve the last 3 weeks for debugging, polishing, testing and any other emergencies. So that leaves me with 12 weeks for the coding phase.

==== Week 1: April 27 - May 2 ====

* Review the code structure of Apertium and Lt-toolbox involved in the lexical transfer and structural transfer modules. Also come up with a set of test cases (at first, small t1x files and accompanying input) of increasing complexity (later including t2x and t3x) for a preferred language pair. These test cases will be the basis for adding more and more complex instructions to the Python prototype. Previous work by Jacob Nordfalk will be an excellent place to start.

'''Deliverable:''' Set of test cases

==== Week 2: May 3 - May 9 ====

Define a very minimal set of instructions and data structures for the prototype VM, and come up with as simple a schematic as possible for the Python implementation.

'''Deliverable:''' A minimal instruction set and data structure + schematics for the python VM

==== Week 3: May 10 - May 16 ====

Start preliminary work on 't1x' to 'pseudo-assembly text' compiler.

'''Deliverable:''' A minimal compiler that can compile from 't1x' to 'pseudo-assembly' text

==== Week 4: May 17 - May 23 ====

Start coding on the python prototype VM. It will be able to run the instructions from 't1x' pseudo-assembly text file and produce the desired structurally translated output.

'''Deliverable:''' A very simple working prototype.

=== Official Coding Period ===

==== Week 5: May 24 - May 30 ====

Start working on the t2x and t3x to pseudo-assembly compiler. This will involve adding more instructions to our instruction set.

==== Week 6: May 31 - June 6 ====

Modify the VM so that now it can run t2x and t3x pseudo assembly files and produce desired output.

'''Deliverable:''' A functional compiler and VM that can handle t1x, t2x, t3x files.

==== Week 7: June 7 - June 13 ====

By this time the interpreted VM would be in a workable state. So now we add 'txx' to 'byte-code' generation mode to our compiler.

'''Deliverable:''' The Python implementation of the byte-code compiler

==== Week 8: June 14 - June 20 ====

Start adding byte-code aware mode to our VM, fix any remaining issues in the 'txx' to 'byte-code' compiler. Test this new module.

==== Week 9: June 21 - June 27 ====

Finish working on the byte-code mode VM, also explore the possibilities of JIT optimization for the VM.

'''Deliverable:''' Fully working python prototype
==== Week 10: June 28 - July 4 ====

Start primary work on the C++ implementation. The focus of this implementation is a fast and reliable VM, and a compiler that is binary compatible with the previously created Python implementation.

'''Deliverable:''' Reimplementation of the Python 'txx' to byte-code compiler in C++


==== Week 11: July 5 - July 11 ====

Start working on porting the Python VM to C++

==== Week 12: July 12 - July 18 (Mid-term Evaluation) ====

Final work on the C++ VM

'''Deliverable:''' A fully functional C++ implementation of the VM

==== Week 13: July 19 - July 25 ====

Tie up the loose ends of the C++ VM and explore the possibilities of JIT optimizations

'''Deliverable:''' A JIT optimized (if possible) VM

==== Week 14: July 26 - August 1 ====

Reserved for testing, debugging and any other emergencies

==== Week 15: August 2 - August 8 ====

Reserved for testing, debugging and any other emergencies

==== Week 16: August 9 - August 16 ====

Reserved for testing, debugging and any other emergencies

==== Final Evaluation: August 9 - August 16 ====

Final evaluation

=== Development Note ===

From my previous GSoC experience I have realized that many unexpected problems arise during a project that are not part of the original plan. In those cases I'll modify the project plan to fit our needs while keeping to the deadlines as much as possible.


'''10 List your skills and give evidence of your qualifications''' [copy-paste last year]
== List your skills and give evidence of your qualifications ==


As I've already mentioned, I successfully completed my Google Summer of Code project titled 'Conversion of Anubadok: Creating an English Bengali Language Pair' under Apertium last year. It was a really ambitious project given the fact that there was little linguistic data available for Bengali other than another open source machine translation project called Anubadok. The project had three stages, building a Bengali morphological generator, creating a English to Bengali bilingual dictionary and the creating a transfer system. Building the morphological analyzer/generator proved to be tougher than we originally comprehended as for Apertium needs more information for each lexical category which was included in Anubadok's data. Therefor by the end of the project we had a morphological analyzer with 68% coverage of the most used 20 thousand words. The post GSoC report can be viewed from [http://wiki.apertium.org/wiki/Google_Summer_of_Code/Report_2009#Conversion_of_Anubadok_.28darthxaher.29 here].
As I've already mentioned, I successfully completed my Google Summer of Code project titled '''''[http://socghop.appspot.com/gsoc/student_project/show/google/gsoc2009/apertium/t124021625239 Conversion of Anubadok: Creating an English Bengali Language Pair]''''' under Apertium last year. It was a really ambitious project, given that there was little linguistic data available for Bengali other than another open source machine translation project called Anubadok. The project had three stages: building a Bengali morphological analyser/generator, creating an English to Bengali bilingual dictionary and creating a transfer system. Building the morphological analyser/generator proved to be tougher than we originally anticipated, as Apertium needs more information for each lexical category than what was tagged in Anubadok's dataset. Therefore, by the end of the project we had a morphological analyser with 68% coverage of the most used 20 thousand words. The [http://wiki.apertium.org/wiki/Google_Summer_of_Code/Report_2009#Conversion_of_Anubadok_.28darthxaher.29 post GSoC report] discusses the outcome in further detail.


The project was followed up by a successful paper submission at freeRBMT09 by me and my mentor Francis Tyers. The paper can be accesses from here.
The project was followed up by a successful paper submission at [http://xixona.dlsi.ua.es/freerbmt09/ freeRBMT09] by me and my mentor Francis Tyers. The paper can be accessed from [http://www.mt-archive.info/FreeRBMT-2009-Faridee.pdf here]. I also tried to remain active in Apertium's community after summer of code. Several of the maintenance enhancements were carried out long after the project was officially over.


Right now I’m in my 4th year / 2nd Semester of my undergraduate in Computer Science and Engineering in Bangladesh University of Engineering and Technology. I have attended both theoretical and practical courses in Algorithm, Automata Theory and Compiler and believe that I have basic theoretical knowledge for this project.
Right now I’m in the final year of my undergraduate studies in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have attended both theoretical and practical courses in algorithms, automata theory and compilers. I took a compiler design course and, as part of the term project, successfully created a compiler for a subset of Pascal. I also attended a basic OS course, where I had to go through Linux's source code to implement a custom system call, implement some new features in 'Nachos' and solve some classic concurrency problems using POSIX threads. So I believe that I have both the basic theoretical and practical knowledge for this project.


I have been an open source advocate in my country from my college years. I have been working with Ankur<ref name="ftn5">[http://www.ankur.org.bd/wiki/People http://www.ankur.org.bd/wiki/People]</ref>, a not-profit organization since then. With them I have conducted numerous open source camps. I participated in creating and beta testing of several products from Ankur, notably Firefox spell-checking dictionary and off-line add-on CD for several localized Ubuntu versions.
I have been an open source advocate in my country since my college years. I have been working with [http://www.ankur.org.bd/wiki/People Ankur], a non-profit organization, since then. With them I have conducted numerous open source camps. I participated in creating and beta testing several products from Ankur, notably the Firefox spell-checking dictionary and an off-line add-on CD for several localized Ubuntu versions.


I’m the developer of several open source applications. Netaccess-squid<ref name="ftn6">[http://sourceforge.net/projects/netaccess-squid/ http://sourceforge.net/projects/netaccess-squid/]</ref> has been created as an open source alternative to Cyberoam System<ref name="ftn7">[http://www.cyberoam.com/productoverview.html http://www.cyberoam.com/productoverview.html]</ref>. Aural Aurora<ref name="ftn8">[http://code.google.com/p/auralaurora/ http://code.google.com/p/auralaurora/]</ref> is a Spring<ref name="ftn9">[http://www.springsource.org/ http://www.springsource.org/]</ref>/Oracle based music discovery and social collaboration system.
I’m the developer of several open source applications. [http://sourceforge.net/projects/netaccess-squid/ Netaccess-squid] has been created as an open source alternative to [http://www.cyberoam.com/productoverview.html Cyberoam System]. [http://code.google.com/p/auralaurora/ Aural Aurora] is a [http://www.springsource.org/ Spring]/Oracle based music discovery and social collaboration system.


I had worked as a part time software developer for AfriGISBD, an off-shore development house of AfriGIS<ref name="ftn10">[http://www.afrigis.co.za/ http://www.afrigis.co.za/]</ref> 2 years. I mainly worked in high level languages like python, php, javaEE there and had to do a lot of system-admin on Linux, Solaris and Mac OS X. I’m currently employed (part-time) at MuktoSoft<ref name="ftn11">[http://muktosoft.com/ http://muktosoft.com/]</ref> where I’m working on iPhone based software.
I worked as a part-time software developer for AfriGISBD, an off-shore development house of [http://www.afrigis.co.za/ AfriGIS], for 2 years. I mainly worked in high-level languages like Python, PHP and Java EE there and had to do a lot of system administration on Linux, Solaris and Mac OS X. Until last year I was also working part-time at [http://muktosoft.com/ MuktoSoft], which involved iPhone/iPod software.


I maintain a public blog here<ref name="ftn12">[http://zaher14.blogspot.com/ http://zaher14.blogspot.com/]</ref>. Although its not a day-to-day blog, I try to keep it updated when I get free time with the interesting technical things I come across.
I maintain a public blog [http://zaher14.blogspot.com/ here]. Although it's not a day-to-day blog, I try to keep it updated, when I get free time, with the interesting technical things I come across.


My resume can be viewed [http://www.box.net/shared/th752ezsml here].


'''11 List any non-Summer-of-Code plans you have for the Summer'''
== List any non-Summer-of-Code plans you have for the Summer ==


Right now, as I write this application, I'm attending my first semester final exams at university. My exams will be over in April 14 and my next semester does not start until July. So after this exam is over, I will be able to fully concentrate on my work of the project. My class schedules will overlap at the ending phase of the project, but normally there is not much pressure at the beginning of the semester. Keeping all these in mind, I plan to finish the project up as soon as possible while I can give it full attention and at the later stage do bug fixing and enhancements. And if need arises, I could always do extra work in the weekend to minimize the effect of overlapping.
I don’t have any other plans besides Google Summer of Code this summer. My class schedule does overlap with GSoC’s schedule, but I think it won’t conflict with the work plan. I could always do extra work on weekends to minimize the effect of the overlap.


'''12 Conclusion'''
== Conclusion ==


I’d like to thank all the developers of Apertium for putting in such a great effort. I’d also like to thank Google for organizing Google Summer of Code and helping the open source community flourish.

Latest revision as of 01:16, 10 April 2010

Name

Abu Zaher Md. Faridee

Affiliation

Final year undergraduate student, Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology

Email Address

zaher14@gmail.com

Contact Information

IRC: darthxaher@irc.freenode.net

Cell Phone: +880 1714070147

Why is it you are interested in machine translation?

As a student of Computer Science, I'm personally very interested in the fields of artificial intelligence, machine learning and pattern recognition. I think machine translation is one of the most exciting applications in these fields. The most interesting thing about machine translation is how fundamentally different the various techniques are. Whereas rule-based machine translation relies extensively on automata theory and pattern matching, the statistical approach harnesses statistics and information theory. There has been extensive work in this field in recent decades, and there is still much to be done, posing new challenges to developers every day.

Working on machine translation also carries the unique bonus of getting to know many different languages, histories and cultures, which is its own reward.

Why is it you are interested in the Apertium Project?

I successfully completed my last Google Summer of Code project (2009), titled 'Conversion of Anubadok: Creating an English Bengali Language Pair', under Apertium. The project was a great experience for me. I had the wonderful opportunity to work with some of the experts in rule-based machine translation. Though I was quite interested in this field, my knowledge of machine translation was limited. During the course of the project, however, I got the chance to understand the intricate details of rule-based machine translation through my mentor and Apertium's helpful community. Apertium's community is one of the most active open source communities out there, and I really feel at home here.

One of the most exciting features of Apertium's development philosophy is its focus on endangered and less-resourced languages. In a globalized world, it is hard to maintain the cultural identity of minority groups, and many cultures are on the verge of extinction. The machine translation work Apertium offers for less-resourced languages makes it not only a challenging software project but also a great humanitarian effort in its own right.

I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on local Bengali language adoption and the localization of open source software.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the 'VM for the transfer module' idea, that is, creating a virtual machine for the structural transfer stage in Apertium's pipeline.

As already mentioned on the idea's page, Apertium currently uses XML tree walking in the structural transfer stage. This is quite inefficient, as XML parsing consumes considerable CPU time (especially for languages involving three-stage transfer). The idea is to create a pseudo-assembly-level instruction set that embodies the rules stated in the XML files (t1x, t2x, t3x), then compile them to an easy-to-use byte-code format. A tiny, optimized virtual machine would need to be written to run the byte-code and do the actual structural transfer work. Even a byte-code-driven VM without JIT optimization could perform several orders of magnitude faster than the existing XML-based solution.
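To make the idea concrete, here is a minimal sketch of a stack-based interpreter in Python. The opcode names (`PUSH_WORD`, `PUSH_STR`, `CONCAT`, `OUT`, `HALT`) and the tuple-based program layout are assumptions made purely for illustration; the actual instruction set is yet to be designed.

```python
# Hypothetical opcodes; the real instruction set is still to be defined.
PUSH_WORD, PUSH_STR, CONCAT, OUT, HALT = range(5)


def run(program, words):
    """Execute a list of (opcode, operand) pairs against the matched words."""
    stack, output = [], []
    for opcode, operand in program:
        if opcode == PUSH_WORD:      # push the word at 1-based position `operand`
            stack.append(words[operand - 1])
        elif opcode == PUSH_STR:     # push a literal string
            stack.append(operand)
        elif opcode == CONCAT:       # join the top two stack entries
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == OUT:          # emit the top of the stack
            output.append(stack.pop())
        elif opcode == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return output


# A toy rule that reorders an adjective-noun pair into noun-adjective order.
swap_rule = [
    (PUSH_WORD, 2), (OUT, None),
    (PUSH_STR, " "), (OUT, None),
    (PUSH_WORD, 1), (OUT, None),
    (HALT, None),
]

result = "".join(run(swap_rule, ["red", "car"]))
print(result)
```

The point of the sketch is that the rule is executed as a flat list of instructions, so no XML has to be traversed at translation time.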

Why should Google and Apertium Sponsor it?

The existing architecture of Apertium is very robust and already quite fast, but there is room to make it faster. The demand for fast and efficient machine translation systems is greater now than ever. Different machine translation systems are used every day on the web, and each day there is an ever-growing load on the web servers running them. Apertium already has a highly scalable web architecture, but its performance will always depend on the core of Apertium's engine. The current project idea promises a performance increase of several orders of magnitude over the current implementation of the core engine by removing the bottleneck, thus freeing up more CPU time on the servers and allowing more service for end users.

How and who will it benefit in society?

Every year the European Union spends approximately one billion euros translating documents between the languages of the European nations[1]. While recent advancements in machine translation have somewhat eased manual translation work, the field has not reached the maturity to work without human intervention. On the other hand, most state-of-the-art translation systems are still closed source and costly. This makes them unsuitable for large-scale deployment in this kind of government endeavour. Open source solutions like Apertium are better options, as they are cost-effective and flexible, and they can be readily modified to cope with ongoing demand. Every single step that increases the performance, efficiency or correctness of Apertium will therefore have a long-term effect on its worldwide adoption and potentially save millions of dollars.

Work Plan

I have been keeping in touch with mentors Sergio Ortiz Rojas, Francis Tyers and Jimmy O'Regan regarding the details and plan for this project. So far I've noted the following things will need to be done:

  • Separate the lexical transfer module from the structural transfer module without breaking the existing system. This part is necessary because the VM will take word pairs [source language, target language] as input, so we need something that can feed the VM that data. (Note: I have been notified by mentor Ortiz Rojas that he will take care of this part)
  • Define the instruction set and data structures for the pseudo-assembly
  • Define the byte-code format
  • Create a Python prototype for the stack-based virtual machine:
    This will be the test-bed for brainstorming and R&D. Initially, I think it would be wise to stick with the first-level transfer stage only.
    By the end of this stage we should be able to generate:
      • Text pseudo-assembly from the transfer XML file (first only t1x, later extended to t2x and t3x)
      • Byte-code from the text pseudo-assembly, i.e. compile the text file
  • Port the VM to C++:
    After implementing the Python prototype, we'll have a clear view of the needed data structures, instructions and byte-code format. If desired, JIT optimization techniques could be explored in the later part of this phase.
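The step from text pseudo-assembly to byte-code could look roughly like the following sketch. The mnemonics, the opcode numbers and the fixed three-byte encoding (one opcode byte plus a 16-bit little-endian operand) are all assumptions for illustration, not a committed format.

```python
import struct

# Hypothetical opcode numbering, for illustration only.
OPCODES = {"push_word": 0x01, "out": 0x03, "halt": 0xFF}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def assemble(source):
    """Turn text pseudo-assembly into bytes: one opcode byte plus a 16-bit operand."""
    chunks = []
    for line in source.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        parts = line.split()
        operand = int(parts[1]) if len(parts) > 1 else 0
        chunks.append(struct.pack("<BH", OPCODES[parts[0]], operand))
    return b"".join(chunks)


def disassemble(bytecode):
    """Inverse of assemble(), returning (mnemonic, operand) pairs."""
    return [(MNEMONICS[op], arg) for op, arg in struct.iter_unpack("<BH", bytecode)]


program_text = """
push_word 2   # second matched word
out
push_word 1   # first matched word
out
halt
"""

bytecode = assemble(program_text)
print(bytecode.hex())
print(disassemble(bytecode))
```

A fixed-width encoding keeps the decoder trivial; a variable-width format would be more compact, and which trade-off suits the C++ port is something the prototype should help decide.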

Community Bonding Period

As I'm already quite familiar with both the code and community of Apertium, I think I could start coding right after the community bonding period begins. That gives me a total of 16 weeks to get the project done. I'd like to use only the first week for reviewing the code structure of Apertium and Lt-toolbox and reserve the last 3 weeks for debugging, polishing, testing and any other emergencies. So that leaves me with 12 weeks for the coding phase.

Week 1: April 27 - May 2

  • Review the parts of the Apertium and lttoolbox code that involve the lexical transfer and structural transfer modules. Also come up with a set of test cases (at first, small t1x files with accompanying input) in increasing order of complexity (later including t2x and t3x) for a chosen language pair. These test cases will be the basis for adding progressively more complex instructions to the Python prototype. Previous work by Jacob Nordfalk will be an excellent place to start.

Deliverable: Set of test cases

Week 2: May 3 - May 9

Define a minimal set of instructions and data structures for the prototype VM, and come up with the simplest possible design for the Python implementation.

Deliverable: A minimal instruction set and data structures, plus a design for the Python VM

Week 3: May 10 - May 16

Start preliminary work on the 't1x' to 'pseudo-assembly text' compiler.

Deliverable: A minimal compiler that can compile from 't1x' to 'pseudo-assembly' text
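To make the idea of a 't1x' to pseudo-assembly compiler more concrete, here is a rough sketch in Python. It handles only a simplified fragment of a rule (clip, lit and lit-tag elements inside lu), and the emitted instruction names (clip, push, lu, out) are hypothetical placeholders rather than a settled instruction set.

```python
import xml.etree.ElementTree as ET

# A simplified t1x-style rule, used only for illustration.
SAMPLE_RULE = """
<rule>
  <pattern><pattern-item n="nom"/></pattern>
  <action>
    <out>
      <lu>
        <clip pos="1" side="tl" part="lem"/>
        <lit-tag v="n"/>
      </lu>
    </out>
  </action>
</rule>
"""


def compile_out(rule_xml):
    """Translate the <lu> elements of one rule into pseudo-assembly text.

    The instruction names emitted here are assumptions for this sketch.
    """
    rule = ET.fromstring(rule_xml)
    lines = []
    for lu in rule.iter("lu"):
        count = 0
        for child in lu:
            if child.tag == "clip":
                lines.append(
                    f"clip {child.get('pos')} {child.get('side')} {child.get('part')}"
                )
            elif child.tag == "lit":
                lines.append(f"push {child.get('v')}")
            elif child.tag == "lit-tag":
                lines.append(f"push <{child.get('v')}>")
            else:
                raise ValueError(f"unsupported element: {child.tag}")
            count += 1
        lines.append(f"lu {count}")  # build one lexical unit from `count` stack items
        lines.append("out")          # emit the lexical unit
    return "\n".join(lines)


print(compile_out(SAMPLE_RULE))
```

For the sample rule this prints four lines: `clip 1 tl lem`, `push <n>`, `lu 2` and `out`. A real compiler would also need to handle patterns, conditions, variables and macros, which are left out here.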

Week 4: May 17 - May 23

Start coding the Python prototype VM. It will be able to run the instructions from a 't1x' pseudo-assembly text file and produce the desired structurally translated output.

Deliverable: A very simple, working prototype

Official Coding Period

Week 5: May 24 - May 30

Start working on the t2x and t3x to pseudo-assembly compiler. This will involve adding more instructions to our instruction set.

Week 6: May 31 - June 6

Modify the VM so that it can run t2x and t3x pseudo-assembly files and produce the desired output.

Deliverable: A functional compiler and VM that can handle t1x, t2x and t3x files

Week 7: June 7 - June 13

By this time the interpreted VM should be in a workable state, so we now add a 'txx' to 'byte-code' generation mode to our compiler.

Deliverable: The Python implementation of the byte-code compiler
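As a rough idea of what byte-code generation could look like, the sketch below assembles text pseudo-assembly into a binary form and back. The encoding (one opcode byte, a two-byte big-endian operand length, then the UTF-8 operand) and the opcode table are assumptions for this example; the real byte-code format is one of the things this project has to define.

```python
import struct

# Hypothetical opcode table; the real instruction set is not yet defined.
OPCODES = {"push": 1, "clip": 2, "lu": 3, "out": 4}
NAMES = {code: name for name, code in OPCODES.items()}


def assemble(text):
    """Encode pseudo-assembly text as bytes: opcode, operand length, operand."""
    blob = bytearray()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, operand = line.partition(" ")
        data = operand.encode("utf-8")
        blob += struct.pack(">BH", OPCODES[name], len(data))
        blob += data
    return bytes(blob)


def disassemble(blob):
    """Decode bytes produced by assemble() back into pseudo-assembly text."""
    lines = []
    offset = 0
    while offset < len(blob):
        code, length = struct.unpack_from(">BH", blob, offset)
        offset += 3  # one opcode byte plus a two-byte length
        operand = blob[offset:offset + length].decode("utf-8")
        offset += length
        lines.append(f"{NAMES[code]} {operand}".rstrip())
    return "\n".join(lines)


source = "clip 1 tl lem\npush <n>\nlu 2\nout"
blob = assemble(source)
print(len(blob), disassemble(blob) == source)
```

A round trip through assemble and disassemble is a convenient test: if both the Python and the later C++ implementation read and write the same bytes, they stay binary compatible.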

Week 8: June 14 - June 20

Start adding a byte-code-aware mode to our VM and fix any remaining issues in the 'txx' to 'byte-code' compiler. Test this new module.

Week 9: June 21 - June 27

Finish work on the byte-code mode of the VM, and explore the possibilities of JIT optimization for the VM.

Deliverable: Fully working Python prototype

Week 10: June 28 - July 4

Start primary work on the C++ implementation. The focus of this implementation is a fast and reliable VM and a compiler that is binary compatible with the previously created Python implementation.

Deliverable: Reimplementation of the Python 'txx' to byte-code compiler in C++

Week 11: July 5 - July 11

Start porting the Python VM to C++

Week 12: July 12 - July 18 (Mid-term Evaluation)

Final work on the C++ VM

Deliverable: A fully functional C++ implementation of the VM

Week 13: July 19 - July 25

Tie up the loose ends of the C++ VM and explore the possibilities of JIT optimization

Deliverable: A JIT-optimized (if possible) VM

Week 14: July 26 - August 1

Reserved for testing, debugging and any other emergencies

Week 15: August 2 - August 8

Reserved for testing, debugging and any other emergencies

Week 16: August 9 - August 16

Reserved for testing, debugging and any other emergencies

Final Evaluation: August 9 - August 16

Final evaluation

Development Note

From my previous GSoC experience I have learned that many unexpected problems, which are not part of the original plan, arise during the course of a project. In those cases I'll modify the project plan to fit our needs while keeping to the deadlines as much as possible.

List your skills and give evidence of your qualifications

As I've already mentioned, last year I successfully completed my Google Summer of Code project titled "Conversion of Anubadok: Creating an English Bengali Language Pair under Apertium". It was a really ambitious project, given that little linguistic data was available for Bengali other than another open source machine translation project called Anubadok. The project had three stages: building a Bengali morphological analyser/generator, creating an English to Bengali bilingual dictionary, and creating a transfer system. Building the morphological analyser/generator proved tougher than we originally anticipated, as Apertium needs more information for each lexical category than was tagged in Anubadok's dataset. By the end of the project we had a morphological analyser with 68% coverage of the 20 thousand most used words. The post-GSoC report discusses the outcome in further detail.

The project was followed up by a successful paper submission to freeRBMT09 by me and my mentor Francis Tyers. The paper can be accessed from here. I have also tried to remain active in Apertium's community after Summer of Code. Several of the maintenance enhancements were carried out long after the project was officially over.

Right now I’m in the final year of my undergraduate degree in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have taken both theoretical and practical courses in algorithms, automata theory and compilers. As the term project of the compiler design course, I successfully created a compiler for a subset of Pascal. I also took a basic OS course, where I had to go through Linux's source code to implement a custom system call, implement some new features in 'Nachos' and solve some classic concurrency problems using POSIX threads. So I believe that I have both the basic theoretical and practical knowledge needed for this project.

I have been an open source advocate in my country since my college years. Since then I have been working with Ankur, a non-profit organization, with whom I have conducted numerous open source camps. I participated in the creation and beta testing of several products from Ankur, notably the Firefox spell-checking dictionary and the off-line add-on CD for several localized Ubuntu versions.

I’m the developer of several open source applications. Netaccess-squid was created as an open source alternative to the Cyberoam system. Aural Aurora is a Spring/Oracle based music discovery and social collaboration system.

I worked as a part-time software developer for AfriGISBD, an off-shore development house of AfriGIS, for 2 years. There I mainly worked in high-level languages such as Python, PHP and Java EE, and did a lot of system administration on Linux, Solaris and Mac OS X. Until last year I was also working part-time at MuktoSoft, which involved iPhone/iPod software.

I maintain a public blog here. Although it's not a day-to-day blog, I try to keep it updated, when I get free time, with the interesting technical things I come across.

My resume can be viewed from here.

List any non-Summer-of-Code plans you have for the Summer

Right now, as I write this application, I'm sitting my first semester final exams at university. My exams will be over on April 14 and my next semester does not start until July, so once the exams are over I will be able to concentrate fully on the project. My class schedule will overlap with the final phase of the project, but there is normally not much pressure at the beginning of a semester. Keeping all this in mind, I plan to finish the bulk of the project as soon as possible, while I can give it my full attention, and do bug fixing and enhancements at the later stage. If the need arises, I can always do extra work on weekends to minimize the effect of the overlap.

Conclusion

I’d like to thank all the developers of Apertium for putting in such a great effort. I’d also like to thank Google for organizing Google Summer of Code and helping the open source community flourish.