Difference between revisions of "User:Darthxaher/Application"

m
(Major Revision 2)
Line 28: Line 28:
'''5 Why is it you are interested in the Apertium Project? '''
'''5 Why is it you are interested in the Apertium Project? '''


I have been long time supporter of the open source movement in my country. Adopting to open source philosophy is crucial for a developing country like Bangladesh where cost of proprietary software is unbearable for the most people. Open source machine translation that being is offered by Apertium will have far reaching effect in the Bengali adoption and localization of open source softwares.
I have been long time supporter of the open source movement in my country. Adopting to open source philosophy is crucial for a developing country like Bangladesh where cost of proprietary software is unbearable for the most people. Open source machine translation that is being offered by Apertium will have far reaching effect in the Bengali adoption and localization of open source softwares.


A fews day’s spending with the Apertium community has made me realize that people here are both knowledgeable and helpful. I think that is another reason for me to be here.
A fews day’s spending with the Apertium community has made me realize that people here are both knowledgeable and helpful. I think that is another reason for me to be here.
Line 52: Line 52:
'''9 Work Plan'''
'''9 Work Plan'''


I have been keeping in contact with Mentor Francis Tyers regarding this idea. Three major aspects here are:
I have been keeping in contact with mentor Francis Tyers regarding this idea. Three major aspects here are:


-A morphological generator for Bengali part
-A morphological generator for Bengali part
Line 58: Line 58:
-A full bilingual dictionary
-A full bilingual dictionary


-A transfer sytem
-A transfer system


I have been maintaining contact with some of expert people who worked on English to Bengali translation. Dr. G M Hossain, the author of Anubadok<ref name="ftn1">[http://anubadok.sf.net/ http://anubadok.sf.net/]</ref> has given me a lot of insight on Anubadok. I also personaly met Dr. Mumit Khan, head of CRBLP<ref name="ftn2">[http://www.bracuniversity.net/research/crblp/index.php http://www.bracuniversity.net/research/crblp/index.php]</ref>. CRBLP has been working on several Bengali language research issues and did some work on Machine Translation too. And all of their work has been released under GPL.
I have been maintaining contact with some of the expert people in this field who worked on English to Bengali translation. Dr. G M Hossain, the author of Anubadok<ref name="ftn1">[http://anubadok.sf.net/ http://anubadok.sf.net/]</ref> has given me a lot of insight on Anubadok. I also personally met Dr. Mumit Khan, head of CRBLP<ref name="ftn2">[http://www.bracuniversity.net/research/crblp/index.php http://www.bracuniversity.net/research/crblp/index.php]</ref>. CRBLP has been working on several Bengali language research issues and did some work on Machine Translation too. And all of their work has been released under GPL.


Francis and I was hoping that they could provide us with a morphological generator but they seem not to have implemented one of their own. I was, however, able to get the result of their courpus analysis on prothom-alo newspaper<ref name="ftn3">[http://www.bracuniversity.net/research/crblp/nlpcourse/report/MORPHOLOGY_report.pdf%20 http://www.bracuniversity.net/research/crblp/nlpcourse/report/MORPHOLOGY_report.pdf ]</ref>. It has the list of the most frequent used words. We can take the first several thousand for the start. They also have given me link to some other people who might have worked on a Bengali morphological analyzer/generator but so far I haven’t heard from them.
Francis and I were hoping that they could provide us with a morphological generator but they seem not to have implemented one of their own. I was, however, able to get the result of their corpus analysis on prothom-alo newspaper<ref name="ftn3">[http://www.bracuniversity.net/research/crblp/nlpcourse/report/MORPHOLOGY_report.pdf%20 http://www.bracuniversity.net/research/crblp/nlpcourse/report/MORPHOLOGY_report.pdf ]</ref>. It has the list of the most frequent used words. We can take the first several thousand for the start. They also have given me link to some other people who might have worked on a Bengali morphological analyzer/generator but so far I haven’t heard from them.


Keeping all this findings and GSoC’s tight schedule in mind, I have decided that, my first priority would be to create the morphological generator. This could be accomplished by re-implementing Anubadok’s rulesets. I’d then gradually move to create a transfer system after creating a full bilingual dictionary.
Keeping all this findings and GSoC’s tight schedule in mind, I have decided that, my first priority would be to create the morphological generator. This could be accomplished by re-implementing Anubadok’s rule-sets. I’d then gradually move to create a transfer system after creating a full bilingual dictionary.

I intend to follow this time schedule.


'''Community Bonding Period (April 20 - May 22)'''
'''Community Bonding Period (April 20 - May 22)'''


-Familiarizing with Apertium’s Toolchain and its Community
-Familiarizing with Apertium’s tool-chain and its community


-Thorough check of Apertium and Anubadok code-base.
-Thorough check of Apertium and Anubadok code-base.
Line 76: Line 78:
-Requirement analysis on how to adopt Anubadok’s generator into Apertium.
-Requirement analysis on how to adopt Anubadok’s generator into Apertium.


-Try to come up with -
'''Deliverable''' A report on Anubadok’s morphological generation system and how-to adopt it into Apertium. This report will comprise


-A set of tags that will be needed for the morphological generator, apart form the default ones from Apertium (Anubadok’s tagset is incompatible with Apertium)
-A set of tags that will be needed for the morphological generator, apart form the default ones from Apertium (Anubadok’s tag-set is incompatible with Apertium)


-A set of rulesets that Anubadok uses for the morphological generation
-A set of rule-sets that Anubadok uses for the morphological generation


-A set of transfer rules that will be used for the transfer system
-A set of transfer rules that will be used for the transfer system

'''Deliverable''' A preliminary skeleton of the apertium-en-bn package


'''Note:''' If we hear something from CRBLP about any analyzer/generator, we’ll take that into consideration too.
'''Note:''' If we hear something from CRBLP about any analyzer/generator, we’ll take that into consideration too.
Line 88: Line 92:
'''Week 1 (May 23 - May 29)'''
'''Week 1 (May 23 - May 29)'''


-Start creation of Bengali monodix by selecting Closed Category words (Conjuction, aux verbs, determiners etc)
-Start creation of Bengali monodix by selecting closed category words (conjunction, aux verbs, determiners etc)




Line 153: Line 157:
'''Week 9 (July 18 - July 24)'''
'''Week 9 (July 18 - July 24)'''


-Refine the transfer system
-Refine the transfer system, add adequate lexical selection rules

'''Week 10 (July 25 - July 31)'''
'''Week 10 (July 25 - July 31)'''


Line 186: Line 191:
I have been an open source advocate in my country from my college years. I have been working with Ankur<ref name="ftn5">[http://www.ankur.org.bd/wiki/People http://www.ankur.org.bd/wiki/People]</ref>, a not-profit organization since then. With them I have conducted numerous open source camps. I participated in creating and beta testing of several products from Ankur, notably Firefox spell-checking dictionary and off-line add-on CD for several localized Ubuntu versions.
I have been an open source advocate in my country from my college years. I have been working with Ankur<ref name="ftn5">[http://www.ankur.org.bd/wiki/People http://www.ankur.org.bd/wiki/People]</ref>, a not-profit organization since then. With them I have conducted numerous open source camps. I participated in creating and beta testing of several products from Ankur, notably Firefox spell-checking dictionary and off-line add-on CD for several localized Ubuntu versions.


I’m the developer of several open source applications. Netaccess-squid<ref name="ftn6">[http://sourceforge.net/projects/netaccess-squid/ http://sourceforge.net/projects/netaccess-squid/]</ref> has been created as an open source alternative to Cyberoam System<ref name="ftn7">[http://www.cyberoam.com/productoverview.html http://www.cyberoam.com/productoverview.html]</ref>. Aural Aurora<ref name="ftn8">[http://code.google.com/p/auralaurora/ http://code.google.com/p/auralaurora/]</ref> is a Spring<ref name="ftn9">[http://www.springsource.org/ http://www.springsource.org/]</ref>/Oracle based music discovery and social collaboration system.
I maintain a public blog here<ref name="ftn6">[http://zaher14.blogspot.com/ http://zaher14.blogspot.com/]</ref>. Although its not a day-to-day blog, I try to keep it updated when I get free time with the interesting things I come across.

I had worked as a part time software developer for AfriGISBD, an off-shore development house of AfriGIS<ref name="ftn10">[http://www.afrigis.co.za/ http://www.afrigis.co.za/]</ref> 2 years. I mainly worked in high level languages like python, php, javaEE there and had to do a lot of system-admin on Linux, Solaris and Mac OS X. I’m currently employed (part-time) at MuktoSoft<ref name="ftn11">[http://muktosoft.com/ http://muktosoft.com/]</ref> where I’m working on iPhone based software.


I maintain a public blog here<ref name="ftn12">[http://zaher14.blogspot.com/ http://zaher14.blogspot.com/]</ref>. Although its not a day-to-day blog, I try to keep it updated when I get free time with the interesting technical things I come across.
I have been working as a part time software developer for a local company for the last 2 years. I mainly worked in high level languages like python, php, javaEE there and had to do a lot of system-admin on Linux, Solaris and MacOS X.


I believe myself to be highly competent to accomplish this project in the feasible time-line. I have been doing some experiments for the en-bn language pair already and keepong in contact with the developers which is a must have quality for any open source software developer.
'''I have been doing some experiments for the Apertium en-bn language pair already and keeping in contact with the developers. Maintaining a good communication among the developers is must have quality in any type open source software development. Therefore, I believe myself to be highly competent to accomplish this project in the feasible time-line. '''


'''11 List any non-Summer-of-Code plans you have for the Summer'''
'''11 List any non-Summer-of-Code plans you have for the Summer'''
Line 198: Line 205:
'''12 Conclusion'''
'''12 Conclusion'''


I’d like to thank all the developers of Apertium for putting up such a great effort. I’d also like to thank Google for organising Google Summer of Code and fluorishing Open Source community.
I’d like to thank all the developers of Apertium for putting up such a great effort. I’d also like to thank Google for organizing Google Summer of Code and flourishing Open Source community.


----
----

Revision as of 01:34, 3 April 2009

Google Summer of Code Application 2009
Abu Zaher Md. Faridee
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology


1 Name

Abu Zaher Md. Faridee

2 Email Address

zaher14@gmail.com

3 Contact Information

IRC: darthxaher@irc.freenode.net

Cell Phone: +880 1714070147

4 Why is it you are interested in machine translation?

During my years as a Computer Science undergraduate student, I have come across a lot of interesting subjects like Automata Theory and Compilers. In my next semester I’ll be taking the Pattern Recognition and Machine Learning courses. Machine Translation/Natural Language Processing is one of the best applications in which I would be able to hone the skills gained from these subjects. I’m also planning to write my undergraduate thesis next year on Machine Translation, which could benefit a lot from this project.

5 Why is it you are interested in the Apertium Project?

I have been a long-time supporter of the open source movement in my country. Adopting the open source philosophy is crucial for a developing country like Bangladesh, where the cost of proprietary software is unaffordable for most people. The open source machine translation offered by Apertium will have a far-reaching effect on the Bengali adoption and localization of open source software.

A few days spent with the Apertium community have made me realize that people here are both knowledgeable and helpful. That is another reason for me to be here.

6 Which of the published tasks are you interested in? What do you plan to do?

I’m interested in the Conversion of Anubadok task, which essentially means building a new English-Bengali language pair.

Anubadok is an experimental English-to-Bengali machine translation system developed by G M Hossain. While it works fairly well, a lot of work remains to be done. My idea is to port the existing Anubadok system to the Apertium framework. I plan to implement a Bengali morphological generator. The tagging system also needs to be standardized, and then a new transfer system can be written. I think Apertium’s flexible framework will allow me to implement a language pair as good as Anubadok. Further enhancements could then be carried out after GSoC to make it a full-fledged system.

It should be noted that only translation from English to Bengali is the focus of this project. Bengali to English is out of scope for now.

7 Why should Google and Apertium Sponsor it?

From what I have seen, most of the language pairs currently available in Apertium are for European languages. Very little work has been done for Indic languages. Indic languages tend to be different from European languages and many things need to be standardized. My focus would be on building an English-Bengali language pair. Building this pair will eventually contribute to creating other European-Indic and Indic-Indic language pairs.

8 How and who will it benefit in society?

I’m from Bangladesh, where Bengali is spoken as the first language. A lot of people don’t know English and are therefore cut off from the benefits of Information Technology. Localization of open source software is a great way to bring them closer to the benefits of IT. I believe my idea, if successfully implemented, will go a long way towards the socio-economic development of my people.

I have come to know that Wikipedia uses Apertium to translate articles from English[1], especially for minor languages. There is no option to convert to Bengali, but a lot of Bengali Wikipedians currently use Anubadok to convert articles and then revise them manually. I think an English-Bengali translation system will ease the work of a lot of Wikipedians.

9 Work Plan

I have been keeping in contact with mentor Francis Tyers regarding this idea. Three major aspects here are:

-A morphological generator for the Bengali part

-A full bilingual dictionary

-A transfer system

I have been maintaining contact with some of the experts in this field who have worked on English to Bengali translation. Dr. G M Hossain, the author of Anubadok[2], has given me a lot of insight into Anubadok. I also personally met Dr. Mumit Khan, head of CRBLP[3]. CRBLP has been working on several Bengali language research issues and has done some work on Machine Translation too. All of their work has been released under the GPL.

Francis and I were hoping that they could provide us with a morphological generator, but they seem not to have implemented one of their own. I was, however, able to get the results of their corpus analysis on the prothom-alo newspaper[4]. It contains a list of the most frequently used words. We can take the first several thousand as a start. They have also given me links to some other people who might have worked on a Bengali morphological analyzer/generator, but so far I haven’t heard from them.

Keeping all these findings and GSoC’s tight schedule in mind, I have decided that my first priority will be to create the morphological generator. This could be accomplished by re-implementing Anubadok’s rule-sets. I’d then gradually move on to creating a transfer system after creating a full bilingual dictionary.

I intend to follow this time schedule.

Community Bonding Period (April 20 - May 22)

-Familiarizing with Apertium’s tool-chain and its community

-Thorough check of Apertium and Anubadok code-base.

-Gathering thorough knowledge of how both systems work and how they differ.

-Requirement analysis on how to adopt Anubadok’s generator into Apertium.

-Try to come up with -

-A set of tags that will be needed for the morphological generator, apart from the default ones from Apertium (Anubadok’s tag-set is incompatible with Apertium)

-A set of rule-sets that Anubadok uses for the morphological generation

-A set of transfer rules that will be used for the transfer system

Deliverable: A preliminary skeleton of the apertium-en-bn package

Note: If we hear something from CRBLP about any analyzer/generator, we’ll take that into consideration too.

Week 1 (May 23 - May 29)

-Start creation of the Bengali monodix by selecting closed category words (conjunctions, auxiliary verbs, determiners, etc.)


Note: Choosing the most frequent words will require analyzing a language corpus. This can be done by either

-Reusing the existing corpus from the CRBLP research[5] on the prothom-alo newspaper and, if needed,

-Running a script on Bengali Wikipedia articles

Our goal is to reach 70% word coverage.
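The corpus analysis described above could look roughly like the following. This is only a minimal sketch, not the script that will actually be used: the function names, the whitespace tokenizer and the punctuation set are assumptions made for illustration, and the 70% figure is simply the target stated above. It builds a frequency list, picks the most frequent words until the target share of running words is reached, and measures the coverage a given word list achieves.

```python
from collections import Counter

# Punctuation stripped from token edges; this set is an assumption and
# would need tuning for real Bengali text (it includes the danda "।").
PUNCTUATION = '.,;:!?"\'()[]{}।-'


def tokenize(text):
    """Split on whitespace and strip edge punctuation (a naive tokenizer)."""
    tokens = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION)
        if word:
            tokens.append(word)
    return tokens


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def words_needed_for_coverage(freq_list, target=0.70):
    """Pick the most frequent words until they cover `target` of all tokens."""
    total = sum(count for _, count in freq_list)
    if total == 0:
        return []
    covered = 0
    selected = []
    for word, count in freq_list:
        if covered / total >= target:
            break
        selected.append(word)
        covered += count
    return selected


def coverage(tokens, vocabulary):
    """Fraction of running words in `tokens` that appear in `vocabulary`."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)
```

The same `coverage` function could later be run against the lemmata and surface forms of the monodix to check progress towards the goal on a held-out text.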

Deliverable: A preliminary version of the Bengali monodix.

Week 2 (May 30 - June 5)

-Initial work on the Bengali morphological generator.

-Continue working on the Bengali monodix, now for open category words (nouns, verbs, adjectives, etc.)

Deliverable: Updated monodix.

Week 3 (June 6 - June 12)

-More work on the Bengali morphological generation rules

-Finish working on the Bengali monodix.

Deliverable: Finished Bengali monodix.

Week 4 (June 13 - June 19)

-Start working on English to Bengali bidix

-Update the morphological generator

Note: Start with about 1500 lemmata for the bidix, then gradually extend it to full coverage of the monodix.
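Progress towards full coverage of the monodix could be tracked with a small check like the one below. It is a sketch under stated assumptions: the lemmata are taken to be already extracted into plain Python collections (no dictionary file is parsed here), and the function name and the placeholder strings are hypothetical.

```python
def missing_from_bidix(monodix_lemmata, bidix_pairs):
    """List Bengali monodix lemmata that no bidix entry translates into.

    monodix_lemmata: iterable of Bengali lemmata from the monodix.
    bidix_pairs: iterable of (english_lemma, bengali_lemma) tuples.
    """
    translated = {bengali for _, bengali in bidix_pairs}
    return sorted(set(monodix_lemmata) - translated)


# Placeholder data only; real entries would come from the dictionaries.
monodix = ["bn_lemma_1", "bn_lemma_2", "bn_lemma_3"]
bidix = [("en_word_1", "bn_lemma_1"), ("en_word_2", "bn_lemma_3")]
gaps = missing_from_bidix(monodix, bidix)
```

An empty result would mean that every monodix lemma is reachable from at least one English entry.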

Week 5 (June 20 - June 26)

-Continue working on English to Bengali bidix

-Finish the Bengali morphological generator.

Deliverable: Finished morphological generator

Week 6 (June 27 - July 3)

-Continue working on English to Bengali bidix.

Week 7 (July 4 - July 10)

-Manual checking of the English to Bengali bidix.

-Preliminary analysis for the transfer system (including chunking to rearrange the word order)

Deliverable: English to Bengali bidix.

Mid-term evaluation of the morphological generator and the English to Bengali bidix

Week 8 (July 11 - July 17)

-Refine the transfer system

Week 9 (July 18 - July 24)

-Refine the transfer system, add adequate lexical selection rules

Week 10 (July 25 - July 31)

-Finish working on the transfer system

-Start TestVocing

Deliverable: A working transfer system.

Week 11 (August 1 - August 7)

-Peer review (TestVoc done).

-Comparison of quality with the original Anubadok program

Deliverable: Preliminary Release

Week 12 (August 8 - August 14)

-Evaluation/Cleanup

Deliverable: Finished Product

Week 13 (August 15 - August 16)

-Reserved for emergencies only

10 List your skills and give evidence of your qualifications

Right now I’m in the 2nd semester of my 3rd year as an undergraduate in Computer Science and Engineering at Bangladesh University of Engineering and Technology. I have attended both theoretical and practical courses in Algorithms, Automata Theory and Compilers, and believe that I have the basic theoretical knowledge for this project.

I have been an open source advocate in my country since my college years. I have been working with Ankur[6], a non-profit organization, since then. With them I have conducted numerous open source camps. I participated in creating and beta testing several products from Ankur, notably the Firefox spell-checking dictionary and an off-line add-on CD for several localized Ubuntu versions.

I’m the developer of several open source applications. Netaccess-squid[7] was created as an open source alternative to Cyberoam System[8]. Aural Aurora[9] is a Spring[10]/Oracle-based music discovery and social collaboration system.

I worked as a part-time software developer for AfriGISBD, an off-shore development house of AfriGIS[11], for 2 years. There I mainly worked in high-level languages like Python, PHP and Java EE, and had to do a lot of system administration on Linux, Solaris and Mac OS X. I’m currently employed (part-time) at MuktoSoft[12], where I’m working on iPhone-based software.

I maintain a public blog here[13]. Although it’s not a day-to-day blog, I try to keep it updated, when I get free time, with the interesting technical things I come across.

I have already been doing some experiments for the Apertium en-bn language pair and keeping in contact with the developers. Maintaining good communication among developers is a must-have quality in any type of open source software development. Therefore, I believe myself to be highly competent to accomplish this project within the planned time-line.

11 List any non-Summer-of-Code plans you have for the Summer

I don’t have any other plans besides Google Summer of Code this summer. My class schedule does overlap with GSoC’s schedule, but I think it won’t conflict with the work plan. I can always do extra work on weekends to make up for the overlap.

12 Conclusion

I’d like to thank all the developers of Apertium for putting in such a great effort. I’d also like to thank Google for organizing Google Summer of Code and helping the open source community flourish.