I'm Ragib Ahsan from Bangladesh. I'm currently an undergraduate student in the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology.

I would like to participate in Google Summer of Code 2011 with Apertium, and I'm interested in adopting the new Bengali-English language pair.

Apertium Bengali-English

Currently the morphological analyzer is nearly complete, with 68% coverage of the wiki. The bilingual dictionary still needs a lot of entries, and the transfer system has only a few rules to work with.

Some example outputs:

I eat rice -> আমি ধান খাই
I love you -> আমি আপনাকে ভালবাসি

You can find a list of tests here

My project goals are as follows:

  • Completing the monolingual dictionary for Bengali up to a wide coverage (at least 80%) of the wiki.
  • Completing the bilingual dictionary with the necessary entries.
  • Writing the transfer rules. This will be a challenging part, as the two languages are not closely related.
  • Finally, performing testvocing to ensure release quality.
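The coverage figures above can be measured in several ways. The following is only a minimal sketch, assuming coverage means the share of tokens that receive at least one analysis, and assuming the analyzer output is in the Apertium stream format, where an unknown word appears as `^word/*word$`. The function name `coverage` and the sample tags are illustrative, not taken from the project, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: backslash-escaped characters are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are emitted as ^word/*word$, so their analysis starts with '*'.
    known = sum(1 for _surface, analysis in units if not analysis.startswith("*"))
    return known / len(units)


# Illustrative input only; the tags are assumptions, not the real bn.dix tags.
sample = "^আমি/আমি<prn><pers>$ ^xyz/*xyz$ ^খাই/খা<vblex>$ ^abc/*abc$"
print(f"{coverage(sample):.0%}")  # 2 of 4 tokens are known, prints 50%
```

Run over the analysis of a large wiki dump, a value of 0.80 or more would correspond to the 80% target.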

Preparing Myself

I've downloaded and installed the "apertium-bn-en" package from the Apertium incubator, and I'm really enjoying playing around with it on my system. I've already gone through the Apertium New Language Pair HOWTO. I also tried to read the Apertium Official Documentation, which seems really complex. I'm discussing various issues with the prospective mentors, Francis Tyers and Abu Zaher. With their help and some exploring of the apertium-bn-en project, I've finally prepared my proposal for this year's GSoC. You can find it here.

I also found the paper on Bengali Morphological Analyzer[1] quite interesting. Last but not least, I'm trying to solve some of the challenges given for this project. Check here.

Community Bonding Period

The community bonding period started right after the announcement of accepted student proposals on April 25. I had my plans for this period. As mentioned in my work plan, I'm exploring the Apertium tool chain to familiarize myself with it. I'm a regular on IRC and I'm getting to know the community more closely. In the meantime, I'm adding some new bidix entries to reduce the load on the coding phase a bit. I also plan to prepare a test case list for the testvocing at the end. I'm putting some remarks here and intend to update them from time to time:

Week 1: April 26 - May 02, 2011

  • Familiarizing myself with svn
  • After building bn-en with make, we need to copy the '*.mode' files from the 'modes' directory to the system's installation directory, i.e. /usr/share/apertium/modes
  • 240 new adjectives added to
  • 147 new nouns added to
  • Planning for generating a test case list
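The mode-file step from the list above can be sketched in Python. This is a minimal sketch, not part of the project: the helper name `install_modes` is hypothetical, and writing to /usr/share/apertium/modes normally requires root privileges.

```python
import shutil
from pathlib import Path


def install_modes(modes_dir, target_dir):
    """Copy every *.mode file from modes_dir into target_dir.

    Returns the sorted list of copied file names.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for mode_file in sorted(Path(modes_dir).glob("*.mode")):
        shutil.copy(mode_file, target / mode_file.name)
        copied.append(mode_file.name)
    return copied
```

From the apertium-bn-en build directory, this would be called as `install_modes("modes", "/usr/share/apertium/modes")`.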

Week 2: May 03 - May 09, 2011

  • 156 new nouns added to
  • 100 new adjectives added to
  • Another 100 new adjectives added to
  • Studying the apertium-doc; currently getting to know the different modules of Apertium
  • Correcting some minor errors in the previous entries and misplaced lemmas (nouns and adjectives)
  • Adding some test cases here
  • Added another 138 adjectives to with some minor corrections

Week 3: May 10 - May 16, 2011

  • Studying the apertium-doc; currently getting to know the different modules of Apertium
  • 87 new adjectives added to with minor corrections
  • Studying Bengali verb issues, especially those mentioned in the paper on Bengali Morphological Analyzer[1]

Week 4: May 17 - May 22, 2011

  • Studying the apertium-bn-en a bit more closely
  • Generating some more test cases

Coding Phase

Week 1: May 23 - May 29, 2011

  • 44 adjectives added to bidix with some minor corrections
  • more adjectives added to bidix
  • 67 nouns added to bidix
  • 45 nouns added to bidix with some corrections on previous entry
  • learning monodix basics

Week 2: May 30 - June 5, 2011

  • corrections on previous entries
  • some scripts added to /dev/gsoc2011/bidix/adjective & /dev/gsoc2011/bidix/noun
  • again corrections based on code review
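The scripts mentioned above are not reproduced on this page. As a rough illustration of what such a helper could look like, here is a minimal sketch that turns tab-separated "Bengali, English" word pairs into bidix `<e>` entries. The input format and the function names `bidix_entry` and `entries_from_tsv` are assumptions, not the actual scripts; the entry layout follows the usual Apertium bilingual dictionary format, with `<b/>` standing for a space inside a lemma.

```python
from xml.sax.saxutils import escape


def _lemma(text):
    """Escape XML special characters and encode spaces as <b/>."""
    return escape(text).replace(" ", "<b/>")


def bidix_entry(left, right, pos):
    """Build one bidix entry with the same part-of-speech tag on both sides."""
    return (f'<e><p><l>{_lemma(left)}<s n="{pos}"/></l>'
            f'<r>{_lemma(right)}<s n="{pos}"/></r></p></e>')


def entries_from_tsv(lines, pos):
    """Yield bidix entries from lines of the form 'bengali<TAB>english'."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        left, right = line.split("\t")
        yield bidix_entry(left, right, pos)


for entry in entries_from_tsv(["ভাল\tgood\n"], "adj"):
    print(entry)
```

Generated entries would still need a manual review before being pasted into bn-en.dix, since a word list alone cannot decide between competing translations.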

Week 3: June 6 - June 12, 2011

  • corrections based on code review
  • tagging adjectives, nouns and proper nouns from frequency list
  • checking bengali monodix

Week 4: June 13 - June 19, 2011

  • more adjectives added to bidix
  • more nouns added to bidix
  • added some scripts to automate the task with monodix entries
  • added 43 adjectives to with associating scripts to "dev/gsoc2011/monodix/adjective"
  • added 143 nouns to with associating scripts to "dev/gsoc2011/monodix/noun"
  • added 87 proper nouns to with associating scripts to "dev/gsoc2011/monodix/proper-noun"
  • added 26 adjectives to bn.dix
  • added 59 proper-nouns to bn.dix
  • added 126 nouns to bn.dix

Week 5: June 20 - June 26, 2011

  • tagging new raw words
  • added 74 new adjectives to bn.dix
  • added 132 new proper nouns to bn.dix
  • added 198 new nouns to bn.dix
  • added 104 adjectives to bn.dix
  • added 125 proper nouns to bn.dix
  • wiki frequency list added
  • added 25 adjectives to bn.dix
  • added 71 nouns to bn.dix
  • added 20 proper-nouns to bn.dix

Week 6: June 27 - July 03, 2011

  • added 42 adjectives to bn.dix
  • added 13 adverbs to bn.dix
  • added 31 proper-nouns to bn.dix
  • added 127 nouns to bn.dix
  • added 44 adjectives and 7 adverbs to bn.dix
  • added 111 nouns and 58 proper nouns to bn.dix
  • added 45 adjectives and 3 adverbs
  • added 113 nouns and 56 proper nouns
  • added 56 adjectives and 3 adverbs
  • added 118 nouns and 64 proper nouns
  • added 75 adjectives to bn.dix
  • added 85 proper-nouns to bn.dix
  • added 125 nouns to bn.dix
  • added 78 adjectives and 94 proper-nouns to bn.dix
  • added 160 nouns to bn.dix
  • added '-টি' inflections to some noun paradigms
  • added 88 adjectives and 8 adverbs to bn.dix
  • added 93 proper-nouns and 152 nouns to bn.dix

Week 7: July 04 - July 10, 2011

  • added 54 adjectives and 6 adverbs to bn.dix
  • added 114 nouns and 62 proper-nouns to bn.dix
  • added 381 nouns to bidix
  • added 402 nouns to bidix
  • added 446 nouns to bidix
  • added 202 adjectives to bidix
  • added 178 adjectives to bidix
  • added 211 adjectives to bidix
  • added 473 proper-nouns to bidix
  • added 488 proper-nouns to bidix
  • added 542 proper-nouns to bidix

Week 8: July 11 - July 17, 2011

  • studying the verbs
  • studying the transfer system

Week 9: July 18 - July 24, 2011

  • checking the transfer rules
  • checking the chunk, interchunk and postchunk of en-bn

Week 10: July 25 - July 31, 2011

  • added a transfer rule det_adj_nom for testing purposes
  • got sick :(
  • minor corrections in bidix (now can recognize "my", "our", "your" etc.)

Week 11: August 1 - August 7, 2011

  • minor corrections in bidix and en-bn.t1x
  • corrected some problems related to "person" recognition in en-bn.t1x
  • added rule "DET NOM" to en-bn.t1x
  • corrected "ART NOM"
  • corrected rule "VERB CONJ"
  • added rule "VBSER PRES"
  • added rule "VBSER PAST"
  • corrected "DET NOM"
  • added rule "VBHAVER VBLEX"
  • changed pardef "এক__num" in bn.dix
  • corrected rule "DET ADJ NOM"

Week 12: August 8 - August 14, 2011

  • corrected rule "NOM" to support proper nouns
  • added rule "VAUX VBLEX"
  • added rule "FTAUX BE VBLEX"
  • added regression tests on tenses
  • corrected rule "NOM" (for plurals and <def> tag)
  • corrected rule "ADJ NOM" (for proper nouns and animacy)
  • added rule "VBHAVER VBSER VBLEX"
  • added rule "NUM NOM"
  • added some numbers to bidix
  • added rule "POST" for postpositions
  • added rule "SN SV POST SN" in 'en-bn.t2x'
  • added 425 multi-word verbs (with suffix কর) to bn.dix
  • added 26 prepositions to bn-en.dix
  • added rule "FTAUX VBHAVER VBLEX" to en-bn.t1x
  • added rule "FTAUX VBHAVER VBSER VBLEX" to en-bn.t1x
  • added rule "VBDO"
  • added rule "VBDO VBLEX"
  • added rule "SVD SN SV" to en-bn.t2x
  • added rule "GERUND" to en-bn.t1x
  • added rule "SVI SN SVG" to en-bn.t2x
  • added rule for chunks: "vbser_past" and "vbger" in en-bn.t3x

Week 13: August 15 - August 22, 2011

  • corrections on GERUND
  • corrections on person
  • added rule "SVH SN SV" to en-bn.t1x
  • correction on "SVH SN SV" in t2x
  • added rule "SVP SN SN"
  • added rule "SN SVPS SN" to en-bn.t2x
  • added rule "SVPS SN SN" to en-bn.t2x
  • added rule "FTAUX BE" to en-bn.t1x and "SN SVF SN" to en-bn.t2x
  • rules "SN SVP SN" and "SN SVPS SN" merged into "SN SVI SN" in en-bn.t2x
  • rules "SVP SN SN" and "SVPS SN SN" merged into "SVI SN SN" in en-bn.t2x
  • added rule "VBHAVER VBSER" to en-bn.t1x and "SN SVHB SN" to en-bn.t2x
  • added 80 adverbs to bn-en.dix
  • added rules for adverbs modifying verbs
  • added 166 adverbs
  • added rule ADV ADJ NOM to en-bn.t1x
  • added rule DET ADV ADJ NOM and some corrections
  • added rule ART ADJ NOM to en-bn.t1x
  • fixed verb's possession on subject, especially for "have"
  • added rule "ADV" to t1x and "ADV SVI SN" to t2x

[1] Development of a Morphological Analyzer for Bengali