Bengali and English/Final report
Description
Our primary goal was to reach at least 80% coverage of the Bengali Wikipedia with the monodix. We achieved that by the midterm. It took more than 3000 new entries to reach the goal; previously the coverage was around 68%. In the meantime, the bilingual dictionary was also being enriched with new entries. Some of these entries still need to be added to the English monodix section. After the midterm, we focused on the transfer system. It was almost a fresh start, as only a few rules were available. The transfer system for the bn-en pair is quite challenging because the languages are not closely related, and several complex issues arise when translating. We had to deal with these, unfortunately with very few resources available. The transfer system is currently in a primitive state, and we look forward to completing it as soon as possible. The technical details are provided below. Though we couldn't complete everything we dreamt of, I really enjoyed being part of such a team and working together throughout the summer.
Monodix
The Bengali monodix is now in quite a good state, with around 80% coverage of the Bengali Wikipedia. It contains about 8230 lemmas: 3594 nouns, 1766 proper nouns, 1620 adjectives, 473 adverbs and 777 other lemmas. We look forward to increasing the coverage alongside completing the transfer rules.
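As a rough illustration of what a coverage figure like the one above measures, here is a minimal sketch that counts known tokens in analyser output. It assumes the Apertium stream format, in which each lexical unit looks like `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function name, the romanised sample words and their tags are hypothetical, and this is not the script that produced the numbers in this report.

```python
import re

# One lexical unit in (simplified) Apertium stream format: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units that received an analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis beginning with '*' marks a word the analyser did not know.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output; the words and tags are made up for the example.
sample = "^ami/ami<prn><p1><sg>$ ^bhat/bhat<n>$ ^khai/*khai$ ^gan/gan<n>$"
print(f"{naive_coverage(sample):.2%}")  # 75.00%
```

Running the same count over a large corpus gives the percentage of running words the monodix can analyse.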
Bidix
The bidix currently consists of 7446 entries. Though this is quite a large number relative to the monodix, some common everyday words are still missing. They will be covered gradually as work on the transfer rules proceeds. Also, not all of the English lemmas in the bidix exist in the English monodix; that needs to be updated as well. Currently there are 3444 nouns, 1686 proper nouns, 1384 adjectives and 1243 other lemmas.
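The check for English lemmas that are in the bidix but missing from the English monodix could be sketched as a set difference. The snippet below assumes a simplified dix layout (bidix entries as `<e><p><l>…</l><r>…</r></p></e>` with English on the right, monodix entries carrying an `lm` attribute). The inline sample dictionaries and function names are hypothetical, and multi-word lemmas containing `<b/>` are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced sample dictionaries (romanised Bengali on the left).
BIDIX = """<dictionary><section id="main" type="standard">
<e><p><l>bari<s n="n"/></l><r>house<s n="n"/></r></p></e>
<e><p><l>boi<s n="n"/></l><r>book<s n="n"/></r></p></e>
</section></dictionary>"""

EN_MONODIX = """<dictionary><section id="main" type="standard">
<e lm="house"><i>house</i><par n="house__n"/></e>
</section></dictionary>"""


def bidix_right_lemmas(xml_text):
    """Collect the text of every <r> element (assumed to be the English side)."""
    root = ET.fromstring(xml_text)
    lemmas = set()
    for right in root.iter("r"):
        text = (right.text or "").strip()
        if text:
            lemmas.add(text)
    return lemmas


def monodix_lemmas(xml_text):
    """Collect the lm attribute of every <e> element that has one."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


missing = sorted(bidix_right_lemmas(BIDIX) - monodix_lemmas(EN_MONODIX))
print(missing)  # ['book']
```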
Constraint Grammar and Transfer rules
We have just begun to build the skeleton of the transfer system. This is a complex task, as the two languages are not closely related at all. There are several word-order issues between parts of speech, such as verb-object ordering. Also, Bengali has no prepositions; the corresponding word comes after the nominal, as a postposition. These issues are handled with a first, basic approach. The negation of verbs is not handled yet; I guess that will be the most challenging part of the transfer system. The regression tests give a good picture of the sorts of cases that are handled.
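To make the two reordering issues concrete, here is a toy sketch in Python. It is not the Apertium transfer rule format and not the rules in the repository; the role labels, part-of-speech labels and English glosses are assumptions chosen only to show the idea of moving the verb before the object and turning a postposition into a preposition.

```python
# Target order for an English clause: subject, verb, object.
ORDER = {"S": 0, "V": 1, "O": 2}


def reorder_sov_to_svo(chunks):
    """chunks: list of (role, gloss) with roles 'S', 'O', 'V'.

    sorted() is stable, so chunks with the same role keep their relative order.
    """
    return sorted(chunks, key=lambda chunk: ORDER[chunk[0]])


def postposition_to_preposition(tokens):
    """tokens: list of (pos, gloss). Rewrite each noun + postposition pair
    as preposition + noun; everything else is copied unchanged."""
    out = []
    i = 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and tokens[i][0] == "n"
                and tokens[i + 1][0] == "post"):
            out.append(("pr", tokens[i + 1][1]))
            out.append(tokens[i])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


# Glossed toy inputs: "I rice eat" and "table on".
clause = [("S", "I"), ("O", "rice"), ("V", "eat")]
phrase = [("n", "table"), ("post", "on")]

print(" ".join(g for _, g in reorder_sov_to_svo(clause)))           # I eat rice
print(" ".join(g for _, g in postposition_to_preposition(phrase)))  # on table
```

Real transfer rules also have to deal with agreement, articles and case markers, which this sketch ignores.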
Statistics
- Dictionaries
- apertium-bn-en.bn.dix: 8,230
- apertium-bn-en.en-bn.dix: 7,495
- Coverage
- Bengali Wikipedia: 80.59% +/- 1.7878%
- Prothom Alo:
- Rules
- apertium-bn-en.en-bn.t1x: 42
- apertium-bn-en.en-bn.t2x: 16
- apertium-bn-en.en-bn.t3x: 2
- Error rate
File | Num. Words | % OOV | WER (Sur) | PER (Sur) | WER (Lem) | PER (Lem) |
---|---|---|---|---|---|---|
prothom-alo | - | - | | | | |
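For readers unfamiliar with the column headers, the following is a minimal sketch of how word error rate (WER) and position-independent error rate (PER) are commonly computed for a reference and a hypothesis translation. The PER formula is one common formulation and an assumption here; the evaluation tool used for this table may define it differently, and the sample sentences are made up.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Like WER, but word order is ignored (bag-of-words comparison)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


ref = "i eat rice every day".split()
hyp = "i rice eat every day".split()
print(wer(ref, hyp))  # 0.4
print(per(ref, hyp))  # 0.0
```

The example shows why both figures are reported: a pure word-order mistake raises WER but leaves PER at zero.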
Future work
There is a lot of work still to be done. Currently we are focusing entirely on the transfer system. Several issues, such as verb negation, are not easy to resolve. We have to complete that and, in the meantime, enrich the bilingual dictionary. We also plan to build a user interface for dictionary entries. But the transfer system is the main focus.
Thanks
Though the work is not complete, I have to thank Zaher for his tremendous support throughout the GSoC timeline. With the few resources available for Bengali, he has actually done a great job to bring things to this state. I'm really grateful to him for whatever I've done here. And of course, Francis helped a great deal to make things work the right way. I really learned a lot from these guys over the whole GSoC period, and I'll be really happy to work with them in future :)