User:Rafi kamal/Application

From Apertium
Revision as of 18:31, 21 March 2014 by Rafi kamal (talk | contribs)

Contact Information

Name: Rafi Kamal
Email: rafikamal93@yahoo.com
IRC nick at #apertium: rafi
GitHub: github.com/rafi-kamal
SourceForge username: rafikamal93

Why are you interested in machine translation?

I'm from Bangladesh and Bangla is my native tongue, but I have to use English for many purposes. For example, the medium of instruction at my university is English. So as a general user, I have felt the need for a good machine translation system many times.

I've created an open source English-Bangla dictionary. I have tried to add more words to its database and to integrate a machine translation system into it. Unfortunately, the existing machine translation systems (Google Translate or Bing Translator) are not free. So as a developer, I feel the need for an open-source machine translation system.

Lastly, I'm planning to do research on natural language processing for Bangla in order to develop a Bangla search engine. I hope that experience working on an existing machine translation system will help my research.

Why is it that you are interested in the Apertium project?

Apertium is open source; that is the main reason for my interest. I've worked on open source projects before and I really enjoy it. Another reason is that a lot of work has already been done on Bangla-English translation in Apertium, and I have the opportunity to contribute to it.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the project 'Adopt an unreleased language pair'. I want to work on the Bangla-English language pair. Some work has already been done on this pair, focusing mainly on English to Bangla translation. I'll focus on Bangla to English translation.

I plan to do the following in the project:

Expanding dictionaries

Currently the Bangla monodix has an 80% coverage of the Bangla wiki. My goal is to achieve about 88-90% coverage.
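As a rough illustration of how such a coverage figure can be measured, the sketch below counts the share of tokens that receive at least one analysis in morphological analyser output. It assumes the output is in the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word has an analysis starting with an asterisk. The function name naive_coverage is hypothetical, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters such as \/ or \$ are not handled here)
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one analysis.

    Unknown words are those whose analysis part starts with '*'.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)
```

In practice the input would be the analyser's output for a large corpus such as a Wikipedia dump, so that the figure is comparable with the 80% quoted above.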

Currently there are 7446 entries in the Bangla monodix. I've already obtained a list of 35000 Bangla words, sorted by frequency, from the developer of Ridmik Keyboard (a popular Bangla keyboard for Android). I plan to update the dictionary with the ~3000 most frequently used words that are not currently in it.
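The selection step described above could be sketched as follows. This is only an illustration under simplifying assumptions: the frequency list is taken to be a sequence of surface forms ordered from most to least frequent, and "already in the dictionary" is approximated by membership in a plain set of known forms. A real check would more likely run each word through the analyser, since the monodix stores lemmas and paradigms rather than every surface form. The name pick_missing_words is hypothetical.

```python
def pick_missing_words(frequency_ranked, known_words, limit):
    """Return up to `limit` words from a frequency-ranked list that are
    not in `known_words`, keeping the original frequency order."""
    known = set(known_words)
    picked = []
    seen = set()  # guards against duplicates in the frequency list
    for word in frequency_ranked:
        if len(picked) >= limit:
            break
        if word in known or word in seen:
            continue
        picked.append(word)
        seen.add(word)
    return picked
```

For the plan above, limit would be about 3000 and frequency_ranked would be the 35000-word list.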

The English monodix of this project was taken from the en-es language pair. I plan to update the English monodix with new entries of en-es pair.

Handling Bangla Enclitics

The morphological analyzer can't analyze some enclitics (for example, 'টি'). I have to add appropriate paradigm definitions to solve this problem.

Disambiguation

I'll write lexical selection rules for disambiguation of bidix output. These rules will be used by Apertium's lexical selection module.

Adding transfer rules

Bangla and English are structurally very different languages, so transfer rules play an important part in producing meaningful translations. The transfer system has been identified as the weakest part of the current system, so I will have to work a lot to improve it.

I've already identified some problems with the negative forms of verbs and with modal auxiliaries. For example:

আমি কাজ করি > I work

আমি কাজ করি না > I work not (Should be: I don't work)

কাজটি করা উচিত > work doing @উচিত (Should be: The work should be done)


To identify other problems, I'll first translate several corpora using the translator. Then I'll post-edit the output and identify which rules I need to add or modify.

Why Google and Apertium should sponsor it

Bangla is the 7th most spoken language in the world, with about 220 million native speakers and 250 million total speakers (source: Wikipedia). Unfortunately, there is no open source machine translation system for it other than Apertium's. If I can bring this language pair to release quality, it can help millions of people.

Work Plan

Community Bonding Period

Take a deeper look at the Apertium pipeline
Read the wiki pages in detail
Prepare corpora which will be used in the coding period

Week 1-2

Add ~3000 words to the Bangla monodix to achieve 85-87% coverage
Update the English monodix using the en-es language pair
Deliverable (Week 1-2): Updated Bangla and English monodix

Week 3-4

Update the English-Bangla bidix to include the updated words from both the Bangla and English monodix (approximately 3000 new entries)
Deliverable (Week 3-4): Updated English-Bangla bidix

Week 5

Update the morphological analyzer to handle Bangla enclitics
Deliverable (Week 5): Updated morphological analyzer which can handle enclitics

Week 6

Post-edit 2-3 corpora prepared during the community bonding period
Compare the post-edited corpora with their machine-translated counterparts and calculate WER (Word Error Rate)
Identify the areas with room for improvement

Week 7-8

Write lexical selection rules for disambiguation
Write a tag definitions file for the PoS tagger, if necessary
Deliverable (Week 6-8): A rules.xml file containing the lexical selection rules, and a .tsx file containing the tag definitions

Week 9

Write transfer rules for translating the negative forms of verbs properly
Write transfer rules for modal auxiliaries

Week 10

Add transfer rules for interrogative sentences
Add transfer rules for imperative sentences

Week 11

Add transfer rules for exclamatory sentences
Add other transfer rules based on the post-edit analysis of the corpus
Deliverable (Week 9-11): Three updated transfer rule files

Week 12

Run testvoc, create and run regression tests

Week 13

Evaluation, writing wiki pages
Deliverable (Week 12-13): Final project
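The WER calculation planned for Week 6 can be illustrated with a minimal sketch. It assumes whitespace tokenisation and treats the post-edited text as the reference and the raw machine translation as the hypothesis. The function name word_error_rate is hypothetical, and this is not the evaluation script Apertium itself ships.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Applied to the negation example from this proposal, the reference "I don't work" and the hypothesis "I work not" differ by two word edits over three reference words, giving a WER of about 0.67.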

Skills and Expertise

Academic

Currently I'm a 3rd year Computer Science & Engineering student at Bangladesh University of Engineering & Technology. At the university I've taken courses on Discrete Math, Data Structures, Algorithms, Automata Theory and Compilers. I have also successfully completed a Machine Learning course offered on Coursera.

Language

I'm a native speaker of Bangla. English is my second language. It's also the medium of my current study. I've studied Bangla grammar and English grammar in high school and college.

Programming

I know C++, Python and Java, and I'm familiar with Unix tools such as sed, awk and grep. I regularly take part in programming contests. Here is my Codeforces profile.

Open-source Involvement

I've been involved in open source projects for a long time. Here are some of the open source projects I've worked on:

Ridmik Dictionary: It's an open source English-Bangla dictionary for Android, currently one of the most popular Bangla dictionaries for Android. I developed this when I was in 2nd year.

Mothur: The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I've worked on this project to implement a confusion matrix.

ACM-solutions: It's a collection of detailed analyses and solutions of problems taken from different online judges.