Difference between revisions of "User:Rafi kamal/Application"

Latest revision as of 10:24, 15 May 2014

Contact Information

Name: Rafi Kamal
IRC nick at #apertium: rafi_kamal
GitHub: github.com/rafi-kamal
SourceForge username: rafikamal93

Why are you interested in machine translation?

I'm from Bangladesh, and Bangla is my native language, but I have to use English for many purposes. For example, the medium of instruction at my university is English. So as a general user, I've felt the need for a good machine translation system numerous times.

I've created an open source English-Bangla dictionary. I've tried to enrich its database by adding words from an existing source, and I've integrated a machine translation system into it. Unfortunately, the existing machine translation systems (Google Translate or Bing Translator) are not free. So as a developer, I always feel the need for an open-source machine translation system.

Finally, I'm planning to do research on natural language processing in Bangla in order to develop a Bangla search engine. I hope that experience working on an existing machine translation system will help my research.

Why is it that you are interested in the Apertium project?

The main reason for my interest is that Apertium is open source. I've worked on open source projects before and I really like the experience. Another reason is that a lot of work has already been done on Bangla-English translation in Apertium, and I have the opportunity to improve it.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the project 'Adopt an unreleased language pair'. I want to work on the Bangla-English language pair. Some work has already been done on this pair, focusing mainly on English to Bangla translation. My project will focus on Bangla to English translation.

I plan to do the following in the project:

1. Expanding dictionaries

Currently the Bangla monodix has an 80% coverage of the Bangla wiki. My goal is to achieve about 90% coverage.
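As a rough illustration of what a coverage figure like this measures, the sketch below counts how many tokens in analyser output received at least one analysis. It assumes the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*`; the sample string and its tags are made up for illustration, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that the analyser recognised."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is emitted as ^word/*word$, so its analysis starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Illustrative analyser output; the tags are placeholders, not real bn-en output.
sample = "^আমি/আমি<prn>$ ^কাজ/কাজ<n>$ ^করি/করা<vblex>$ ^xyz/*xyz$"
print(f"coverage: {coverage(sample):.2%}")
```

Measured this way, moving from 80% to 90% means halving the share of corpus tokens that come back with a `*`.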

Currently there are 8230 entries in the Bangla monodix (3594 nouns, 1766 proper nouns, 1620 adjectives, 473 adverbs and 777 other lemmas). I've already obtained a list of 35000 Bangla words, sorted by frequency, from the developer of Ridmik Keyboard (a popular Bangla keyboard for Android). I plan to update the dictionary with the ~3000 most frequently used words that are not currently in the dictionary.
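The selection step described above can be sketched as a simple filter over the frequency list. This is a minimal sketch under stated assumptions: `missing_frequent_words` is a hypothetical helper, the word counts are invented, and membership is checked against a plain set of known forms, whereas a real check would run each word through the morphological analyser so that inflected forms of known lemmas are not counted as missing.

```python
def missing_frequent_words(frequency_list, known_words, limit):
    """Pick the most frequent words that the dictionary does not know yet.

    frequency_list: iterable of (word, count) pairs, most frequent first.
    known_words:    set of forms already covered by the dictionary.
    limit:          maximum number of words to return.
    """
    missing = []
    for word, _count in frequency_list:
        if len(missing) >= limit:
            break
        if word not in known_words:
            missing.append(word)
    return missing


# Toy data; the counts are invented for illustration.
frequency_list = [("আমি", 900), ("না", 850), ("কাজ", 400), ("উচিত", 120)]
known_words = {"আমি", "কাজ"}

print(missing_frequent_words(frequency_list, known_words, limit=3000))
```

Because the list is already sorted by frequency, stopping at the limit keeps exactly the candidates that should raise coverage the most per entry added.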

The English monodix of this project was taken from the en-es language pair. Currently it has 7446 entries (3444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). I plan to update the English monodix with new entries from the en-es pair.

Currently the bn-en bidix contains 7446 entries (444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). There is an open source English-Bangla dictionary from Ankur.org which contains about 17000 bn-en pairs. I'll use this dictionary to enrich the bn-en bidix.

2. Handling Bangla Enclitics

The morphological analyzer can't analyze some enclitics (for example, 'টি'). I'll have to add appropriate paradigm definitions to solve this problem.

3. Disambiguation

I'll write lexical selection rules for disambiguation of bidix output. These rules will be used by Apertium's lexical selection module.

4. Adding transfer rules

Bangla and English are not closely related languages, so transfer rules play an important role in producing meaningful translations. The transfer system has been identified as the weakest part of the current system, so I'll have to do a lot of work to improve it.

I've already identified some problems with the negative forms of verbs and with modal auxiliaries. For example,

আমি কাজ করি
> I work
আমি কাজ করি না
> I work not (Should be: I don't work)

কাজটি করা উচিত
> work doing @উচিত (Should be: The work should be done)

To identify other problems, first I'll translate several corpora using the translator. Then I'll post-edit these, and identify which rules I need to add or which rules I need to modify.

Why should Google and Apertium sponsor it?

Bangla is the 7th most spoken language in the world, with about 220 million native speakers and 250 million total speakers (source: Wikipedia). Unfortunately, there is no open source machine translation system for it other than Apertium's. If I can bring this language pair to release quality, it can help millions of people.

Work Plan

Community Bonding Period

Take a deeper look at the Apertium pipeline
Read the wiki pages in detail, edit them and create new wiki pages if necessary
Prepare corpora which will be used in the coding period

Week 1

Add about 1200 nouns and 300 proper nouns in Bangla monodix
Update English monodix using English-Spanish language pair

Week 2

Add about 800 verbs, 300 adjectives, 250 adverbs and 150 other types of words in the Bangla monodix

Deliverable (Week 1-2): Updated Bangla and English monodix, where Bangla monodix would have a 90% wiki coverage
NB: Exact analysis on how many words should be added to the dictionary will be done in community bonding period

Week 3-4

Update English-Bangla bidix to include the updated words from both Bangla and English monodix (approximately 3000 new entries, containing words from both open and closed classes)

Deliverable (Week 3-4): Updated English-Bangla bidix

Week 5

Updating morphological analyzer to handle Bangla enclitics

Deliverable (Week 5): Updated morphological analyzer which can handle enclitics

Week 6

Post-edit 2-3 corpora prepared during the community bonding period
Analyze the corpora with their machine translated counterparts, calculate WER (Word Error Rate)
Identify the areas with improvement opportunity

Week 7-8

Writing lexical selection rules for disambiguation
Writing tag definitions for PoS tagger, if necessary

Deliverable (Week 6-8): A rules.xml file containing the lexical selection rules, and a .tsx file containing the tag definitions

Week 9

Writing transfer rules for translating the negative forms of verbs properly
Writing transfer rules for modal auxiliaries

Week 10

Adding transfer rules for interrogative sentences
Adding transfer rules for imperative and exclamatory sentences

Week 11

Adding transfer rules for complex and compound sentences
Adding other transfer rules based on the post-edit analysis of the corpora

Deliverable (Week 9-11): Three updated transfer rule files

Week 12

Running testvoc, creating and running regression tests

Week 13

Evaluation, writing wiki pages

Deliverable (Week 12-13): Final project

Skills and Expertise

Academic

Currently I'm a 3rd year student in the Computer Science & Engineering department at Bangladesh University of Engineering & Technology. At university I've taken courses on Discrete Mathematics, Data Structures, Algorithms, Automata Theory and Compilers. I've also successfully completed a Machine Learning course offered on Coursera.

Language

I'm a native speaker of Bangla. English is my second language. It's also the medium of my current study. I've studied Bangla grammar and English grammar in high school and college.

Programming

I know C++, Python and Java. Besides, I'm familiar with various Unix tools like sed, awk, grep. I regularly take part in competitive programming competitions. Here is my Codeforces profile.

Open-source Involvement

I've been involved in open source projects for a long time. Here are some of the open source projects I've worked on:

Ridmik Dictionary: It's an open source English-Bangla dictionary for Android, currently one of the most popular Bangla dictionaries for Android. I developed it when I was in my 2nd year.
Mothur: The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I've worked on this project to implement a confusion matrix.
ACM-solutions: It's a collection of detailed analyses and solutions of problems taken from different online judges.

Coding Challenge

As part of the coding challenge suggested by Francis, I've done the following:

Installed Apertium and bn-en language pair from SVN
Translated four 500 page articles from Bangla to English using Apertium
Postedited these four articles to create reference translation
Used apertium-eval-translator to calculate word error rate (WER) and position independent word error rate (PWER) of existing translation
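To make the two metrics concrete, here is a minimal sketch of word error rate and position-independent word error rate for a single sentence pair. It is not the implementation used by apertium-eval-translator, which may differ in tokenisation and normalisation; the position-independent formula shown is one common definition, chosen here as an assumption.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def pwer(reference: str, hypothesis: str) -> float:
    """Like WER, but compares bags of words, so word order is not penalised."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "I don't work"    # post-edited translation
hypothesis = "I work not"     # raw machine translation
print(f"WER:  {wer(reference, hypothesis):.2%}")
print(f"PWER: {pwer(reference, hypothesis):.2%}")
```

For this pair the position-independent score is lower than the plain word error rate, because the misplaced word "work" still counts as a match once order is ignored.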

Here is the evaluation result of existing bn-en translation:

Article Number	WER	PWER
1	94.35 %	89.01 %
2	93.39 %	85.89 %
3	94.54 %	87.68 %
4	89.15 %	81.85 %

Translated articles can be found at https://github.com/rafi-kamal/Apertium-Coding-Challenge.

After adding new words and transfer rules, here is the updated evaluation table:

Article Number	WER	WER Change	PWER	PWER Change
1	89.40 %	- 4.95 %	71.52 %	- 17.49 %
2	94.24 %	+ 0.85 %	76.07 %	- 9.82 %
3	93.16 %	- 1.38 %	75.58 %	- 12.10 %
4	85.58 %	- 3.57 %	67.23 %	- 14.62 %

Articles 1 and 4 are the development articles, and articles 2 and 3 are held-out articles.
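The change columns, and the contrast between development and held-out articles, can be recomputed from the two tables. The sketch below only restates the figures reported above; the variable and function names are illustrative.

```python
# (WER, PWER) in percent, per article, copied from the two tables above.
before = {1: (94.35, 89.01), 2: (93.39, 85.89), 3: (94.54, 87.68), 4: (89.15, 81.85)}
after = {1: (89.40, 71.52), 2: (94.24, 76.07), 3: (93.16, 75.58), 4: (85.58, 67.23)}

development = (1, 4)
held_out = (2, 3)
WER, PWER = 0, 1  # indices into the tuples above


def change(article, metric):
    """Change in percentage points for one article (negative means improvement)."""
    return after[article][metric] - before[article][metric]


def mean_change(articles, metric):
    """Average change in percentage points over a group of articles."""
    return sum(change(a, metric) for a in articles) / len(articles)


for name, group in (("development", development), ("held-out", held_out)):
    print(f"{name}: WER {mean_change(group, WER):+.2f}, PWER {mean_change(group, PWER):+.2f}")
```

On these figures the held-out articles improve much less in WER than the development articles (roughly a quarter of a point on average, against more than four points), while PWER drops substantially in both groups.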

I've added

55 new entries in bn-en bidix
31 new entries in bn monodix
18 new entries in en monodix
15 new transfer rules in bn-en.t1x
2 new transfer rules in bn-en.t2x

@@ Line 4: / Line 4: @@
 '''Name:''' Rafi Kamal<br>
-'''Email:''' rafikamal93@yahoo.com<br>
+'''IRC nick at #apertium:''' rafi_kamal<br>
-'''IRC nick at #apertium:''' rafi<br>
 '''GitHub:''' github.com/rafi-kamal<br>
 '''SourceForge username:''' rafikamal93<br>
@@ Line 37: / Line 36: @@
 The English monodix of this project was taken from the en-es language pair. Currently it has 7446 entries (3444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). I plan to update the English monodix with new entries of en-es pair.
+Currently the bn-en bidix contains 7446 entries (444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). There is an open source English-Bangla dictionary from [http://ankur.org/ Ankur.org] which contains about 17000 bn-en pair. I'll use this dictionary to enrich the bn-en bidix.
 ==== 2. Handling Bangla Enclitic ====
@@ Line 155: / Line 156: @@
 [https://github.com/mothur/mothur Mothur]: The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I've worked in this project to implement confusion matrix.<br>
 [https://github.com/rafi-kamal/ACM-Solutions ACM-solutions]: It's a collection of detailed analyses and solutions of problems taken from different online judges.<br>
+==== Coding Challange ====
+As part of the [http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Make_a_language_pair_state-of-the-art coding challenge] suggested by Francis, I've done the following:
+* Installed Apertium and bn-en language pair from SVN
+* Translated four 500 page articles from Bangla to English using Apertium
+* Postedited these four articles to create reference translation
+* Used <code>apertium-eval-translator</code> to calculate word error rate (WER) and position independent word error rate (PWER) of existing translation
+Here is the evaluation result of existing bn-en translation:
+{|class="wikitable"
+|-
+! Article Number
+! WER
+! PWER
+|-
+| 1
+| 94.35 %
+| 89.01 %
+|-
+| 2
+| 93.39 %
+| 85.89 %
+|-
+| 3
+| 94.54 %
+| 87.68 %
+|-
+| 4
+| 89.15 %
+| 81.85 %
+|}
+Translated articles can be found [https://github.com/rafi-kamal/Apertium-Coding-Challenge here].
+After adding new words and transfer rules, here is the updated evaluation table:
+{|class="wikitable"
+|-
+! Article Number
+! WER
+! WER Change
+! PWER
+! PWER Change
+|-
+| 1
+| 89.40 %
+| - 4.95 %
+| 71.52 %
+| - 17.49 %
+|-
+| 2
+| 94.24 %
+| + 0.85 %
+| 76.07 %
+| - 9.82 %
+|-
+| 3
+| 93.16 %
+| - 1.38 %
+| 75.58 %
+| - 12.1 %
+|-
+| 4
+| 85.58 %
+| - 3.57 %
+| 67.23 %
+| - 14.62 %
+|}
+Article 1 and 4 are the development articles, and article 2 and 3 are held out articles.
+I've added
+* 55 new entries in bn-en bidix
+* 31 new entries in bn monodix
+* 18 new entries in en monodix
+* 15 new transfer rules in bn-en.t1x
+* 2 new transfer rules in bn-en.t2x
+[[Category:GSoC 2014 Student proposals|Rafi]]