Difference between revisions of "User:Rafi kamal/Application"

From Apertium
Latest revision as of 10:24, 15 May 2014

Contact Information

Name: Rafi Kamal
IRC nick at #apertium: rafi_kamal
GitHub: github.com/rafi-kamal
SourceForge username: rafikamal93

Why are you interested in machine translation?

I'm from Bangladesh and Bangla is my native language, but I have to use English for many purposes. For example, the medium of instruction at my university is English. So as a general user, I've felt the need for a good machine translation system numerous times.

I've created an open source English-Bangla dictionary. I've tried to enrich its database by adding words from an existing source, and I've also integrated a machine translation system into it. Unfortunately, the existing machine translation systems (Google Translate or Bing Translate) are not free. So as a developer, I always feel the need for an open-source machine translation system.

Lastly, I'm planning to do research on natural language processing in Bangla in order to develop a Bangla search engine. I hope that experience working on an existing machine translation system will help my research.

Why is it that you are interested in the Apertium project?

Apertium is open source; that's the main reason for my interest. I've worked on open source projects before and I really like the experience. Another reason is that a lot of work has already been done on Bangla-English translation in Apertium, and I have the opportunity to improve it.

Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the project 'Adopt an unreleased language pair'. I want to work on the Bangla-English language pair. Some work has already been done on this pair, focusing mainly on English-to-Bangla translation. My project will focus on Bangla-to-English translation.

I plan to do the following in the project:

1. Expanding dictionaries

Currently the Bangla monodix has about 80% coverage of the Bangla wiki. My goal is to reach about 90% coverage.
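
As a rough sketch of how such a coverage figure could be measured: assuming the corpus has already been run through the morphological analyser, and assuming unknown tokens appear in the output stream as ^word/*word$ (the usual lt-proc convention), coverage is simply the share of tokens that received at least one analysis. The function name, the sample stream and the tags in it are illustrative assumptions, not taken from the bn-en pair.

```python
import re

# Matches one lexical unit in an Apertium-style stream: ^surface/analyses$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis.

    Assumes unknown words are marked with a leading '*' in the analysis
    part of the unit, e.g. ^foo/*foo$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analysis in units if not analysis.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: four known tokens and one unknown token.
sample = (
    "^আমি/আমি<prn><p1><sg>$ ^কাজ/কাজ<n>$ ^করি/কর<vblex><pres>$ "
    "^xyz/*xyz$ ^./.<sent>$"
)
print(f"coverage = {coverage(sample):.0%}")
```

On a real corpus the same count would be taken over the whole analysed text rather than a single sentence.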

Currently there are 8230 entries in the Bangla monodix (3594 nouns, 1766 proper nouns, 1620 adjectives, 473 adverbs and 777 other lemmas). From the developer of Ridmik Keyboard (a popular Bangla keyboard for Android) I've already collected a list of 35000 Bangla words sorted by frequency. I plan to update the dictionary with the ~3000 most frequently used words that are not currently in it.
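
The selection step could look roughly like the sketch below. It assumes the frequency list is an ordered sequence of words and that the existing monodix lemmas are available as a set; both names are hypothetical. Since a frequency list contains surface forms rather than lemmas, a real run would more likely test each word with the analyser instead of a plain set lookup.

```python
from typing import Iterable, List, Set


def select_candidates(frequency_sorted: Iterable[str], known: Set[str], limit: int) -> List[str]:
    """Return up to `limit` words, in frequency order, that are missing from `known`."""
    if limit <= 0:
        return []
    picked: List[str] = []
    seen: Set[str] = set()
    for word in frequency_sorted:
        # Skip words already in the dictionary and duplicates in the list.
        if word in known or word in seen:
            continue
        seen.add(word)
        picked.append(word)
        if len(picked) == limit:
            break
    return picked


# Toy data; the real list has about 35000 words and the limit would be ~3000.
frequency_list = ["এবং", "কাজ", "আমি", "বই", "কাজ", "নদী"]
monodix_lemmas = {"আমি", "কাজ"}
print(select_candidates(frequency_list, monodix_lemmas, limit=2))
```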

The English monodix of this project was taken from the en-es language pair. Currently it has 7446 entries (3444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). I plan to update the English monodix with new entries from the en-es pair.

Currently the bn-en bidix contains 7446 entries (444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). There is an open source English-Bangla dictionary from Ankur.org (http://ankur.org/) which contains about 17000 bn-en pairs. I'll use this dictionary to enrich the bn-en bidix.

2. Handling Bangla Enclitics

The morphological analyzer can't analyze some enclitics (for example, 'টি'). I will need to add appropriate paradigm definitions to solve this problem.

3. Disambiguation

I'll write lexical selection rules for disambiguation of bidix output. These rules will be used by Apertium's lexical selection module.

4. Adding transfer rules

Bangla and English are not closely related languages, so transfer rules play an important role in producing meaningful translations. The transfer system has been identified as the weakest part of the current system, so I will need to put considerable work into improving it.

I've already identified some problems with the negative forms of verbs and with modal auxiliaries. For example:

আমি কাজ করি
> I work
আমি কাজ করি না
> I work not (Should be: I don't work)

কাজটি করা উচিত
> work doing @উচিত (Should be: The work should be done)

To identify other problems, I'll first translate several corpora using the translator. Then I'll post-edit the output and identify which rules need to be added or modified.
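
The negation problem in the examples above can be illustrated with a toy reordering rule. This is only a sketch in Python of what such a rule has to do; the real rules would live in Apertium's transfer rule files, and the token representation and tag names used here ("verb", "neg", "aux") are hypothetical. It also ignores agreement and tense (for example "doesn't" or "didn't").

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (English lemma, coarse part-of-speech tag)


def move_negation(tokens: List[Token]) -> List[Token]:
    """Rewrite 'VERB not' as "don't VERB" in a word-for-word translated sentence."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if current[1] == "verb" and following == ("not", "neg"):
            out.append(("don't", "aux"))
            out.append(current)
            i += 2  # consume both the verb and the negative particle
        else:
            out.append(current)
            i += 1
    return out


# Word-for-word output for "আমি কাজ করি না", as in the example above.
literal = [("I", "pron"), ("work", "verb"), ("not", "neg")]
print(" ".join(word for word, _tag in move_negation(literal)))
```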

Why should Google and Apertium sponsor it?

Bangla is the 7th most spoken language in the world, with about 220 million native speakers and 250 million total speakers (source: Wikipedia, http://en.wikipedia.org/wiki/Bengali_language). Unfortunately, there is no open source machine translation system for it other than Apertium's. If I can bring this language pair to release quality, it can help millions of people.

Work Plan

Community Bonding Period

  • Take a deeper look at the Apertium pipeline
  • Read the wiki pages in detail, editing them and creating new wiki pages if necessary
  • Prepare the corpora that will be used in the coding period

Week 1

  • Add about 1200 nouns and 300 proper nouns to the Bangla monodix
  • Update the English monodix using the English-Spanish language pair

Week 2

  • Add about 800 verbs, 300 adjectives, 250 adverbs and 150 other types of words to the Bangla monodix

Deliverable (Week 1-2): Updated Bangla and English monodix, with the Bangla monodix reaching 90% wiki coverage
NB: The exact analysis of how many words should be added to the dictionary will be done during the community bonding period

Week 3-4

  • Update the English-Bangla bidix to include the updated words from both the Bangla and English monodix (approximately 3000 new entries, containing words from both open and closed classes)

Deliverable (Week 3-4): Updated English-Bangla bidix

Week 5

  • Update the morphological analyzer to handle Bangla enclitics

Deliverable (Week 5): Updated morphological analyzer which can handle enclitics

Week 6

  • Post-edit 2-3 corpora prepared during the community bonding period
  • Compare the post-edited corpora with their machine-translated counterparts and calculate WER (Word Error Rate)
  • Identify areas for improvement
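
For reference, WER is commonly defined as the word-level edit distance between the machine translation and the post-edited reference, divided by the number of words in the reference. The sketch below follows that common definition; it is not the implementation used by any particular Apertium tool, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` against a whitespace-tokenised `reference`."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "I don't work"
machine_output = "I work not"
print(f"WER = {wer(reference, machine_output):.2%}")
```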

Week 7-8

  • Write lexical selection rules for disambiguation
  • Write tag definitions for the PoS tagger, if necessary

Deliverable (Week 6-8): A rules.xml file containing the lexical selection rules, and a .tsx file containing the tag definitions

Week 9

  • Write transfer rules for translating negative forms of verbs properly
  • Write transfer rules for modal auxiliaries

Week 10

  • Add transfer rules for interrogative sentences
  • Add transfer rules for imperative and exclamatory sentences

Week 11

  • Add transfer rules for complex and compound sentences
  • Add other transfer rules based on the post-edit analysis of the corpora

Deliverable (Week 9-11): Three updated transfer rule files

Week 12

  • Run testvoc; create and run regression tests

Week 13

  • Evaluation, writing wiki pages

Deliverable (Week 12-13): Final project

Skills and Expertise

Academic

I'm currently a third-year student in the Department of Computer Science & Engineering at Bangladesh University of Engineering & Technology. At university I've taken courses on Discrete Math, Data Structures, Algorithms, Automata Theory and Compilers. In addition, I've successfully completed a Machine Learning course offered through Coursera.

Language

I'm a native speaker of Bangla. English is my second language and the medium of instruction in my current studies. I studied Bangla grammar and English grammar in high school and college.

Programming

I know C++, Python and Java. I'm also familiar with various Unix tools such as sed, awk and grep. I regularly take part in competitive programming contests. Here is my Codeforces profile: http://codeforces.com/profile/rafi_kamal

Open-source Involvement

I've been involved in open source projects for a long time. Here are some of the open source projects I've worked on:

Ridmik Dictionary (https://play.google.com/store/apps/details?id=buet.rafi.dictionary): An open source English-Bangla dictionary for Android, currently one of the most popular Bangla dictionaries for Android. I developed it during my second year.
Mothur (https://github.com/mothur/mothur): The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I worked on this project to implement a confusion matrix.
ACM-solutions (https://github.com/rafi-kamal/ACM-Solutions): A collection of detailed analyses and solutions of problems taken from different online judges.

Coding Challenge

As part of the coding challenge (http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Make_a_language_pair_state-of-the-art) suggested by Francis, I've done the following:

  • Installed Apertium and the bn-en language pair from SVN
  • Translated four 500 page articles from Bangla to English using Apertium
  • Post-edited these four articles to create reference translations
  • Used apertium-eval-translator to calculate the word error rate (WER) and position-independent word error rate (PWER) of the existing translation
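
To make the difference between the two metrics concrete, here is a sketch of one common formulation of the position-independent error rate, which compares the two texts as bags of words. The exact formula used by apertium-eval-translator may differ; this version and its function name are assumptions for illustration only.

```python
from collections import Counter


def position_independent_wer(reference: str, hypothesis: str) -> float:
    """Bag-of-words error rate: word order is ignored when matching."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # The multiset intersection counts words that match regardless of position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return errors / len(ref)


# Same sentence pair as the negation example: only the word order and one
# word differ, so the position-independent score is lower than plain WER.
print(f"PWER = {position_independent_wer('I don' + chr(39) + 't work', 'I work not'):.2%}")
```

Because reordered words are not penalised, this score can never exceed the WER of the same sentence pair, which is consistent with PWER being lower than WER in the tables.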

Here are the evaluation results for the existing bn-en translation:

Article Number   WER       PWER
1                94.35 %   89.01 %
2                93.39 %   85.89 %
3                94.54 %   87.68 %
4                89.15 %   81.85 %

The translated articles can be found at https://github.com/rafi-kamal/Apertium-Coding-Challenge

After adding new words and transfer rules, here is the updated evaluation table:

Article Number   WER       WER Change   PWER      PWER Change
1                89.40 %   - 4.95 %     71.52 %   - 17.49 %
2                94.24 %   + 0.85 %     76.07 %   - 9.82 %
3                93.16 %   - 1.38 %     75.58 %   - 12.10 %
4                85.58 %   - 3.57 %     67.23 %   - 14.62 %

Articles 1 and 4 are the development articles, and articles 2 and 3 are held-out articles.
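
The change columns are plain differences in percentage points (new value minus old value). The short sketch below recomputes them from the two tables and averages them separately for the development and held-out articles; the variable and function names are illustrative.

```python
# (article, WER before, PWER before, WER after, PWER after), values in percent
results = [
    (1, 94.35, 89.01, 89.40, 71.52),
    (2, 93.39, 85.89, 94.24, 76.07),
    (3, 94.54, 87.68, 93.16, 75.58),
    (4, 89.15, 81.85, 85.58, 67.23),
]
development = {1, 4}  # articles used while adding words and rules


def mean_change(rows, before_index, after_index):
    """Average of (after - before) over the given rows, in percentage points."""
    changes = [row[after_index] - row[before_index] for row in rows]
    return sum(changes) / len(changes)


dev_rows = [row for row in results if row[0] in development]
held_rows = [row for row in results if row[0] not in development]

for label, rows in (("development", dev_rows), ("held-out", held_rows)):
    print(f"{label}: WER {mean_change(rows, 1, 3):+.2f}, PWER {mean_change(rows, 2, 4):+.2f}")
```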

I've added:

  • 55 new entries to the bn-en bidix
  • 31 new entries to the bn monodix
  • 18 new entries to the en monodix
  • 15 new transfer rules to bn-en.t1x
  • 2 new transfer rules to bn-en.t2x