Difference between revisions of "User:Rafi kamal/Application"


Revision as of 18:34, 21 March 2014

Contact Information

Name: Rafi Kamal
Email: rafikamal93@yahoo.com
IRC nick at #apertium: rafi
GitHub: github.com/rafi-kamal
SourceForge username: rafikamal93


Why are you interested in machine translation?

I'm from Bangladesh and Bangla is my native language, but I have to use English for many purposes. For example, the medium of education at my university is English. So as a general user, I have felt the need for a good machine translation system numerous times.

I've created an open source English-Bangla dictionary. I've tried to enrich its database by adding words from an existing source, and to integrate a machine translation system into it. Unfortunately, the existing machine translation systems (Google Translate or Bing Translator) are not free. So as a developer, I always feel the need for an open-source machine translation system.

Lastly, I'm planning to do research on natural language processing in Bangla in order to develop a Bangla search engine. I hope that experience working on an existing machine translation system will help my research.


Why is it that you are interested in the Apertium project?

Apertium is open source; that is the main reason for my interest. I've worked on open source projects before and I really like the experience. Another reason is that a lot of work has already been done on Bangla-English translation in Apertium, and I have the opportunity to improve it.



Which of the published tasks are you interested in? What do you plan to do?

I'm interested in the project 'Adopt an unreleased language pair'. I want to work on the Bangla-English language pair. Some work has already been done on this pair, focusing mainly on English to Bangla translation. I'll focus on Bangla to English translation in my project.

I plan to do the following in the project:

1. Expanding dictionaries

Currently the Bangla monodix has 80% coverage of the Bangla wiki. My goal is to achieve about 90% coverage.
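As a rough illustration of how such a coverage figure could be measured, here is a minimal sketch. It assumes the corpus has already been run through the morphological analyser and that the output follows the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*`. The function name and the tags in the sample are hypothetical, and this naive count treats every lexical unit (including punctuation) equally.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # The analyser marks an unknown word by prefixing its analysis with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyser output: two known words and one unknown word.
sample = "^আমি/আমি<prn><p1><sg>$ ^কাজ/কাজ<n>$ ^করি/*করি$"
print(f"{coverage(sample):.2%}")
```

Running the same measurement before and after each dictionary update would show how close the monodix is to the 90% goal.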

Currently there are 8230 entries in the Bangla monodix (3594 nouns, 1766 proper nouns, 1620 adjectives, 473 adverbs and 777 other lemmas). From the developer of Ridmik Keyboard (a popular Bangla keyboard for Android), I've already collected a list of 35000 Bangla words sorted by frequency. I plan to update the dictionary with the ~3000 most frequently used words which are not currently in the dictionary.
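The selection step could look roughly like the sketch below. It is only an illustration: the function name and its arguments are hypothetical, and it compares words by exact string match, whereas a real frequency list contains inflected surface forms that would first need to be checked against the analyser rather than against the list of lemmas.

```python
def pick_missing_words(frequency_sorted_words, known_lemmas, limit=3000):
    """Return up to `limit` words, in frequency order, that are not in the dictionary."""
    known = set(known_lemmas)
    seen = set()
    picked = []
    for word in frequency_sorted_words:
        # Skip words already covered and duplicates in the frequency list.
        if word in known or word in seen:
            continue
        seen.add(word)
        picked.append(word)
        if len(picked) == limit:
            break
    return picked


# Toy data: the list is assumed to be sorted from most to least frequent.
candidates = pick_missing_words(["কাজ", "আমি", "কাজ", "উচিত", "বই"], {"আমি"}, limit=2)
print(candidates)
```

Because the input is already sorted by frequency, stopping at the limit keeps the words expected to contribute most to coverage.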

The English monodix of this project was taken from the en-es language pair. Currently it has 7446 entries (3444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). I plan to update the English monodix with the new entries from the en-es pair.

2. Handling Bangla enclitics

The morphological analyzer cannot analyze some enclitics (for example, 'টি'). I have to add appropriate paradigm definitions to solve this problem.

3. Disambiguation

I'll write lexical selection rules for disambiguation of bidix output. These rules will be used by Apertium's lexical selection module.

4. Adding transfer rules

Bangla and English are not closely related languages, so transfer rules play an important role in producing meaningful translations. The transfer system has been identified as the weakest part of the current system, so I will have to do a lot of work to improve it.

I've already identified some problems with the negative forms of verbs and with modal auxiliaries. For example,

আমি কাজ করি > I work
আমি কাজ করি না > I work not (Should be: I don't work)

কাজটি করা উচিত > work doing @উচিত (Should be: The work should be done)

To identify other problems, first I'll translate several corpora using the translator. Then I'll post-edit these, and identify which rules I need to add or which rules I need to modify.



Why should Google and Apertium sponsor it?

Bangla is the 7th most spoken language in the world, with about 220 million native speakers and 250 million total speakers (source: Wikipedia). Unfortunately, there is no open source machine translation system for it other than Apertium's. If I can bring this language pair to release quality, it can help millions of people.



Work Plan

Community Bonding Period

Take a deeper look at the Apertium pipeline
Read the wiki pages in detail
Prepare corpora which will be used in the coding period

Week 1

Add about 1200 nouns and 300 proper nouns to the Bangla monodix
Update the English monodix using the English-Spanish language pair

Week 2

Add about 800 verbs, 300 adjectives, 250 adverbs and 150 other types of words to the Bangla monodix
Deliverable (Week 1-2): Updated Bangla and English monodix, where the Bangla monodix would have 90% wiki coverage
N.B.: The exact analysis of how many words should be added to the dictionary will be done in the community bonding period

Week 3-4

Update the English-Bangla bidix to include the updated words from both the Bangla and English monodix (approximately 3000 new entries, containing words from both open and closed classes)
Deliverable (Week 3-4): Updated English-Bangla bidix

Week 5

Update the morphological analyzer to handle Bangla enclitics
Deliverable (Week 5): Updated morphological analyzer which can handle enclitics

Week 6

Post-edit 2-3 corpora prepared during the community bonding period
Compare the post-edited corpora with their machine-translated counterparts and calculate the WER (Word Error Rate)
Identify the areas with opportunities for improvement

Week 7-8

Write lexical selection rules for disambiguation
Write tag definitions for the PoS tagger, if necessary
Deliverable (Week 6-8): A rules.xml file containing the lexical selection rules, and a .tsx file containing the tag definitions

Week 9

Write transfer rules for translating the negative forms of verbs properly
Write transfer rules for modal auxiliaries

Week 10

Add transfer rules for interrogative sentences
Add transfer rules for imperative and exclamatory sentences

Week 11

Add transfer rules for complex and compound sentences
Add other transfer rules based on the post-edit analysis of the corpora
Deliverable (Week 9-11): Three updated transfer rule files

Week 12

Run testvoc, create and run regression tests

Week 13

Evaluation, writing wiki pages
Deliverable (Week 12-13): Final project
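The WER calculation planned for Week 6 can be sketched as follows. This is a minimal illustration, not the evaluation script that would actually be used: it assumes whitespace tokenisation, treats the post-edited sentence as the reference, and the function name is hypothetical. WER is taken here as the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Post-edited reference versus the current machine translation output.
print(word_error_rate("I don't work", "I work not"))
```

For the negation example from the proposal, two of the three reference words need an edit, so the score is 2/3; a lower score after adding transfer rules would indicate an improvement.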



Skills and Expertise

Academic

I'm currently a 3rd year student in the Department of Computer Science & Engineering at Bangladesh University of Engineering & Technology. At university I've taken courses on Discrete Math, Data Structures, Algorithms, Automata Theory and Compilers. In addition, I've successfully completed a Machine Learning course offered on Coursera.

Language

I'm a native speaker of Bangla. English is my second language, and it is also the medium of instruction in my current studies. I studied Bangla grammar and English grammar in high school and college.

Programming

I know C++, Python and Java. Besides, I'm familiar with various Unix tools like sed, awk, grep. I regularly take part in competitive programming competitions. Here is my Codeforces profile.

Open-source Involvement

I've been involved in open source projects for a long time. Here are some of the open source projects I've worked on:

Ridmik Dictionary: It's an open source English-Bangla dictionary for Android, currently one of the most popular Bangla dictionaries for Android. I developed this when I was in 2nd year.
Mothur: The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I worked on this project to implement a confusion matrix.
ACM-solutions: It's a collection of detailed analyses and solutions of problems taken from different online judges.