Difference between revisions of "User:Rafi kamal/Application"
[[Category:GSoC 2014 Student proposals|Rafi]] |
Revision as of 18:34, 21 March 2014
Contact Information
Name: Rafi Kamal
Email: rafikamal93@yahoo.com
IRC nick at #apertium: rafi
GitHub: github.com/rafi-kamal
SourceForge username: rafikamal93
Why are you interested in machine translation?
I'm from Bangladesh, and Bangla is my native language, but I have to use English for many purposes. For example, English is the medium of instruction at my university. So, as a general user, I have often felt the need for a good machine translation system.
I've created an open source English-Bangla dictionary. I've tried to enrich its database by adding words from an existing source and to integrate a machine translation system into it. Unfortunately, the existing machine translation systems (Google Translate or Bing Translate) are not free. So, as a developer, I have always felt the need for an open-source machine translation system.
Lastly, I'm planning to do research on natural language processing for Bangla in order to develop a Bangla search engine. I hope that the experience of working on an existing machine translation system will help in that research.
Why is it that you are interested in the Apertium project?
Apertium is open source; that is the main reason for my interest. I've worked on open source projects before and really liked the experience. Another reason is that a lot of work has already been done on Bangla-English translation in Apertium, and I have the opportunity to improve it.
Which of the published tasks are you interested in? What do you plan to do?
I'm interested in the project 'Adopt an unreleased language pair'. I want to work on the Bangla-English language pair. Some work has already been done on this pair, focusing mainly on English to Bangla translation. I'll focus on Bangla to English translation in my project.
I plan to do the following in the project:
1. Expanding dictionaries
Currently the Bangla monodix covers about 80% of the Bangla wiki. My goal is to reach about 90% coverage.
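As a rough illustration of what the coverage figure means, here is a minimal Python sketch of the arithmetic: the share of corpus tokens that the dictionary recognises. It is only an approximation under stated assumptions. The function name `naive_coverage`, the tiny sample corpus and the set `known_forms` are hypothetical, and a real measurement would run the corpus through the morphological analyser rather than compare surface forms against a set.

```python
def naive_coverage(tokens, known_forms):
    """Return the fraction of tokens whose surface form is in known_forms."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_forms)
    return known / len(tokens)


# Hypothetical toy corpus and dictionary, for illustration only.
corpus = "আমি কাজ করি আমি বই পড়ি".split()
known_forms = {"আমি", "কাজ", "করি"}

coverage = naive_coverage(corpus, known_forms)
print(f"Coverage: {coverage:.1%}")
```

Coverage here is counted per token, not per distinct word, so frequent words weigh more heavily. That is why adding the most frequent missing words first should raise coverage fastest.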
Currently there are 8230 entries in the Bangla monodix (3594 nouns, 1766 proper nouns, 1620 adjectives, 473 adverbs and 777 other lemmas). I've already obtained, from the developer of Ridmik Keyboard (a popular Bangla keyboard for Android), a list of 35000 Bangla words sorted by frequency. I plan to add to the dictionary the ~3000 most frequently used words that are not currently in it.
The English monodix of this project was taken from the en-es language pair. Currently it has 7446 entries (3444 nouns, 1686 proper nouns, 1384 adjectives, 1243 other lemmas). I plan to update the English monodix with the new entries from the en-es pair.
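The selection step for the Bangla word list could look roughly like the following Python sketch. It assumes the frequency list is already ordered from most to least frequent and that dictionary membership can be checked with a simple set lookup. The names `select_new_words`, `ranked` and `existing` are hypothetical, and matching inflected forms to lemmas is deliberately left out.

```python
def select_new_words(freq_ranked_words, existing_lemmas, limit=3000):
    """Pick up to `limit` words, in frequency order, that are not yet in the dictionary."""
    selected = []
    seen = set(existing_lemmas)
    for word in freq_ranked_words:
        if len(selected) >= limit:
            break
        if word in seen:
            continue  # already in the dictionary, or a duplicate in the list
        selected.append(word)
        seen.add(word)
    return selected


# Hypothetical sample data; the real list has about 35000 entries.
ranked = ["আমি", "কাজ", "বই", "কাজ", "নদী", "ফুল"]
existing = {"আমি", "কাজ"}

picked = select_new_words(ranked, existing, limit=2)
print(picked)
```

Keeping the frequency order in the result means the list can be worked through from the top, so the most useful words are added first even if the full target is not reached.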
2. Handling Bangla enclitics
The morphological analyzer can't analyze some enclitics (for example, 'টি'). I'll have to add appropriate paradigm definitions to solve this problem.
3. Disambiguation
I'll write lexical selection rules for disambiguation of bidix output. These rules will be used by Apertium's lexical selection module.
4. Adding transfer rules
Bangla and English are not closely related languages, so transfer rules play an important role in producing meaningful translations. The transfer system has been identified as the weakest part of the current system, so I will have to work a lot to improve this part.
I've already identified some problems with the negative forms of verbs and with modal auxiliaries. For example:
আমি কাজ করি > I work
আমি কাজ করি না > I work not (Should be: I don't work)
কাজটি করা উচিত > work doing @উচিত (Should be: The work should be done)
To identify other problems, I'll first translate several corpora using the translator. Then I'll post-edit the output and identify which rules I need to add or modify.
Why Google and Apertium should sponsor it?
Bangla is the 7th most spoken language in the world, with about 220 million native speakers and 250 million total speakers (source: Wikipedia). Unfortunately, there is no open source machine translation system for it except Apertium's. If I can bring this language pair to release quality, it can help millions of people.
Work Plan
Community Bonding Period
Take a deeper look at the Apertium pipeline
Read the wiki pages in detail
Prepare corpora which will be used in the coding period
Week 1
Add about 1200 nouns and 300 proper nouns to the Bangla monodix
Update the English monodix using the English-Spanish language pair
Week 2
Add about 800 verbs, 300 adjectives, 250 adverbs and 150 other types of words to the Bangla monodix
Deliverable (Week 1-2): Updated Bangla and English monodix, where the Bangla monodix would have 90% wiki coverage
N.B.: The exact analysis of how many words should be added to the dictionary will be done in the community bonding period
Week 3-4
Update the English-Bangla bidix to include the updated words from both the Bangla and English monodix (approximately 3000 new entries, containing words from both open and closed classes)
Deliverable (Week 3-4): Updated English-Bangla bidix
Week 5
Update the morphological analyzer to handle Bangla enclitics
Deliverable (Week 5): Updated morphological analyzer which can handle enclitics
Week 6
Post-edit 2-3 corpora prepared during the community bonding period
Compare the corpora with their machine-translated counterparts and calculate WER (Word Error Rate)
Identify the areas with opportunities for improvement
Week 7-8
Write lexical selection rules for disambiguation
Write tag definitions for the PoS tagger, if necessary
Deliverable (Week 6-8): A rules.xml file containing the lexical selection rules, and a .tsx file containing the tag definitions
Week 9
Write transfer rules for translating the negative forms of verbs properly
Write transfer rules for modal auxiliaries
Week 10
Add transfer rules for interrogative sentences
Add transfer rules for imperative and exclamatory sentences
Week 11
Add transfer rules for complex and compound sentences
Add other transfer rules based on the post-edit analysis of the corpora
Deliverable (Week 9-11): Three updated transfer rule files
Week 12
Run testvoc, create and run regression tests
Week 13
Evaluation, writing wiki pages
Deliverable (Week 12-13): Final project
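The WER calculation planned for Week 6 can be sketched in Python as the word-level edit distance between the post-edited reference and the machine translation, divided by the number of reference words. This is a minimal sketch assuming whitespace tokenisation and equal cost for insertions, deletions and substitutions; the function name `word_error_rate` is hypothetical and this is not an existing Apertium evaluation tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Post-edited reference versus current system output, taken from the
# negation example in the proposal.
wer = word_error_rate("I don't work", "I work not")
print(f"WER: {wer:.2f}")
```

Because the denominator is the reference length, WER can exceed 1 when the machine output is much longer than the post-edited text.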
Skills and Expertise
Academic
I'm currently a 3rd-year student in the Department of Computer Science & Engineering at Bangladesh University of Engineering & Technology. At university I've taken courses on Discrete Mathematics, Data Structures, Algorithms, Automata Theory and Compilers. I have also successfully completed a Machine Learning course offered through Coursera.
Language
I'm a native speaker of Bangla. English is my second language, and it is also the medium of instruction in my current studies. I studied Bangla grammar and English grammar in high school and college.
Programming
I know C++, Python and Java. I'm also familiar with various Unix tools such as sed, awk and grep. I regularly take part in programming contests. Here is my Codeforces profile.
Open-source Involvement
I've been involved in open source projects for a long time. Here are some of the open source projects I've worked on:
Ridmik Dictionary: An open source English-Bangla dictionary for Android, currently one of the most popular Bangla dictionaries for Android. I developed it when I was in my 2nd year.
Mothur: The Mothur project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community. I've worked on this project, implementing a confusion matrix.
ACM-solutions: A collection of detailed analyses and solutions of problems taken from different online judges.