Difference between revisions of "User:Srj31/GSOC 2020 proposal:Bengali-Hindi pair"
Revision as of 12:34, 25 March 2020
== Contact information ==
Name: Sourabh Raj Jaiswal
Location: Noida, India
E-mail address: sourabhrj31@gmail.com
IRC: srj31
Timezone: UTC +5:30
Github: https://github.com/srj31
Why is it that you are interested in Machine Translation and Apertium?
I had long wanted to work on a project involving Machine Translation and was intrigued by how computers could understand what we say. This curiosity led me to study linguistics in high school. I came fifth at the National Linguistics Olympiad and had the honor of representing Team India at the International Linguistics Olympiad 2019 in South Korea. This experience and three years of studying linguistics, along with my interest in NLP, got me interested in a Machine Translation project. Apertium is an open-source platform for developing rule-based machine translation systems, which makes me want to contribute to it, and it will give me the opportunity to go on to create more language pairs for the various languages of India.
Which of the published tasks are you interested in? What do you plan to do?
I plan to adopt the unreleased language pair Hindi-Bengali and improve it until it can translate sentences coherently. Bengali and Hindi are similar in many linguistic aspects, so their rules will be similar, and a carefully built rule-based system should be able to produce correct sentences.
== How it will benefit ==
Bengali is the official and most widely spoken language of Bangladesh and the second most widely spoken of the 22 scheduled languages of India, behind Hindi. Translation allows ideas and information to spread across cultures. In the process, translation changes history. Since Hindi and Bengali are the two most widely spoken languages in India, a translation pair between them will help spread the culture and literature of India. The current Bengali-Hindi pair has only a bilingual dictionary containing some nouns and little else. With this project I aim to create a working language pair that performs much better and produces correct translations.
== Work Plan ==
Post-application period:
Find language resources for both languages and for the ben-hin pair
Learn more about the Apertium dictionaries and tools
Community Bonding:
Getting familiar with all the Apertium modules and how they work. Discussing with mentors and clearing up doubts.
First Phase
Week 1:
Improving the monolingual dictionaries
Adding nouns, prepositions, adjectives, adverbs to the bilingual dictionary of the en-bn pair.
Week 2:
Adding verbs, pronouns, conjunctions in the bilingual dictionaries
Writing the transfer rules for verbs and nouns.
Week 3:
Continue improving the dictionaries and the transfer rules
Test the current state of the pair
Week 4:
Update the documentation and prepare for the evaluation
Deliverable 1: Bilingual dictionaries and some transfer rules
Second Phase
Week 5:
With the guidance of the mentors, learn more about the morphological rules and review the work from weeks 1-4.
Fixing minor issues in bilingual and monolingual dictionaries.
Perform testvoc/corpus tests
Week 6:
Start working on CG and disambiguation
Week 7:
Continue with the disambiguation tests and fix the issues they reveal
Week 8:
Test translations, make some improvements, fix some bugs and prepare for the evaluation
Deliverable 2: Coherent translation between the two languages
Third Phase
Week 9:
Expand bilingual dictionaries and work on disambiguation rules
Week 10:
Testvoc and some improvements; more work on the transfer rules
Week 11:
Test with regular conversations plus text from newspapers or magazines
Week 12:
Write documentation, complete testing and fix bugs
Final Evaluation
==Coding Challenge==
All the work has been saved in the GitHub repo: https://github.com/srj31/apertium-ben-hin
I have also stored the post-edited Bengali and Hindi translations of the story.
Initially the pair had very little in its bilingual dictionary. After learning from the resources available on the Apertium wiki and testing out the bn-en pair, I started working on this pair.
Statistics about input files
Number of words in reference: 503
Number of words in test: 388
Number of unknown words (marked with a star) in test: 70
Percentage of unknown words: 18.04 %
Results when removing unknown-word marks (stars)
Edit distance: 380
Word error rate (WER): 75.55 %
Number of position-independent correct words: 145
Position-independent word error rate (PER): 71.17 %
Results when unknown-word marks (stars) are not removed
Edit distance: 389
Word error rate (WER): 77.34 %
Number of position-independent correct words: 136
Position-independent word error rate (PER): 72.96 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 9
Percentage of unknown words that were free rides: 12.86 %
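As a rough check on the figures above, the percentages can be recomputed from the raw counts. The sketch below assumes that WER is the word-level edit distance divided by the number of reference words, and that PER is the share of reference words that are not position-independent matches; these assumed definitions reproduce the reported numbers, but they are not taken from the evaluation script itself. All function names are hypothetical.

```python
def word_edit_distance(reference, test):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(test) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, test_word in enumerate(test, start=1):
            substitution = previous[j - 1] + (ref_word != test_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def percentage(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def wer(edit_distance, reference_words):
    # Assumed definition: edit distance over reference length.
    return percentage(edit_distance, reference_words)


def per(correct_words, reference_words):
    # Assumed definition: reference words without a position-independent match.
    return percentage(reference_words - correct_words, reference_words)


if __name__ == "__main__":
    print("WER, stars removed:", wer(380, 503))
    print("PER, stars removed:", per(145, 503))
    print("WER, stars kept:", wer(389, 503))
    print("PER, stars kept:", per(136, 503))
    print("Unknown words:", percentage(70, 388))
    print("Free rides:", percentage(9, 70))
```

`word_edit_distance` is included only to show where an edit-distance count such as 380 could come from; it takes already tokenised sentences as lists of words.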
Work Done:
Added entries to the bilingual dictionaries
Added certain words to the monolingual dictionaries as well
Started working with the transfer rules
Issues pending:
Translating from Bengali to Hindi requires genitive pronouns to agree in gender with the noun they modify
Negation marker is not handled
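The genitive agreement issue can be illustrated with a toy example. Bengali আমার ("my") has a single form, while the Hindi equivalent changes with the gender and number of the possessed noun (मेरा, मेरे, मेरी). The sketch below is only an illustration in Python, not the transfer rule used in the pair: the lookup tables and the function name are hypothetical, case (direct versus oblique) is ignored, and a real rule would read gender and number from the morphological analysis of the Hindi noun.

```python
# Hypothetical lookup tables for illustration only.
HINDI_NOUN_FEATURES = {
    "घर": ("m", "sg"),     # ghar, "house"
    "किताब": ("f", "sg"),  # kitaab, "book"
}

# First person singular genitive forms, direct case only.
HINDI_MY = {
    ("m", "sg"): "मेरा",
    ("m", "pl"): "मेरे",
    ("f", "sg"): "मेरी",
    ("f", "pl"): "मेरी",
}


def hindi_my(possessed_noun):
    """Pick the Hindi form of "my" that agrees with the possessed noun."""
    gender, number = HINDI_NOUN_FEATURES[possessed_noun]
    return HINDI_MY[(gender, number)]


if __name__ == "__main__":
    for noun in ("घर", "किताब"):
        print(hindi_my(noun), noun)
```

The point of the sketch is that the choice of form depends on the target-side noun, so the information has to be available to the transfer stage rather than to the bilingual dictionary lookup of the pronoun alone.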
Skills
Ongoing major: Bachelor's in Mathematics and Computing
Relevant technical skills: Python (Advanced), XML (Intermediate), C++ (Advanced), Java (Intermediate)
Languages: Hindi (native), English (Advanced), Bengali (Advanced)
Experience:
I studied many languages, and linguistics in general, while preparing for the International Linguistics Olympiad 2019 and while creating and testing problems for the National Linguistics Olympiad. This has taught me to formalise the rules that are followed when translating from one language to another. I have taken online NLP courses on Coursera and am proficient in data structures and algorithms through competitive programming.
Non Summer of Code plans
I have no plans other than Summer of Code. However, in light of the coronavirus emergency, my college may postpone our major examinations (still tentative), in which case I will only be able to give 20 hrs/week for the first 1 to 2 weeks of the first phase. After that I will be on summer break, during which I can work 40+ hrs/week.