Difference between revisions of "User:Srj31/GSOC 2020 proposal:Bengali-Hindi pair"
Latest revision as of 12:53, 31 March 2020
Contact information
Name: Sourabh Raj Jaiswal
Location: Noida, India
E-mail address: sourabhrj31@gmail.com
IRC: srj31
Timezone: UTC +5:30
GitHub: https://github.com/srj31
Why is it that you are interested in Machine Translation and Apertium?
I have always been intrigued by how computers can understand what we say, and I have long wanted to work on a project involving machine translation. This curiosity led me to study linguistics in high school. I came fifth at the National Linguistics Olympiad and had the honor of representing Team India at the International Linguistics Olympiad 2019 in South Korea. That experience, together with three years of studying linguistics and my interest in NLP, drew me to a machine translation project. Apertium is an open-source platform for developing rule-based machine translation systems, and contributing to it would give me the opportunity to go on to create more language pairs for the various languages of India.
Which of the published tasks are you interested in? What do you plan to do?
I plan to work on adopting the unreleased Hindi-Bengali language pair and getting it released in both directions with a WER below 20%. I plan to improve the pair until it translates sentences coherently. Bengali and Hindi are similar in many linguistic aspects, so many of their rules will be similar and rule-based translation between them can be expected to produce correct sentences. This makes them an "Apertium-ish" pair that fits the Apertium architecture well.
How it will benefit
Bengali is the official and most widely spoken language of Bangladesh and the second most widely spoken of the 22 scheduled languages of India, behind Hindi. Translation allows ideas and information to spread across cultures and, in the process, changes history. Since Hindi and Bengali are the two most widely spoken languages in India, a working translator between them will help spread the culture and literature of India. The current Bengali-Hindi pair has only a bilingual dictionary containing some nouns and little else. With this project I aim to create a working language pair that performs much better and produces correct translations between two of the most spoken languages in India.
Work Plan
Post-application period:
Find language resources for ben-hin (mainly Wikipedia articles available in both languages)
Learn more about the Apertium dictionaries and tools (transfer rules, CG, anaphora resolution) and study other language pairs and their rules to understand how to write efficient rules
Community Bonding:
Get familiar with all the Apertium modules and how they work. Discuss with the mentors and clear up doubts.
First Phase
Week 1:
Improve the monolingual dictionaries
Add nouns, prepositions, adjectives and adverbs to the bilingual dictionary of the ben-hin pair
Week 2:
Add verbs, pronouns and conjunctions to the bilingual dictionary and write lexical selection rules
Week 3:
Continue improving the dictionaries and the lexical selection rules
Test the current state of the pair
Week 4:
Learn more about Constraint Grammar and implement it
Update the documentation and prepare for the evaluation
Deliverable 1: Bilingual dictionaries and handling of synonyms while translating, WER < 30% for ben->hin
Second Phase
Week 5:
With the guidance of the mentors, learn more about the morphological rules and review the work from weeks 1-4
Fix minor issues in the bilingual and monolingual dictionaries
Week 6:
Expand the bilingual dictionary
Write transfer rules for ben->hin
Week 7:
Expand the bilingual dictionary and the lexical selection rules
Transfer rules for ben->hin
Manual disambiguation (ben)
Week 8:
Transfer rules for hin->ben
Test translations, make improvements, fix bugs and prepare for the evaluation
Deliverable 2: Coherent translation in both directions, WER < 25% for ben->hin and hin->ben
Third Phase
Week 9:
Expand the bilingual dictionaries and work on disambiguation rules (ben and hin)
Transfer rules (hin->ben)
Week 10:
Testvoc and improvements, more work on the transfer rules (hin->ben)
Week 11:
Test with everyday conversations and text from newspapers or magazines
Week 12:
Write documentation, complete testing and fix bugs
Final Evaluation: translations with WER < 20% for ben->hin and hin->ben
NOTE: The third week has been kept light to leave room for unforeseen issues or for implementing something new
Coding Challenge
All the work is saved in the GitHub repo: https://github.com/srj31/apertium-ben-hin
I have also stored the post-edited translations of the story in Bengali (https://github.com/srj31/apertium-ben-hin/blob/master/story-ben.txt) and Hindi (https://github.com/srj31/apertium-ben-hin/blob/master/story-hin.txt), along with the machine translations for ben-hin (https://github.com/srj31/apertium-ben-hin/blob/master/dev/mt_bn-hi.txt) and hin-ben (https://github.com/srj31/apertium-ben-hin/blob/master/dev/mt_hi-bn.txt).
Initially the pair had very little in its bilingual dictionary. After learning from the resources available on the Apertium wiki and testing the bn-en pair, I started working on this pair.
I focused mainly on translating from ben->hin.
The following results are for ben->hin.
Work Done:
- Added entries to the bilingual dictionary
- Added some words to the monolingual dictionaries as well
- Locative case of nouns is now handled when translating ben->hin
- Gender marking on adjectives is handled when translating ben->hin
- The negation marker is now handled when translating ben->hin
- Started working on the transfer rules
Issues pending:
- Translating from Bengali to Hindi requires genitive pronouns to agree in gender (solution: take the gender of the noun following the pronoun)
- Hindi verbs do not have past tense tags in the monodix (solution: it seems they use the imprft tag instead of past)
- Bengali uses the same surface form for several tenses and other verb forms, so disambiguation is required (solution: manual disambiguation)
e.g.
জেমস লিখতে ভালোবাসেন - जेम्स को लिखना पसंद है - (James loves to write)
জেমস লিখতে ভাল - जेम्स अच्छा लिखता हैं - (James is good at writing)
জেমস লিখতে ভাল ছিল - जेम्स लिखने में अच्छा था - (James was good at writing)
- Disambiguation is required for gender: Bengali verbs do not mark gender but Hindi verbs do (solution: the transfer rules need to identify the gender from the subject of the sentence, or else from the context of the text. Both languages follow SOV word order, so intransitive and transitive verbs have to be handled as separate cases. It was suggested to me that anaphora resolution could help here in identifying the subject, so that the verb receives the correct gender marker)
e.g.
সে লিখছে - वह लिख रही है - (She is writing)
সে লিখছে - वह लिख रहा है - (He is writing)
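The proposed fix for genitive pronouns (take the gender of the noun that follows) can be sketched in Python. This is only a toy illustration of the idea, not the pair's actual transfer rule: the token representation, the function names and the tag names (prn, gen, n, m, f) are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical token representation: (lemma, list of tags).
# The tag names below are assumptions for illustration only.
Token = Tuple[str, List[str]]

GENDER_TAGS = {"m", "f"}


def noun_gender(token: Token) -> Optional[str]:
    """Return the gender tag of a noun token, or None if it is not a gendered noun."""
    _, tags = token
    if "n" not in tags:
        return None
    for tag in tags:
        if tag in GENDER_TAGS:
            return tag
    return None


def agree_genitive_pronouns(tokens: List[Token], default: str = "m") -> List[Token]:
    """Copy the gender of the next noun onto each genitive pronoun."""
    out: List[Token] = []
    for i, (lemma, tags) in enumerate(tokens):
        if "prn" in tags and "gen" in tags:
            gender = None
            # Look ahead for the first noun after the pronoun.
            for following in tokens[i + 1:]:
                gender = noun_gender(following)
                if gender is not None:
                    break
            # Drop any existing gender tag and add the one found (or the default).
            tags = [t for t in tags if t not in GENDER_TAGS] + [gender or default]
        out.append((lemma, tags))
    return out


# "my book": the feminine noun decides the pronoun's gender.
phrase = [("मैं", ["prn", "gen"]), ("किताब", ["n", "f"])]
print(agree_genitive_pronouns(phrase))
```

The look-ahead skips intervening words such as adjectives, and falls back to a default gender when no noun is found; a real rule would need a better-motivated fallback.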
Statistics about input files
Number of words in reference: 498
Number of words in test: 424
Number of unknown words (marked with a star) in test: 7
Percentage of unknown words: 1.65 %
Results when removing unknown-word marks (stars)
Edit distance: 221
Word error rate (WER): 44.38 %
Number of position-independent correct words: 300
Position-independent word error rate (PER): 39.76 %
Results when unknown-word marks (stars) are not removed
Edit distance: 221
Word Error Rate (WER): 44.38 %
Number of position-independent correct words: 300
Position-independent word error rate (PER): 39.76 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
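For readers unfamiliar with the metric, the WER figure above is the word-level edit distance divided by the number of reference words. The following is a minimal Python sketch of that calculation, not the evaluation script that produced the statistics; the function names are chosen for this example.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One wrong word out of four gives a WER of 25 %.
reference = "वह लिख रही है".split()
hypothesis = "वह लिख रहा है".split()
print(wer(reference, hypothesis))

# The reported ben->hin figure: edit distance 221 over 498 reference words.
print(round(100.0 * 221 / 498, 2))
```

The second print reproduces the 44.38 % reported above from the edit distance and the reference length.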
Results for hin->ben
Issues:
- The case marker is sometimes not attached to the noun stem, so the noun and the case marker are treated separately (solution: for each noun, check whether the following word is a case marker; if it is, set the corresponding case in Bengali, otherwise the case is nom)
- Transfer rules need to be written so that irrelevant tags are handled
- Auxiliary verbs are not handled properly: since they are separated from the verb stem, they are treated as individual words (solution: the auxiliary verb should override some of the tags of the verb stem)
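The solution proposed for detached case markers can be sketched in Python as follows. This is a toy illustration under stated assumptions rather than an Apertium transfer rule: the postposition-to-case mapping, the part-of-speech labels and the function name are all hypothetical and deliberately incomplete.

```python
from typing import List, Optional, Tuple

# Assumed, incomplete mapping from Hindi postpositions to Bengali case tags.
POSTPOSITION_TO_CASE = {
    "में": "loc",
    "का": "gen",
    "के": "gen",
    "की": "gen",
    "को": "obj",
}


def attach_case(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str, Optional[str]]]:
    """Merge a noun with a following case marker; nouns without one get 'nom'.

    Input tokens are (surface, pos); output tokens are (surface, pos, case),
    where case is None for anything that is not a noun.
    """
    out: List[Tuple[str, str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        surface, pos = tokens[i]
        if pos == "n":
            following = tokens[i + 1][0] if i + 1 < len(tokens) else None
            if following in POSTPOSITION_TO_CASE:
                # Consume the case marker and record its case on the noun.
                out.append((surface, pos, POSTPOSITION_TO_CASE[following]))
                i += 2
                continue
            out.append((surface, pos, "nom"))
        else:
            out.append((surface, pos, None))
        i += 1
    return out


# "घर में है" (is in the house): the postposition becomes a locative case on the noun.
print(attach_case([("घर", "n"), ("में", "post"), ("है", "vblex")]))
```

A real rule would also have to deal with compound postpositions and with adjectives or other words standing between the noun and its marker.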
Statistics about input files
Number of words in reference: 376
Number of words in test: 454
Number of unknown words (marked with a star) in test: 62
Percentage of unknown words: 13.66 %
Results when removing unknown-word marks (stars)
Edit distance: 279
Word error rate (WER): 74.20 %
Number of position-independent correct words: 186
Position-independent word error rate (PER): 71.28 %
Results when unknown-word marks (stars) are not removed
Edit distance: 279
Word Error Rate (WER): 74.20 %
Number of position-independent correct words: 186
Position-independent word error rate (PER): 71.28 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
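The percentages in both statistics blocks can be cross-checked from the raw counts. The Python sketch below does this. The WER and unknown-word formulas are the standard ratios; the PER formula, (max(reference, test) - correct) / reference, is an assumption that happens to reproduce the reported figures and is not taken from the evaluation tool's source.

```python
def wer_pct(edit_distance: int, ref_words: int) -> float:
    """Word error rate in percent."""
    return 100.0 * edit_distance / ref_words


def per_pct(ref_words: int, test_words: int, correct: int) -> float:
    """Position-independent error rate in percent (assumed formula)."""
    return 100.0 * (max(ref_words, test_words) - correct) / ref_words


def unknown_pct(unknown: int, test_words: int) -> float:
    """Share of unknown (starred) words in the test output, in percent."""
    return 100.0 * unknown / test_words


# Counts copied from the two statistics blocks above.
runs = {
    "ben->hin": dict(ref=498, test=424, unknown=7, dist=221, correct=300),
    "hin->ben": dict(ref=376, test=454, unknown=62, dist=279, correct=186),
}

for name, r in runs.items():
    print(name,
          round(wer_pct(r["dist"], r["ref"]), 2),
          round(per_pct(r["ref"], r["test"], r["correct"]), 2),
          round(unknown_pct(r["unknown"], r["test"]), 2))
```

Because every value is normalised by the reference length, a test output longer than the reference (as in hin->ben) is penalised for its extra words.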
Skill
Ongoing major: Bachelor's in Mathematics and Computing
Relevant technical skills: Python (Advanced), XML (Intermediate), C++ (Advanced), Java (Intermediate), Git (Intermediate), Bash scripts (Basic)
Languages: Hindi (Native), English (Advanced), Bengali (Advanced)
Experience:
I studied many languages and linguistics in general while preparing for the International Linguistics Olympiad 2019 and the Asia Pacific Linguistics Olympiad, and while creating and testing problems for the National Linguistics Olympiad. This taught me about the rules that a language follows, how to formalise the rules used when translating from one language to another, and how to notice the patterns that recur between a source and a target language. I have also been programming since high school: I have taken part in competitive programming contests, prepared problems for school contests and worked on IoT projects such as home automation. I have completed online NLP courses on Coursera and am proficient in data structures and algorithms through competitive programming.
Non Summer of Code plans
I have no plans other than Summer of Code. However, in light of the coronavirus emergency, my college might have to postpone our major examinations (still tentative). If that happens, I will only be able to give 20 hrs/week for 1 or 2 weeks of the first phase. After that I will be on summer break and can work 40+ hrs/week.