Difference between revisions of "User:Srj31/GSOC 2020 proposal:Bengali-Hindi pair"

Revision as of 15:37, 30 March 2020

Contact information

Name: Sourabh Raj Jaiswal

Location: Noida, India

E-mail address: sourabhrj31@gmail.com

IRC: srj31

Timezone: UTC +5:30

Github: https://github.com/srj31


Why is it that you are interested in Machine Translation and Apertium?

I have long wanted to work on a project involving Machine Translation and was intrigued by how computers could understand what we say. This curiosity led me to study linguistics in high school. I came fifth at the National Linguistics Olympiad and had the honor of representing Team India at the International Linguistics Olympiad 2019 in South Korea. This experience and three years of involvement in linguistics, along with my interest in NLP, got me interested in a Machine Translation project. Apertium, being an open-source platform for developing rule-based machine translation systems, has motivated me to contribute, and it will give me the opportunity to create more language pairs for the various languages of India.


Which of the published tasks are you interested in? What do you plan to do?

I plan to work on adopting the unreleased language pair Hindi-Bengali and to get the pair released in both directions with a WER below 20%. I plan to improve the pair until it can translate sentences coherently. Bengali and Hindi are similar in many linguistic aspects, so they need similar rules, and a well-built rule-based system should produce correct sentences. This makes them an "Apertium-ish" pair.

How it will benefit

Bengali is the official and most widely spoken language of Bangladesh and the second most widely spoken of the 22 scheduled languages of India, behind Hindi. Translation allows ideas and information to spread across cultures, and in the process it changes history. Since Hindi and Bengali are the two most widespread languages in India, a translator between them will help spread the culture and literature of India. The current Bengali-Hindi pair has only a bilingual dictionary containing some nouns and little else. Through this project I aim to create a working language pair that performs much better and produces correct translations.


Work Plan

Post-application period: Find language resources for both languages of the ben-hin pair. Learn more about the Apertium dictionaries and tools.


Community Bonding: Get familiar with all the Apertium modules and how they work. Discuss with mentors and clear up doubts.


First Phase

Week 1: Improve the monolingual dictionaries. Add nouns, prepositions, adjectives and adverbs to the bilingual dictionary of the ben-hin pair.


Week 2: Add verbs, pronouns and conjunctions to the bilingual dictionaries. Write the transfer rules for verbs and nouns.


Week 3: Continue improving the dictionaries and the transfer rules. Test the current state of the pair.


Week 4: Update the documentation and prepare for the evaluation


Deliverable 1: Bilingual dictionaries and transfer rules

Second Phase

Week 5: With the guidance of the mentors, learn more about the morphological rules and review the work from weeks 1-4. Fix minor issues in the bilingual and monolingual dictionaries. Perform testvoc / corpus tests.


Week 6: Write transfer rules for hin-ben transfer. Start working on CG and disambiguation.


Week 7: Continue with the disambiguation tests and fix the problems they reveal.


Week 8: Test translations, make improvements, fix bugs and prepare for the evaluation.


Deliverable 2: Provide coherent translation between the language pairs

Third Phase

Week 9: Expand the bilingual dictionaries and work on disambiguation rules.


Week 10: Testvoc and some improvements. More work on the transfer rules.


Week 11: Test with regular conversations plus text from newspapers or magazines.


Week 12: Write documentation, complete testing and fix bugs.


Final Evaluation


NOTE: The third week has been kept light to allow time to deal with any unforeseen issues or even to implement something new.

Coding Challenge

All the work has been saved in the GitHub repo (https://github.com/srj31/apertium-ben-hin). I have also stored the post-edited translations of the story in Bengali (https://github.com/srj31/apertium-ben-hin/blob/master/story-ben.txt) and Hindi (https://github.com/srj31/apertium-ben-hin/blob/master/story-hin.txt). Initially, the pair had very little in its bilingual dictionary. After learning from the resources on the Apertium wiki and testing the bn-en pair, I started working on this pair.

These results are for ben -> hin


Work Done:

- Added entries to the bilingual dictionaries

- Added certain words to the monolingual dictionaries as well

- Locative case on nouns is now handled when going from ben -> hin

- The negation marker is now handled when going from ben -> hin

- Started working with the transfer rules


Issues pending:

- Translating from Bengali to Hindi requires genitive pronouns to agree in gender (solution: get the gender of the noun following the pronoun)

- Hindi verbs do not have past tense tags in the monodix (solution: it seems they use the imprft tag instead of past)

- Bengali uses the same form for several tenses and other verb forms, so disambiguation is required (solution: manual disambiguation required)

e.g.

            জেমস লিখতে ভালোবাসেন  - जेम्स को लिखना पसंद है - (James loves to write) 
            জেমস লিখতে ভাল  - जेम्स अच्छा लिखता हैं - (James is good at writing)
            জেমস লিখতে ভাল ছিল  - जेम्स लिखने में अच्छा था  - (James was good at writing)

- Disambiguation is required for gender: Bengali verbs do not mark gender but Hindi verbs do (solution: the transfer rules need to identify the gender from the subject of the sentence, or failing that from the context of the text)

e.g.

            সে লিখছে  -  वह लिख रही है - (She is writing)
            সে লিখছে  -  वह लिख रहा है - (He is writing)
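The proposed fix for the genitive pronoun issue (look up the gender of the noun that follows the pronoun) can be sketched in Python as below. This is only an illustration of the idea, not how the Apertium transfer rules are written: the tiny lexicon, the feature codes and the function names are all hypothetical and are not taken from the apertium-ben-hin dictionaries.

```python
# Toy illustration only: the lexicon entries, feature codes and function
# names are hypothetical, not taken from the apertium-ben-hin dictionaries.

# Gender and number of a few Hindi nouns (m = masculine, f = feminine).
NOUN_FEATURES = {
    "घर": ("m", "sg"),     # house
    "किताब": ("f", "sg"),  # book
    "कमरे": ("m", "pl"),   # rooms
}

# Hindi forms of the first-person genitive pronoun ("my"),
# keyed by the (gender, number) of the noun they modify.
MY_FORMS = {
    ("m", "sg"): "मेरा",
    ("m", "pl"): "मेरे",
    ("f", "sg"): "मेरी",
    ("f", "pl"): "मेरी",
}


def genitive_pronoun_for(noun: str) -> str:
    """Pick the form of 'my' that agrees with the following Hindi noun."""
    # Unknown nouns fall back to masculine singular (an arbitrary default).
    features = NOUN_FEATURES.get(noun, ("m", "sg"))
    return MY_FORMS[features]


def my_phrase(hindi_noun: str) -> str:
    """Build 'my <noun>' with an agreeing pronoun."""
    return f"{genitive_pronoun_for(hindi_noun)} {hindi_noun}"


print(my_phrase("किताब"))  # feminine noun, so the pronoun is मेरी
print(my_phrase("घर"))     # masculine noun, so the pronoun is मेरा
```

Bengali আমার does not change with the possessed noun, which is why the gender has to come from the Hindi side of the dictionary.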

Test file: 'm1.txt'

Reference file 'story-hin.txt'

Statistics about input files


Number of words in reference: 498

Number of words in test: 424

Number of unknown words (marked with a star) in test: 7

Percentage of unknown words: 1.65 %


Results when removing unknown-word marks (stars)


Edit distance: 221

Word error rate (WER): 44.38 %

Number of position-independent correct words: 300

Position-independent word error rate (PER): 39.76 %


Results when unknown-word marks (stars) are not removed


Edit distance: 221

Word Error Rate (WER): 44.38 %

Number of position-independent correct words: 300

Position-independent word error rate (PER): 39.76 %


Statistics about the translation of unknown words


Number of unknown words which were free rides: 0

Percentage of unknown words that were free rides: 0.00 %
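For readers who want to check the figures above, the word error rate is the word-level edit distance divided by the number of reference words. The following Python sketch shows that calculation; it is a minimal re-implementation for illustration, not the evaluation script that produced the output, and the function names are my own.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer_percent(edit_distance: int, reference_words: int) -> float:
    """Word error rate as a percentage of the reference length."""
    return 100.0 * edit_distance / reference_words


# Reported ben -> hin figures: edit distance 221, 498 reference words.
print(round(wer_percent(221, 498), 2))  # 44.38
```

Because the denominator is the reference length, a test translation that is much shorter or longer than the reference is penalised for every missing or extra word.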



These results are for hin -> ben

Issues :

- The case marker is separate from the noun stem, so the noun and the case marker are treated separately (solution: for each noun, check whether the following word is a case marker; if it is, set the corresponding case in Bengali, otherwise use nom)

- Transfer rules need to be built so that irrelevant tags can be taken care of

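The case-marker solution described above (check whether the word after a noun is a case marker, otherwise treat the noun as nominative) could look roughly like this in Python. It is a sketch of the logic only: the postposition table and the case tag names are hypothetical, and the real pair would express this as transfer rules rather than Python.

```python
# Hypothetical mapping from Hindi postpositions to case tags.
# The tag names are illustrative, not the ones used in the pair.
POSTPOSITION_CASE = {
    "में": "loc",
    "पर": "loc",
    "को": "acc",
    "का": "gen",
    "की": "gen",
    "के": "gen",
    "से": "ins",
}


def attach_case(tokens, nouns):
    """Return (word, case) pairs, folding a postposition into the preceding noun.

    Nouns followed by a known postposition get that postposition's case and
    the postposition itself is dropped; other nouns get "nom". Words that are
    not nouns are passed through with case None.
    """
    result = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if word in nouns:
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following in POSTPOSITION_CASE:
                result.append((word, POSTPOSITION_CASE[following]))
                i += 2  # skip the postposition that was just consumed
                continue
            result.append((word, "nom"))
        else:
            result.append((word, None))
        i += 1
    return result


# "घर में है" (is in the house): the noun picks up the locative case.
print(attach_case(["घर", "में", "है"], {"घर"}))
```

Once the case is attached to the noun, the Bengali generator can produce the inflected form as a single word.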


Test file: 'm3.txt'

Reference file 'story-ben.txt'


Statistics about input files


Number of words in reference: 376

Number of words in test: 454

Number of unknown words (marked with a star) in test: 62

Percentage of unknown words: 13.66 %


Results when removing unknown-word marks (stars)


Edit distance: 279

Word error rate (WER): 74.20 %

Number of position-independent correct words: 186

Position-independent word error rate (PER): 71.28 %


Results when unknown-word marks (stars) are not removed


Edit distance: 279

Word Error Rate (WER): 74.20 %

Number of position-independent correct words: 186

Position-independent word error rate (PER): 71.28 %


Statistics about the translation of unknown words


Number of unknown words which were free rides: 0

Percentage of unknown words that were free rides: 0.00 %
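The position-independent error rate ignores word order and only counts how many words the test and reference share. The sketch below shows one way to compute it in Python. The formula, including the penalty for a test output longer than the reference, is an assumption inferred from the reported figures rather than taken from the evaluation tool's source, and the function names are my own.

```python
from collections import Counter


def position_independent_correct(reference, hypothesis):
    """Count words shared by both texts, ignoring order (multiset overlap)."""
    overlap = Counter(reference) & Counter(hypothesis)
    return sum(overlap.values())


def per_percent(correct, reference_words, test_words):
    """Position-independent error rate as a percentage of the reference length.

    Assumed formula: reference words left unmatched, plus any words by which
    the test output exceeds the reference, divided by the reference length.
    """
    extra = max(0, test_words - reference_words)
    return 100.0 * (reference_words - correct + extra) / reference_words


# Reported hin -> ben figures: 186 correct, 376 reference words, 454 test words.
print(round(per_percent(186, 376, 454), 2))  # 71.28
```

Under this assumption the hin -> ben score is hurt both by unmatched words and by the 78 extra words in the test output.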

Skills

Ongoing major: Bachelor's in Mathematics and Computing

Relevant technical skills: Python (Advanced), XML (Intermediate), C++ (Advanced), Java (Intermediate)

Languages: Hindi (Native), English (Advanced), Bengali (Advanced)

Experience:

While preparing for the International Linguistics Olympiad 2019 and creating and testing problems for the National Linguistics Olympiad, I studied many languages and linguistics in general. This has taught me to formalise the rules followed when translating from one language to another and to notice the patterns that arise between source and target language. I have taken online NLP courses on Coursera and am proficient in Data Structures and Algorithms through competitive programming.

Non Summer of Code plans

I have no plans other than Summer of Code. However, in light of the Corona emergency, colleges might have to postpone our major examinations (still tentative), in which case I will only be able to give 20 hrs/week for 1 or 2 weeks of the first phase. After that I will have a summer break, during which I can work 40+ hrs/week.