User:Saswata Bose/GSoC2024Proposal


Contact Information

Name: Saswata Bose

Location: India

University:

  • International Institute of Information Technology Hyderabad (Deemed to be University)
  • Indian Institute of Technology Madras (Enrolled in an Online Degree Course)

Email address: saswatabosepersonal@gmail.com

Timezone: GMT+5:30

Github: HimalayanSaswataBose

Why is it that you are interested in Apertium?

  • Apertium allows a language lover to work closely with a language from both a linguistic and a computational perspective. As a research student of Computational Linguistics at my current institution, I find that this aligns with my interests, both academic and personal.
  • In the growing field of LLMs, people seem to have forgotten the power of rule-based methods. Apertium, to me, is a pioneer in rule-based translation, which makes me even more interested in working closely with the team.
  • Translation systems based on rule-based methods can be very accurate if configured properly. They can help not only in analyzing low-resource languages but also in preparing gold-standard datasets from a single common corpus.

Which of the published tasks are you interested in? What do you plan to do?

  • Interested Task: Add a new variety to an existing language
  • Planned Action: Add "Barendri" variety of Bengali to the BN-EN (Bengali-English) Language Pair.

Proposal

Deliverables:

  • Updating the BN-EN bilingual dictionary, specific to the “Barendri” dialect
  • Updating the BN monolingual dictionary specific to the "Barendri" dialect
  • Updating the EN monolingual dictionary, if required.
  • Updating the transfer rules for the BN-EN pair.
  • Updating the BN-EN translator.
  • Developing the BN-EN language pair as a whole so that it moves from the Incubator stage to at least the Nursery level.


Reasons why Google and Apertium should sponsor it:

  • Bengali is spoken by approximately 240 million people, making it the seventh most spoken language by total number of speakers. A developed BN-EN language pair would therefore serve a very large population.
  • The last update to the BN-EN translator or dictionary was close to ten years ago. The project will be a way to bring the data up to date.
  • The release of this one-of-a-kind Bengali dialect language pair will encourage the development of more dialect-based engines.
  • "Barendri" is a dialect that is prevalent, almost equally, in regions of both India (in the state of West Bengal) and Bangladesh. This makes the target population more diverse.
  • The project will be useful mainly in two areas of linguistic research: facilitating research in low-resource languages, and understanding dialect variation by comparing the dialects of the two capitals, Kolkata (West Bengal) and Dhaka (Bangladesh), with a dialect that lies almost midway between them.

How and who it will benefit in society

  • The project will be able to develop an accurate translator for one of the most widely spoken languages.
  • The system will be one of the first in the industry to have dialect information embedded in it. It can be combined with various input systems (such as speech recognition, textual data and OCR) to reach the masses and facilitate communication with other people in the language of their choice.
  • Vocabulary differs substantially between Bengali dialects, so much so that many dialects (such as Sylheti) cannot be understood by people who know only standard Bengali. This system can be extended into a first-of-its-kind interdialectal converter for such situations.
  • Like any other translator, it can be used to facilitate communication with other people, thereby fulfilling the general use cases of a translator.

Work plan

Community bonding period (May 1 - May 26):

  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Barendri.
  • Setting up the development environment and related tooling.

Work Period (May 27 - Aug 26):

Week 1 (27/05-02/06): (Vocabulary creation)

  • Building paradigms for the "Barendri" dialect in the monolingual Bengali dictionary.
  • Adding the most used vocabulary to the Monodix and Bidix in accordance with "Barendri".

Week 2 (03/06-09/06): (Vocabulary Addition using other References)

  • Adding further vocabulary to the Monodix and the Bidix in accordance with “Barendri” by translating as many of the words already included in the dictionaries as possible.
  • Using reference material (primarily the 5-volume “Ancholik Bhashar Obhidaan” (Dictionary of Regional Language)) to extend the vocabulary as much as possible.

Week 3 (10/06-16/06): (Multiword Expressions)

  • Adding as many idiomatic cases, connotations and multiword expressions as possible.

Week 4 (17/06-23/06): (Rectification of .t3x file)

  • Learn the construction of the .t3x file
  • Writing the .t3x file as much as possible
  • Complete pending portions of .t2x file

Week 5 (24/06-30/06): (Documentation and Backlogs)

  • Complete all pending tasks
  • Prepare documentation

Week 6 (01/07-07/07): (Finalization phase)

  • Identifying edge cases and persisting problems
  • Preparing for mid-term evaluations

Week 7 (12/07-18/07): (Preliminary Translation)

  • Translating Barendri text to search for errors
  • Developing lexical selection rules

Week 8 (19/07-25/07): (Disambiguation)

  • Comparing the standard Bengali translations with the Barendri translations to gain deeper insight into the errors.
  • Working on disambiguation rules.
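
One possible way to carry out the comparison planned for this week is a word-level diff of the English output obtained from the standard Bengali text against the output obtained from the Barendri version of the same text. The following is only a minimal sketch: the function name and the sample sentences are hypothetical placeholders and are not taken from the actual data.

```python
import difflib

def word_differences(standard_output, barendri_output):
    """Return (tag, standard_words, barendri_words) for every span that differs."""
    a = standard_output.split()
    b = barendri_output.split()
    # autojunk=False keeps frequent words from being ignored in longer texts.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

if __name__ == "__main__":
    # Placeholder sentences; "*xyz" stands for a word the translator did not recognise.
    for diff in word_differences("I am going home", "I *xyz home"):
        print(diff)
```

Spans that differ between the two outputs point to places where a Barendri form is missing from the dictionaries or is handled differently by the rules.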

Week 9 (26/07-01/08): (Testvoc)

  • Complete pending tasks
  • Start testvoc with "Generation testvoc with lttoolbox analyser" as mentioned here.

Week 10 (02/08-08/08): (Corpus testvoc)

  • Run corpus testvoc (as mentioned here) and fix the errors it reveals.
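
As a rough illustration of what the corpus testvoc looks for, the sketch below counts the error markers that Apertium leaves in translated output: `*` for a word unknown to the analyser, `@` for a word missing from the bilingual dictionary, and `#` for a form that could not be generated. The function name and the sample string are hypothetical; the real testvoc procedure is the one described on the wiki.

```python
import re
from collections import Counter

# Markers that Apertium prefixes to words it could not handle.
MARKERS = {"*": "unknown word", "@": "missing in bidix", "#": "generation error"}

# A marker is counted only at the start of a token (start of text or after whitespace).
TOKEN_RE = re.compile(r"(?<!\S)([*@#])(\S+)")

def count_markers(text):
    """Return a Counter mapping (marker, word) to its number of occurrences."""
    return Counter(
        (m.group(1), m.group(2).rstrip(".,;:!?।"))  # drop trailing punctuation
        for m in TOKEN_RE.finditer(text)
    )

if __name__ == "__main__":
    # Placeholder output; foo, bar and baz are not real dictionary entries.
    sample = "the *foo went to @bar and #baz, then *foo"
    for (marker, word), n in count_markers(sample).most_common():
        print(f"{MARKERS[marker]}: {word} ({n})")
```

Sorting by frequency shows which missing entries would fix the most errors first.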

Week 11 (09/08-15/08): (Documentation)

  • Start writing the documentation
  • Discussion with mentors, organization and the community

Week 12+ (15/08-26/08): (Final submission)

  • Prepare for final evaluation

Skills

Here is a brief overview of my skill set:

  • I am a freshman at the International Institute of Information Technology (Hyderabad), pursuing a dual degree course offering a B.Tech in Computer Science and an MS (by Research) in Computational Linguistics. This has given me some experience and skills in core linguistics as well as computational linguistics. Currently, I am also working on my Computational Linguistics semester project, which aims to build a POS tagger for Bengali for the case where two dialects are spoken together.
  • Alongside this, I am pursuing an online degree from the Indian Institute of Technology Madras, offering a BS in Data Science and Applications, in which I am currently at the Diploma level. As a result, I have a fairly sound understanding of data and datasets, which helps me make better decisions about the datasets I choose, their cleaning and their overall analysis. In this project, that should allow me to explore a good amount of variation in the chosen datasets so that details are not missed.
  • I am proficient in several programming languages, including Python (about 8 years), C, HTML/CSS, JavaScript, MySQL, Bash scripting and x86-64 Assembly, and I have a little experience with XML. I have some experience working with TensorFlow, Keras, Pandas (about 4 years) and NLTK.
  • I am a native Bengali speaker and know a bit of "Barendri" Bengali. I can also speak, read and write Hindi and English.
  • I consider myself a fast learner. I try to pick things up quickly while maintaining quality throughout. I also learn through experimentation, which gives me a fuller view of the subject.

Coding Challenge

I could only complete the coding challenge partially. I worked on it from 18th March to 31st March, as I was busy with my college exams before that. The progress made so far can be found here. The transfer rules have not yet been corrected. I also plan to look at the Hindi-English pair once I have learned how transfer rules are used, so that I can take the best of both and increase the accuracy of the system.

Data Acquisition

The data used were stories in standard Bengali translated into "Barendri" by native speakers of the dialect. This method was chosen to bring out the more pronounced changes, given that the major changes in the dialect are in pronouns, along with certain verbs and their inflections.

Non summer of code plans

I have no other plans for this summer as of now until July, after which my Monsoon Semester will start. Accordingly, I can comfortably dedicate about 35-40 hours per week until July. If the situation demands, I may work a few extra hours to compensate for the slower progress in August. In August, I will be able to give about 20-25 hours per week, making sure that the deliverables are completed with acceptable quality.