User:Saswata Bose/GSoC2024Proposal

From Apertium
Revision as of 21:02, 1 April 2024 by Saswata Bose (talk | contribs)

Contact Information

Name: Saswata Bose

Location: India

University: International Institute of Information Technology Hyderabad (Deemed to be University)

Email address: saswatabosepersonal@gmail.com

Timezone: GMT+5:30

Github: HimalayanSaswataBose


Why is it that you are interested in Apertium?

  • Apertium allows one, as a language lover, to work very closely on a language from both a linguistic and a computational perspective. As a research student of Computational Linguistics at my current institution, I find that this is in line with my interests, both academic and personal.
  • In the growing field of LLMs, people seem to have forgotten the power of rule-based methods. Apertium, to me, is a pioneer of rule-based translation, which makes me even more interested in working closely with the team.
  • Translation systems built on rule-based methods can be very accurate if configured properly. They can help not only in analyzing low-resource languages but also in preparing gold-standard datasets from a single common corpus.

Which of the published tasks are you interested in? What do you plan to do?

  • Interested Task: Add a new variety to an existing language
  • Planned Action: Add "Barendri" variety of Bengali to the BN-EN (Bengali-English) Language Pair.

Proposal

Deliverables:

  • Creating the BN-EN bilingual dictionary.
  • Creating the BN monolingual dictionary specific to the "Barendri" dialect
  • Updating the EN monolingual dictionary, if required.
  • Building the transfer rules for the BN-EN pair.
  • Creating a BN-EN translator.
  • Developing the BN-EN language pair far enough for it to be promoted from the Incubator stage to the Nursery level.
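
As a rough illustration of what the bilingual dictionary deliverable does at lookup time, the following is a minimal Python sketch, not Apertium's actual implementation (Apertium dictionaries are XML .dix files compiled with lttoolbox). It assumes a simplified form of the Apertium stream format, `^lemma<tag>...$`, and marks lemmas missing from the dictionary with `@`. The names `BIDIX`, `translate_unit`, and `lookup` are hypothetical, and the two entries are standard Bengali placeholders, not Barendri data.

```python
import re

# Hypothetical toy entries keyed by (source lemma, part-of-speech tag).
# A real bilingual dictionary would live in an XML .dix file.
BIDIX = {
    ("বই", "n"): "book",
    ("জল", "n"): "water",
}

# One lexical unit in the simplified stream format: ^lemma<tag><tag>$
LEXICAL_UNIT = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def translate_unit(match):
    """Replace the source lemma with its target lemma, keeping the tags."""
    lemma, tags = match.group(1), match.group(2)
    # The first tag is treated as the part of speech (an assumption of this sketch).
    pos = tags[1:tags.index(">")] if tags else ""
    target = BIDIX.get((lemma, pos))
    if target is None:
        # Mark a lemma that has no dictionary entry.
        return f"^@{lemma}{tags}$"
    return f"^{target}{tags}$"


def lookup(stream):
    """Apply the toy dictionary to every lexical unit in the stream."""
    return LEXICAL_UNIT.sub(translate_unit, stream)


if __name__ == "__main__":
    print(lookup("^বই<n><sg>$ ^জল<n><sg>$"))
```

The tags are carried over unchanged here; in a full pair, the transfer rules listed above would be responsible for adjusting them and for reordering words.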

Reasons why Google and Apertium should sponsor it:

  • Bengali is spoken by approximately 240 million people (standing as the seventh most spoken language by total number of speakers). If the BN-EN Language pair is developed, it will cater to a huge population.
  • The last update to BN-EN, whether to the translator or to the dictionaries, was close to ten years ago. The project will be a way to bring the data up to date.
  • The release of this one-of-a-kind Bengali dialect language pair will fuel the development of more dialect-based engines.
  • "Barendri" is a dialect that is prevalent, almost equally, in regions of both India (in the state of West Bengal) and Bangladesh. This makes the target population more diverse.
  • The project will be useful mainly in two areas of linguistic research: facilitating research in low-resource languages, and understanding dialect variation by comparing the dialects of the two capitals (namely Kolkata (West Bengal) and Dhaka (Bangladesh)) with a dialect that lies almost midway between them.

How and who it will benefit in society

  • The project aims to develop an accurate translator for one of the most widely spoken languages.
  • The system will be one of the first in the industry to have dialect information embedded in it. It can be combined with various input systems (such as speech recognition, textual data, and OCR) to reach the masses and facilitate communication with other people in the language of their choice.
  • Vocabulary differs substantially across Bengali dialects, so much so that many dialects (such as Sylheti) cannot be understood by people who know only standard Bengali. This system can be extended into a first-of-its-kind interdialectal converter for such situations.
  • Like any ordinary translator, it can be used to facilitate communication with other people.

Work plan

Community bonding period (May 1 - May 26):

  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Barendri.
  • Setting up the development environment and related tooling.

Work Period (May 27 - Aug 26):

Week 1 (27/05-02/06):

Week 2 (03/06-09/06):

Week 3 (10/06-16/06):

Week 4 (17/06-23/06):

Week 5 (24/06-30/06):

Week 6 (01/07-07/07):

Deliverable 1: Monolingual and Bilingual dictionary, basic transfer rules

Week 7 (12/07-18/07):

Week 8 (19/07-25/07):

Week 9 (26/07-01/08):

Week 10 (02/08-08/08):

Week 11 (09/08-15/08):

Week 12+ (15/08-26/08):

Project completed

Skills

Coding Challenge

Data Acquisition

Resources

Non summer of code plans