Difference between revisions of "User:Saswata Bose/GSoC2024Proposal"

Revision as of 21:02, 1 April 2024

Contact Information

Name: Saswata Bose

Location: India

University: International Institute of Information Technology Hyderabad (Deemed to be University)

Email address: saswatabosepersonal@gmail.com

Timezone: GMT+5:30

Github: HimalayanSaswataBose


Why is it that you are interested in Apertium?

  • Apertium lets a language lover work closely on a language from both a linguistic and a computational perspective. As a research student of Computational Linguistics at my current institution, I find that this matches my interests both academically and personally.
  • In the growing field of LLMs, people seem to have forgotten the power of rule-based methods. To me, Apertium is a pioneer in this area, which makes me even more interested in working closely with the team.
  • Translation systems built on rule-based methods can be very accurate if configured properly. They can not only help in analyzing low-resource languages but also aid in preparing gold-standard datasets from a single common corpus.

Which of the published tasks are you interested in? What do you plan to do?

  • Interested Task: Add a new variety to an existing language
  • Planned Action: Add the "Barendri" variety of Bengali to the BN-EN (Bengali-English) language pair.

Proposal

Deliverables:

  • Creating the BN-EN bilingual dictionary.
  • Creating the BN monolingual dictionary specific to the "Barendri" dialect.
  • Updating the EN monolingual dictionary, if required.
  • Building the transfer rules for the BN-EN pair.
  • Creating a BN-EN translator.
  • Developing the BN-EN language pair far enough to move it from the Incubator stage to the Nursery level.
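
As a rough illustration of the dictionary deliverables above, the sketch below builds a single bilingual-dictionary entry in the `<e>`/`<p>`/`<l>`/`<r>`/`<s>` layout used by Apertium `.dix` files. The helper name `make_bidix_entry` and the lemmas `BN_LEMMA` and `EN_LEMMA` are hypothetical placeholders, not actual Barendri data, and the real dictionaries may well be edited by hand rather than generated.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(source_lemma, target_lemma, tags):
    """Build one bilingual-dictionary <e> element pairing two lemmas.

    `tags` lists the symbol names attached to both sides,
    e.g. ["n"] for a noun. (Hypothetical helper for illustration.)
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    # <l> holds the source-language side, <r> the target-language side.
    for side, lemma in (("l", source_lemma), ("r", target_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            ET.SubElement(node, "s", {"n": tag})
    return entry


# Placeholder lemmas only: real entries would come from the resources
# collected for the Barendri variety.
entry = make_bidix_entry("BN_LEMMA", "EN_LEMMA", ["n"])
entry_xml = ET.tostring(entry, encoding="unicode")
print(entry_xml)
```

Generating entries this way would mainly be useful for bulk-importing word lists; small additions are usually simpler to make directly in the dictionary file.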

Reasons why Google and Apertium should sponsor it:

  • Bengali is spoken by approximately 240 million people, making it the seventh most spoken language by total number of speakers. A developed BN-EN language pair would therefore serve a very large population.
  • The last update to BN-EN, whether to the translator or the dictionary, was close to ten years ago. This project is an opportunity to bring the data up to date.
  • Releasing this one-of-a-kind Bengali dialect language pair can encourage the development of more dialect-based engines.
  • "Barendri" is a dialect spoken almost equally in regions of India (in the state of West Bengal) and Bangladesh, which makes the target population more diverse.
  • The project will be useful mainly in two areas of linguistic research: facilitating research on low-resource languages, and understanding dialect variation by comparing the dialects of the two capitals, Kolkata (West Bengal) and Dhaka (Bangladesh), with a dialect that lies almost midway between them.

How it will benefit society, and whom

  • The project aims to develop an accurate translator for one of the most widely spoken languages.
  • The system will be one of the first in the industry to embed dialect information, and can be combined with various input systems (such as speech recognition, textual data, and OCR) to reach a wide audience and let people communicate in the language of their choice.
  • Vocabulary differs substantially across Bengali dialects, so much so that many dialects (such as Sylheti) cannot be understood by people who know only standard Bengali. This system can be extended into a first-of-its-kind interdialectal converter for such situations.
  • Like any translator, it can be used to facilitate communication between people.

Work plan

Community bonding period (May 1 - May 26):

  • Discussing the project ideas with the community and taking suggestions on how to implement the project.
  • Finding and exploring resources for Barendri.
  • Setting up the development environment and related tooling.

Work Period (May 27 - Aug 26):

Week 1 (27/05-02/06):

Week 2 (03/06-09/06):

Week 3 (10/06-16/06):

Week 4 (17/06-23/06):

Week 5 (24/06-30/06):

Week 6 (01/07-07/07):

Deliverable 1: Monolingual and Bilingual dictionary, basic transfer rules

Week 7 (12/07-18/07):

Week 8 (19/07-25/07):

Week 9 (26/07-01/08):

Week 10 (02/08-08/08):

Week 11 (09/08-15/08):

Week 12+ (16/08-26/08):

Project completed

Skills

Coding Challenge

Data Acquisition

Resources

Non summer of code plans