Difference between revisions of "User:Hiten"

From Apertium

Revision as of 10:01, 19 March 2023

Contact Information

Name: Hiten Vidhani

Location: India

University: Birla Institute of Technology and Science Pilani

Email address: vidhani.hiten2001@gmail.com

IRC: @hi101:matrix.org

Timezone: GMT+5:30

Github: hitenvidhani


Why is it that you are interested in Apertium?

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Marwari-Hindi (MWR-HIN) pair.

Proposal

Deliverables:

  • Creating the MWR-HIN bilingual dictionary.
  • Creating the MWR monolingual dictionary.
  • Updating the HIN monolingual dictionary, if required.
  • Building the transfer rules for the MWR-HIN pair.
  • Creating an MWR-HIN translator.
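
To make the bilingual dictionary deliverable more concrete, here is a minimal sketch of how a single entry could be generated. It assumes the usual Apertium `.dix` layout, in which an `<e>` entry holds a `<p>` pair with a left (`<l>`) and right (`<r>`) side, each carrying the lemma followed by `<s n="..."/>` symbols. The helper name `make_bidix_entry` and the lemmas `MWR_LEMMA` and `HIN_LEMMA` are placeholders for illustration, not real dictionary content.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(src_lemma, tgt_lemma, symbols):
    """Build one bilingual-dictionary entry as an ElementTree element.

    Hypothetical helper: the element names follow the common Apertium
    .dix layout (<e><p><l>..</l><r>..</r></p></e>); the lemmas and
    symbols passed in are supplied by the caller.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side_name, lemma in (("l", src_lemma), ("r", tgt_lemma)):
        side = ET.SubElement(pair, side_name)
        side.text = lemma
        # Each symbol (e.g. "n" for noun) becomes an <s n="..."/> child.
        for symbol in symbols:
            ET.SubElement(side, "s", {"n": symbol})
    return entry


# Placeholder lemmas; a real entry would use actual Marwari and Hindi words.
entry = make_bidix_entry("MWR_LEMMA", "HIN_LEMMA", ["n", "m"])
print(ET.tostring(entry, encoding="unicode"))
```

Generating entries from a word list in this way is only one option; entries can equally be written by hand in the `.dix` file.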

Reasons why Google and Apertium should sponsor it:

  • Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
  • The project adds diversity to Apertium by incorporating Marwari.
  • This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
  • The release of the first open-source MWR-HIN translator will aid developers in creating additional Marwari-related language pairs.

How and who it will benefit in society

  • The project will benefit native Marwari speakers as well as travelers to the Indian state of Rajasthan, where Marwari is the most widely spoken language. Rajasthan is a popular destination for tourists from around the world, and the translator will help them communicate with the locals.
  • It will assist Natural Language Processing researchers in conducting research in Marwari.
  • This project can be used by developers to create other language pairs that are closely related to Marwari.
  • In the long run, this project aims to reduce the language barrier between people from different regions.

Work plan

Community bonding period (May 4 - May 28):

  • Getting introduced to the organization and community of Apertium.
  • Understanding the code/projects which would be needed as a reference for my project.
  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Marwari.

Work Period (May 29 - Aug 28):

Week 1:

  • Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.

Week 2:

  • Getting familiar with the syntax for writing transfer rules.
  • Writing transfer rules for nouns and adjectives.

Week 3:

  • Adding verbs and other parts of speech to the dictionaries.
  • Writing transfer rules for the same.

Week 4:

  • Run tests
  • Update documentation
  • Prepare for the first evaluation

Deliverable 1: Monolingual and Bilingual dictionary, basic transfer rules

Week 5:

  • Translating essays/paragraphs, aiming for a word error rate (WER) below 50%.
  • Working on lexical selection rules.
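
The WER targets in this plan can be checked with a small script. The sketch below assumes the standard definition of word error rate: the word-level edit distance (substitutions, insertions, and deletions) between the system output and a reference translation, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not part of any Apertium tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a 4-word reference.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Note that WER can exceed 100% when the hypothesis is much longer than the reference, so the percentage targets are best read as averages over a test corpus.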

Week 6:

  • Getting adjectives testvoc clean.
  • Aiming for WER below 35%.

Week 7:

  • Expanding dictionaries further.
  • Working on disambiguation rules for MWR-HIN.

Week 8:

  • Expanding bilingual dictionary
  • Lexical selection rules
  • Disambiguation rules
  • Transfer rules
  • Prepare for the second evaluation

Deliverable 2: Improved bilingual dictionary and updated rules

Week 9&10:

  • Testvoc MWR-HIN
  • Discussing documentation details with mentors and organization.

Week 11&12:

  • Completing any pending tasks.
  • Final discussion and release of the project and documentation.
  • Project completed

Skills

I am a senior Computer Science undergraduate at Birla Institute of Technology and Science Pilani (BITS Pilani), India, an Institute of Eminence. During my internship at Ericsson, I built an NLP-based ticket classifier in Python. As part of the Natural Language Processing coursework at my university, I developed a POS tagger for a Hindi-English code-mixed dataset using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on the transformer architecture. Through these projects and my coursework, I have gained proficiency in Python, C++, XML, and HTML/CSS. In general, I love solving problems using various programming tools. I am a native Hindi speaker and can read and write Marwari. Because I have worked on Natural Language Processing projects and understand both HIN and MWR, I believe I am a good fit for this project. I would also be glad to be part of the wonderful Apertium community and to learn from it.


Non summer of code plans

I do not have any non-GSoC plans for the summer of 2023 and can spend 30 hours a week on this project. My university term starts in August, so I will work extensively earlier in the summer to compensate.