User:Hiten


Contact Information

Name: Hiten Vidhani

Location: India

University: Birla Institute of Technology and Science Pilani

Email address: vidhani.hiten2001@gmail.com

IRC: @hi101:matrix.org

Timezone: GMT+5:30

Github: hitenvidhani


Why is it that you are interested in Apertium?

  • Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
  • Because it is open source, Apertium makes its dictionaries and translation systems available to everyone for free.
  • Apertium's rule-based translation system is particularly well suited to low-resource languages. With so little data available for these languages, a rule-based approach is often more practical than a neural network approach.
  • The best part of Apertium is the community, which is always willing to assist when needed. I would be thrilled to work with this incredible community of developers.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.

Proposal

Deliverables:

  • Creating the HIN-MWR bilingual dictionary.
  • Creating the MWR monolingual dictionary.
  • Updating the HIN monolingual dictionary, if required.
  • Building the transfer rules for the HIN-MWR pair.
  • Creating a HIN-MWR translator.

Reasons why Google and Apertium should sponsor it:

  • Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
  • The project adds diversity to Apertium by incorporating Marwari.
  • This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
  • The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.

Who it will benefit in society, and how

  • The project will benefit native Marwari speakers as well as visitors to the Indian state of Rajasthan, where Marwari is the most widely spoken language. Rajasthan is a popular tourist destination, and the translator will help tourists communicate with the locals.
  • It will assist Natural Language Processing researchers in conducting research in Marwari.
  • This project can be used by developers to create other language pairs that are closely related to Marwari.
  • In the long run, this project aims to reduce the language barrier between people from different regions.

Work plan

Community bonding period (May 4 - May 28):

  • Getting introduced to the organization and community of Apertium.
  • Understanding the code/projects which would be needed as a reference for my project.
  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Marwari.

Work period (May 29 - Aug 28):

Week 1 (29/05-04/06):

  • Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.
  • Learning about paradigms and how to implement them for Marwari.

Week 2 (05/06-11/06):

  • Implementing paradigms in the Marwari dictionary.
  • Getting familiar with the syntax for writing transfer rules.
  • Learning about the transfer rules currently implemented for similar language pairs.

Week 3 (12/06-18/06):

  • Implementing transfer rules for nouns and adjectives, for the chosen language pair.

Week 4 (19/06-25/06):

  • Adding verbs and other parts of speech to the dictionaries.

Week 5 (26/06-02/07):

  • Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.

Week 6 (03/07-09/07):

  • Running tests.
  • Updating documentation.
  • Preparing for the midterm evaluation.

Deliverable 1: Monolingual and bilingual dictionaries, basic transfer rules

Week 7 (14/07-23/07):

  • Translating essays/paragraphs, aiming for a word error rate (WER) below 50%.
  • Working on lexical selection rules.

Week 8 (24/07-30/07):

  • Getting a clean testvoc for adjectives.
  • Aiming for WER < 25%.
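
As a rough illustration of how the WER targets for weeks 7 and 8 could be checked, here is a minimal Python sketch. It assumes WER is defined in the usual way, as the word-level edit distance (substitutions + deletions + insertions) between the translator output and a reference translation, divided by the number of words in the reference. The function name `word_error_rate` and the sample tokens are placeholders for illustration only; this is not the evaluation tool used by Apertium.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens stand in for a reference translation and system output.
    reference = "a b c d"
    hypothesis = "a x c"
    print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

With the placeholder input above, one substitution and one deletion against a four-word reference give a WER of 50%, which sits exactly on the week 7 threshold. In practice the score would be computed over a whole test corpus rather than a single sentence.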

Week 9 (31/07-06/08):

  • Expanding dictionaries further.
  • Working on disambiguation rules for HIN-MWR.

Week 10 (07/08-13/08):

  • Expanding bilingual dictionary.
  • Lexical selection rules.
  • Disambiguation rules.
  • Transfer rules.

Weeks 11 & 12 (14/08-28/08):

  • Running testvoc for HIN-MWR.
  • Discussing documentation details with mentors and organization.
  • Completing any pending tasks.
  • Final discussion and release of the project and documentation.

Project completed

Skills

I am a senior Computer Science undergraduate at the Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier in Python. As part of my Natural Language Processing coursework, I created a POS tagger for Hindi-English code-mixed datasets using a hidden Markov model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on the transformer architecture. Through these projects and my university coursework, I have gained proficiency in languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools. I am a native Hindi speaker and can read and write Marwari. I believe I am a good fit for this project because I have prior Natural Language Processing experience and know both languages of the pair, HIN and MWR.

Coding Challenge/Contributions

Some outputs of the translation from HIN to MWR:

[Screenshots: Hitenproposal1.png, Hitenproposal2.png, Hitenproposal3.png, Hitenproposal4.png]

Test Corpus

Resources

Non-Summer of Code plans

I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.