Difference between revisions of "User:Hiten"

From Apertium
 
(59 intermediate revisions by the same user not shown)
Line 1: Line 1:
#Contact Information
== Contact Information ==
'''Name:''' Hiten Vidhani
GSoC 2023 Application
Name: Hiten Vidhani
E-mail address: vidhani.hiten2001@gmail.com
IRC: @hi101:matrix.org
GitHub: hitenvidhani


'''Location:''' India
Why is it that you are interested in Apertium?


'''University:''' Birla Institute of Technology and Science Pilani
Which of the published tasks are you interested in? What do you plan to do?


'''Email address:''' vidhani.hiten2001@gmail.com
Include a proposal, including
* a title,
* reasons why Google and Apertium should sponsor it,
* a description of how and who it will benefit in society,
* and a detailed work plan (including, if possible, a schedule with milestones and deliverables).


'''IRC:''' @hi101:matrix.org
=== Work plan ===


'''Timezone:''' GMT+5:30
* Week 1:
* Week 2:
* Week 3:
* Week 4:


'''Github:''' hitenvidhani
* '''Deliverable #1'''


* Week 5:
* Week 6:
* Week 7:
* Week 8:


== Why is it that you are interested in Apertium? ==
* '''Deliverable #2'''
* Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
* Being open source, Apertium also makes all of its dictionaries and systems available to everyone for free.
* Apertium's rule-based translation system is particularly well suited to low-resource languages, which have limited data available. Because of this limited data, the rule-based approach is better suited than the neural network approach.
* The best part of Apertium is the community, which is always willing to assist you when needed. I would be thrilled to work with this incredible community of developers.


== Which of the published tasks are you interested in? What do you plan to do? ==
* Week 9:
I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.
* Week 10:
* Week 11:
* Week 12:


== Proposal ==
* '''Project completed'''
Deliverables:
* Creating the HIN-MWR bilingual dictionary.
* Creating the MWR monolingual dictionary.
* Updating the HIN monolingual dictionary, if required.
* Building the transfer rules for the HIN-MWR pair.
* Creating a HIN-MWR translator.


== Reasons why Google and Apertium should sponsor it: ==
Include time needed to think, to program, to document and to disseminate.
* Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
* The project adds diversity to Apertium by incorporating Marwari.
* This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
* The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.


== How and who it will benefit in society ==
If you are intending to disseminate to a conference, which conference are you intending to submit to. Make sure
* The project will benefit native Marwari speakers as well as people traveling to the Indian state of Rajasthan, where Marwari is the most widely spoken language. It will also help tourists visiting Rajasthan, a popular destination for travelers from around the world, communicate with the locals.
to factor in time taken to run any experiments/evaluations and write them up in your work plan.
* It will assist Natural Language Processing researchers in conducting research in Marwari.
* This project can be used by developers to create other language pairs that are closely related to Marwari.
* In the long run, this project aims to reduce the language barrier between people from different regions.


== Work plan ==
List your skills and give evidence of your qualifications. Tell us what is your current field of study,
=== Community bonding period (May 4 - May 28): ===
major, etc. Convince us that you can do the work.
* Getting introduced to the organization and community of Apertium.
* Understanding the code/projects which would be needed as a reference for my project.
* Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
* Exploring and finding resources for Marwari.


=== Work Period (May 29 - Aug 28): ===
List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for
Week 1 (29/05-04/06):
internships, and class-taking. Be specific about schedules and time commitments. we would like to be sure you have

at least 30 free hours a week to develop for our project.
* Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.
* Learning about paradigms and how to implement them for Marwari.

Week 2(05/06-11/06):
* Implementing paradigms in the Marwari dictionary.
* Getting familiar with the syntax for writing transfer rules.
* Studying the transfer rules currently used for similar language pairs.

Week 3(12/06-18/06):

* Implementing transfer rules for nouns and adjectives, for the chosen language pair.

Week 4(19/06-25/06):

* Adding verbs and other parts of speech to the dictionaries.

Week 5(26/06-02/07):

* Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.

Week 6(03/07-09/07):

* Running tests.
* Updating documentation.
* Preparing for the midterm evaluation.

'''Deliverable 1: Monolingual and bilingual dictionaries, basic transfer rules'''

Week 7(14/07-23/07):

* Translating essays/paragraphs, aiming for WER < 50%.
* Working on lexical selection rules.

Week 8(24/07-30/07):

* Getting testvoc clean for adjectives.
* Aiming for WER < 25%.

Week 9(31/07-06/08):

* Expanding dictionaries further.
* Working on disambiguation rules for HIN-MWR.

Week 10(07/08-13/08):

* Expanding bilingual dictionary.
* Lexical selection rules.
* Disambiguation rules.
* Transfer rules.

Week 11&12(14/08-28/08):

* Running testvoc for HIN-MWR.
* Discussing documentation details with the mentors and the organization.
* Completing any pending tasks.
* Final discussion and release of the project and documentation.

'''Project completed'''

== Skills ==
I am a senior Computer Science undergraduate at the prestigious Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier using Python. As part of my Natural Language Processing coursework at my university, I created a POS tagger for Hindi-English code-mixed datasets using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on transformer architectures.
Through these projects and my university coursework, I have gained proficiency in programming languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools.
I am a native Hindi speaker with the ability to read and write Marwari.
I believe I am a good fit for this project because I have previously worked in Natural Language Processing for my projects and understand both languages, Hindi and Marwari.

== Coding Challenge/Contributions ==
* Successfully set up the Apertium environment.
* Created a Pull Request fixing minor compilation errors for apertium-mar-eng : https://github.com/apertium/apertium-mar-eng/pull/1 .
* Working on creating a HIN-MWR translator : [https://github.com/hitenvidhani/apertium-hin-mwr apertium-hin-mwr], [https://github.com/hitenvidhani/apertium-mwr apertium-mwr] and [https://github.com/hitenvidhani/apertium-hin apertium-hin].
* Worked on adding words to the monodix and bidix, adding transfer rules in t1x, and adding paradigms to the MWR monodix.
Some outputs of the translation from HIN to MWR:
<center>[[File:hitenproposal1.png]]</center>
<br>
<center>[[File:hitenproposal2.png]]</center>
<br>
<center>[[File:hitenproposal3.png]]</center>
<br>
<center>[[File:hitenproposal4.png]]</center>

== Test Corpus ==
* HIN corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/hin_corpus.txt
* MWR corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/mwr_corpus.txt

== Resources ==
* https://wikitravel.org/en/Rajasthani_phrasebook
* https://www.languageshome.com/English-Marwadi.htm
* https://hi.glosbe.com/mwr/hi
* https://hattai.page.tl/marwari-dictionary.htm
* https://www.marwaribaatein.com/marwari-language
* https://crazychhora.com/learn-marwadi/

== Non summer of code plans ==
I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.

[[Category:GSoC_2023_student_proposals]]

Latest revision as of 15:03, 19 April 2023

Contact Information

Name: Hiten Vidhani

Location: India

University: Birla Institute of Technology and Science Pilani

Email address: vidhani.hiten2001@gmail.com

IRC: @hi101:matrix.org

Timezone: GMT+5:30

Github: hitenvidhani


Why is it that you are interested in Apertium?

  • Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
  • Being open source, Apertium also makes all of its dictionaries and systems available to everyone for free.
  • Apertium's rule-based translation system is particularly well suited to low-resource languages, which have limited data available. Because of this limited data, the rule-based approach is better suited than the neural network approach.
  • The best part of Apertium is the community, which is always willing to assist you when needed. I would be thrilled to work with this incredible community of developers.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.

Proposal

Deliverables:

  • Creating the HIN-MWR bilingual dictionary.
  • Creating the MWR monolingual dictionary.
  • Updating the HIN monolingual dictionary, if required.
  • Building the transfer rules for the HIN-MWR pair.
  • Creating a HIN-MWR translator.

Reasons why Google and Apertium should sponsor it:

  • Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
  • The project adds diversity to Apertium by incorporating Marwari.
  • This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
  • The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.

How and who it will benefit in society

  • The project will benefit native Marwari speakers as well as people traveling to the Indian state of Rajasthan, where Marwari is the most widely spoken language. It will also help tourists visiting Rajasthan, a popular destination for travelers from around the world, communicate with the locals.
  • It will assist Natural Language Processing researchers in conducting research in Marwari.
  • This project can be used by developers to create other language pairs that are closely related to Marwari.
  • In the long run, this project aims to reduce the language barrier between people from different regions.

Work plan

Community bonding period (May 4 - May 28):

  • Getting introduced to the organization and community of Apertium.
  • Understanding the code/projects which would be needed as a reference for my project.
  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Marwari.

Work Period (May 29 - Aug 28):

Week 1 (29/05-04/06):

  • Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.
  • Learning about paradigms and how to implement them for Marwari.

Week 2(05/06-11/06):

  • Implementing paradigms in the Marwari dictionary.
  • Getting familiar with the syntax for writing transfer rules.
  • Studying the transfer rules currently used for similar language pairs.

Week 3(12/06-18/06):

  • Implementing transfer rules for nouns and adjectives, for the chosen language pair.

Week 4(19/06-25/06):

  • Adding verbs and other parts of speech to the dictionaries.

Week 5(26/06-02/07):

  • Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.

Week 6(03/07-09/07):

  • Running tests.
  • Updating documentation.
  • Preparing for the midterm evaluation.

Deliverable 1: Monolingual and bilingual dictionaries, basic transfer rules

Week 7(14/07-23/07):

  • Translating essays/paragraphs, aiming for WER < 50%.
  • Working on lexical selection rules.

Week 8(24/07-30/07):

  • Getting testvoc clean for adjectives.
  • Aiming for WER < 25%.
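
The WER (word error rate) targets in the work plan can be made concrete with a small script. The following is only a minimal sketch, assuming WER is computed as the word-level edit distance between the machine output and a reference translation, divided by the number of words in the reference. The function name `word_error_rate` and the placeholder sentences are illustrative assumptions, not part of the proposal or of Apertium's own evaluation tooling, which may tokenize and count differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing in output
            insertion = current[j - 1] + 1  # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # Placeholder tokens stand in for a post-edited reference and MT output.
    reference = "w1 w2 w3 w4"
    hypothesis = "w1 x2 w3 w4"
    print(f"WER = {word_error_rate(reference, hypothesis):.0%}")
```

Under this definition, a target of WER < 25% means that fewer than one word in four of the reference would need to be inserted, deleted, or substituted to correct the translator's output.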

Week 9(31/07-06/08):

  • Expanding dictionaries further.
  • Working on disambiguation rules for HIN-MWR.

Week 10(07/08-13/08):

  • Expanding bilingual dictionary.
  • Lexical selection rules.
  • Disambiguation rules.
  • Transfer rules.

Week 11&12(14/08-28/08):

  • Running testvoc for HIN-MWR.
  • Discussing documentation details with the mentors and the organization.
  • Completing any pending tasks.
  • Final discussion and release of the project and documentation.

Project completed

Skills

I am a senior Computer Science undergraduate at the prestigious Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier using Python. As part of my Natural Language Processing coursework at my university, I created a POS tagger for Hindi-English code-mixed datasets using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on transformer architectures. Through these projects and my university coursework, I have gained proficiency in programming languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools. I am a native Hindi speaker with the ability to read and write Marwari. I believe I am a good fit for this project because I have previously worked in Natural Language Processing for my projects and understand both languages, Hindi and Marwari.

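Coding Challenge/Contributions

  • Successfully set up the Apertium environment.
  • Created a pull request fixing minor compilation errors for apertium-mar-eng: https://github.com/apertium/apertium-mar-eng/pull/1
  • Working on creating a HIN-MWR translator: apertium-hin-mwr (https://github.com/hitenvidhani/apertium-hin-mwr), apertium-mwr (https://github.com/hitenvidhani/apertium-mwr) and apertium-hin (https://github.com/hitenvidhani/apertium-hin).
  • Worked on adding words to the monodix and bidix, adding transfer rules in t1x, and adding paradigms to the MWR monodix.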

Some outputs of the translation from HIN to MWR:

Hitenproposal1.png


Hitenproposal2.png


Hitenproposal3.png


Hitenproposal4.png

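Test Corpus

  • HIN corpus: https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/hin_corpus.txt
  • MWR corpus: https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/mwr_corpus.txt

Resources

  • https://wikitravel.org/en/Rajasthani_phrasebook
  • https://www.languageshome.com/English-Marwadi.htm
  • https://hi.glosbe.com/mwr/hi
  • https://hattai.page.tl/marwari-dictionary.htm
  • https://www.marwaribaatein.com/marwari-language
  • https://crazychhora.com/learn-marwadi/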

Non summer of code plans

I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.