Difference between revisions of "User:Hiten"

(58 intermediate revisions by the same user not shown)
'''Email address:''' vidhani.hiten2001@gmail.com

'''IRC:''' @hi101:matrix.org

== Why is it that you are interested in Apertium? ==
* Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
* Because it is open source, Apertium makes its dictionaries and systems available to everyone for free.
* Apertium's rule-based translation system is particularly well suited to low-resource languages, for which little data is available. With so little data, a rule-based approach is often a better fit than a neural network approach.
* The best part of Apertium is the community, which is always willing to assist you when needed. I would be thrilled to work with this incredible community of developers.

== Which of the published tasks are you interested in? What do you plan to do? ==
I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.

== Proposal ==
Deliverables:
* Creating the HIN-MWR bilingual dictionary.
* Creating the MWR monolingual dictionary.
* Updating the HIN monolingual dictionary, if required.
* Building the transfer rules for the HIN-MWR pair.
* Creating a HIN-MWR translator.

== Reasons why Google and Apertium should sponsor it: ==
* Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
* The project adds diversity to Apertium by incorporating Marwari.
* This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
* The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.

== How and who it will benefit in society ==
* The project will benefit native Marwari speakers as well as people traveling to the Indian state of Rajasthan, where Marwari is the most widely spoken language. In particular, it will help tourists visiting Rajasthan, a destination popular around the world, communicate with the locals.
* It will assist Natural Language Processing researchers in conducting research in Marwari.
* This project can be used by developers to create other language pairs that are closely related to Marwari.
* In the long run, this project aims to reduce the language barrier between people from different regions.

== Work plan ==
=== Community bonding period (May 4 - May 28): ===
* Getting introduced to the organization and community of Apertium.
* Understanding the code/projects which would be needed as a reference for my project.
* Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
* Exploring and finding resources for Marwari.

=== Work period (May 29 - Aug 28): ===
Week 1 (29/05-04/06):
* Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.
* Learning about paradigms and how to implement them for Marwari.

Week 2 (05/06-11/06):
* Implementing paradigms in the Marwari dictionary.
* Getting familiar with the syntax for writing transfer rules.
* Studying the transfer rules currently used for similar language pairs.

Week 3 (12/06-18/06):
* Implementing transfer rules for nouns and adjectives for the chosen language pair.

Week 4 (19/06-25/06):
* Adding verbs and other parts of speech to the dictionaries.

Week 5 (26/06-02/07):
* Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.

Week 6 (03/07-09/07):
* Running tests.
* Updating documentation.
* Preparing for the midterm evaluation.

'''Deliverable 1: Monolingual and bilingual dictionaries, basic transfer rules'''

Week 7 (14/07-23/07):
* Translating essays/paragraphs, aiming for a word error rate (WER) below 50%.
* Working on lexical selection rules.

Week 8 (24/07-30/07):
* Making testvoc clean for adjectives.
* Aiming for WER below 25%.

Week 9 (31/07-06/08):
* Expanding dictionaries further.
* Working on disambiguation rules for HIN-MWR.

Week 10 (07/08-13/08):
* Expanding the bilingual dictionary.
* Lexical selection rules.
* Disambiguation rules.
* Transfer rules.

Weeks 11 & 12 (14/08-28/08):
* Testvoc for HIN-MWR.
* Discussing documentation details with mentors and organization.
* Completing any pending tasks.
* Final discussion and release of the project and documentation.

'''Project completed'''

== Skills ==
I am a senior Computer Science undergraduate at the prestigious Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier using Python. As part of my Natural Language Processing coursework at my university, I created a POS tagger for Hindi-English code-mixed datasets using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on transformer architecture.

Through these projects and my university coursework, I have gained proficiency in programming languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools.

I am a native Hindi speaker with the ability to read and write Marwari.

I believe I am a good fit for this project because I have previously worked in Natural Language Processing for my projects and understand both languages, HIN and MWR.

== Coding Challenge/Contributions ==
* Successfully set up the Apertium environment.
* Created a Pull Request fixing minor compilation errors for apertium-mar-eng : https://github.com/apertium/apertium-mar-eng/pull/1 .
* Working on creating a HIN-MWR translator : [https://github.com/hitenvidhani/apertium-hin-mwr apertium-hin-mwr], [https://github.com/hitenvidhani/apertium-mwr apertium-mwr] and [https://github.com/hitenvidhani/apertium-hin apertium-hin].
* Worked on adding words to the monodix and bidix, adding a transfer rule in the t1x file, and adding paradigms to the MWR monodix.

Some outputs of the translation from HIN to MWR:
<center>[[File:hitenproposal1.png]]</center>
<br>
<center>[[File:hitenproposal2.png]]</center>
<br>
<center>[[File:hitenproposal3.png]]</center>
<br>
<center>[[File:hitenproposal4.png]]</center>

== Test Corpus ==
* HIN corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/hin_corpus.txt
* MWR corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/mwr_corpus.txt

== Resources ==
* https://wikitravel.org/en/Rajasthani_phrasebook
* https://www.languageshome.com/English-Marwadi.htm
* https://hi.glosbe.com/mwr/hi
* https://hattai.page.tl/marwari-dictionary.htm
* https://www.marwaribaatein.com/marwari-language
* https://crazychhora.com/learn-marwadi/

== Non summer of code plans ==
I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.

[[Category:GSoC_2023_student_proposals]]

Latest revision as of 15:03, 19 April 2023

Contact Information

Name: Hiten Vidhani

Location: India

University: Birla Institute of Technology and Science Pilani

Email address: vidhani.hiten2001@gmail.com

IRC: @hi101:matrix.org

Timezone: GMT+5:30

Github: hitenvidhani


Why is it that you are interested in Apertium?

  • Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
  • Because it is open source, Apertium makes its dictionaries and systems available to everyone for free.
  • Apertium's rule-based translation system is particularly well suited to low-resource languages, for which little data is available. With so little data, a rule-based approach is often a better fit than a neural network approach.
  • The best part of Apertium is the community, which is always willing to assist you when needed. I would be thrilled to work with this incredible community of developers.

Which of the published tasks are you interested in? What do you plan to do?

I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.

Proposal

Deliverables:

  • Creating the HIN-MWR bilingual dictionary.
  • Creating the MWR monolingual dictionary.
  • Updating the HIN monolingual dictionary, if required.
  • Building the transfer rules for the HIN-MWR pair.
  • Creating a HIN-MWR translator.

Reasons why Google and Apertium should sponsor it:

  • Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
  • The project adds diversity to Apertium by incorporating Marwari.
  • This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
  • The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.

How and who it will benefit in society

  • The project will benefit native Marwari speakers as well as people traveling to the Indian state of Rajasthan, where Marwari is the most widely spoken language. In particular, it will help tourists visiting Rajasthan, a destination popular around the world, communicate with the locals.
  • It will assist Natural Language Processing researchers in conducting research in Marwari.
  • This project can be used by developers to create other language pairs that are closely related to Marwari.
  • In the long run, this project aims to reduce the language barrier between people from different regions.

Work plan

Community bonding period (May 4 - May 28):

  • Getting introduced to the organization and community of Apertium.
  • Understanding the code/projects which would be needed as a reference for my project.
  • Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
  • Exploring and finding resources for Marwari.

Work period (May 29 - Aug 28):

Week 1 (29/05-04/06):

  • Adding nouns and adjectives to the bilingual and MWR monolingual dictionaries.
  • Learning about paradigms and how to implement them for Marwari.
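
To give a rough idea of what adding a word to the bilingual dictionary involves, the sketch below builds a single entry in the XML layout used by Apertium `.dix` files. It is only an illustration: the helper name `bidix_entry`, the lemmas `HIN_LEMMA` and `MWR_LEMMA`, and the tags `n` and `m` are placeholders chosen for this example, not entries from the actual HIN-MWR dictionaries.

```python
import xml.etree.ElementTree as ET


def bidix_entry(src_lemma: str, tgt_lemma: str, tags: list[str]) -> str:
    """Return one bilingual-dictionary entry as an XML string.

    The entry pairs a source lemma (left side, <l>) with a target
    lemma (right side, <r>); each side carries the same tag symbols.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", src_lemma), ("r", tgt_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        for tag in tags:
            # Each grammatical tag becomes an <s n="..."/> symbol element.
            ET.SubElement(node, "s", {"n": tag})
    return ET.tostring(entry, encoding="unicode")


# Placeholder lemmas and tags (assumptions for illustration only).
print(bidix_entry("HIN_LEMMA", "MWR_LEMMA", ["n", "m"]))
```

In practice entries are usually written by hand or generated in bulk from word lists; a helper like this would only be useful for the bulk case.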

Week 2 (05/06-11/06):

  • Implementing paradigms in the Marwari dictionary.
  • Getting familiar with the syntax for writing transfer rules.
  • Studying the transfer rules currently used for similar language pairs.

Week 3 (12/06-18/06):

  • Implementing transfer rules for nouns and adjectives, for the chosen language pair.

Week 4 (19/06-25/06):

  • Adding verbs and other parts of speech to the dictionaries.

Week 5 (26/06-02/07):

  • Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.

Week 6 (03/07-09/07):

  • Run tests.
  • Update documentation.
  • Preparing for the midterm evaluation.

Deliverable 1: Monolingual and Bilingual dictionary, basic transfer rules

Week 7 (14/07-23/07):

  • Translating essays/paragraphs, aiming for a word error rate (WER) below 50%.
  • Working on lexical selection rules.

Week 8 (24/07-30/07):

  • Making testvoc clean for adjectives.
  • Aiming for WER below 25%.
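
The WER targets for weeks 7 and 8 can be checked with a small script. The sketch below assumes the common definition of word error rate: the word-level edit distance (substitutions, insertions, deletions) between the system output and a reference translation, divided by the number of words in the reference. The function name `wer` and the whitespace tokenization are assumptions made for this example; the evaluation tool actually used for the project may tokenize and count differently.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` against `reference`.

    Both strings are split on whitespace; the result is the
    word-level Levenshtein distance divided by the reference length.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("a b c d", "a x c d")` gives 0.25, since one of the four reference words is substituted. Averaged over the test corpus, a value below 0.25 would meet the week 8 target.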

Week 9 (31/07-06/08):

  • Expanding dictionaries further.
  • Working on disambiguation rules for HIN-MWR.

Week 10 (07/08-13/08):

  • Expanding bilingual dictionary.
  • Lexical selection rules.
  • Disambiguation rules.
  • Transfer rules.

Weeks 11 & 12 (14/08-28/08):

  • Testvoc for HIN-MWR.
  • Discussing documentation details with mentors and organization.
  • Completing any pending tasks.
  • Final discussion and release of the project and documentation.

Project completed

Skills

I am a senior Computer Science undergraduate at the prestigious Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier using Python. As part of my Natural Language Processing coursework at my university, I created a POS tagger for Hindi-English code-mixed datasets using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on transformer architecture.

Through these projects and my university coursework, I have gained proficiency in programming languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools.

I am a native Hindi speaker with the ability to read and write Marwari.

I believe I am a good fit for this project because I have previously worked in Natural Language Processing for my projects and understand both languages, HIN and MWR.

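Coding Challenge/Contributions

  • Successfully set up the Apertium environment.
  • Created a Pull Request fixing minor compilation errors for apertium-mar-eng: https://github.com/apertium/apertium-mar-eng/pull/1
  • Working on creating a HIN-MWR translator: apertium-hin-mwr (https://github.com/hitenvidhani/apertium-hin-mwr), apertium-mwr (https://github.com/hitenvidhani/apertium-mwr) and apertium-hin (https://github.com/hitenvidhani/apertium-hin).
  • Worked on adding words to the monodix and bidix, adding a transfer rule in the t1x file, and adding paradigms to the MWR monodix.

Some outputs of the translation from HIN to MWR: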

Hitenproposal1.png


Hitenproposal2.png


Hitenproposal3.png


Hitenproposal4.png

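Test Corpus

  • HIN corpus: https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/hin_corpus.txt
  • MWR corpus: https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/mwr_corpus.txt

Resources

  • https://wikitravel.org/en/Rajasthani_phrasebook
  • https://www.languageshome.com/English-Marwadi.htm
  • https://hi.glosbe.com/mwr/hi
  • https://hattai.page.tl/marwari-dictionary.htm
  • https://www.marwaribaatein.com/marwari-language
  • https://crazychhora.com/learn-marwadi/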

Non summer of code plans

I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.