Difference between revisions of "User:Hiten"
Latest revision as of 15:03, 19 April 2023
Contents
- 1 Contact Information
- 2 Why is it that you are interested in Apertium?
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 Proposal
- 5 Reasons why Google and Apertium should sponsor it:
- 6 How and who it will benefit in society
- 7 Work plan
- 8 Skills
- 9 Coding Challenge/Contributions
- 10 Test Corpus
- 11 Resources
- 12 Non summer of code plans
Contact Information
Name: Hiten Vidhani
Location: India
University: Birla Institute of Technology and Science Pilani
Email address: vidhani.hiten2001@gmail.com
IRC: @hi101:matrix.org
Timezone: GMT+5:30
Github: hitenvidhani
Why is it that you are interested in Apertium?
- Machine Translation is an exciting field that affects many people directly or indirectly. As I enjoy developing programs/software that positively impact people's lives, Apertium provides me with the opportunity to do so.
- Because it is open source, Apertium makes all of its dictionaries and systems freely available to everyone.
- Apertium's rule-based translation system is particularly well suited to low-resource languages. Because such languages have very little data available, a rule-based approach can work better than a neural network approach.
- The best part of Apertium is the community, which is always willing to assist you when needed. I would be thrilled to work with this incredible community of developers.
Which of the published tasks are you interested in? What do you plan to do?
I am interested in the task "Bring an unreleased translation pair to releasable quality." I plan to develop the Hindi-Marwari (HIN-MWR) pair.
Proposal
Deliverables:
- Creating the HIN-MWR bilingual dictionary.
- Creating the MWR monolingual dictionary.
- Updating the HIN monolingual dictionary, if required.
- Building the transfer rules for the HIN-MWR pair.
- Creating a HIN-MWR translator.
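As a rough illustration of what the bilingual dictionary deliverable involves, the sketch below builds one entry in the usual Apertium bidix shape (an `<e>` element holding a `<p>` pair with `<l>` and `<r>` sides, each carrying `<s n="..."/>` tags). The helper name `bidix_entry`, the placeholder lemmas, and the tag names `n` and `m` are assumptions made for this example, not entries from the actual HIN-MWR dictionary.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left_lemma, right_lemma, tags):
    """Build one <e> element pairing a source lemma with a target lemma.

    Hypothetical helper for illustration; real entries are normally
    written by hand in the .dix file.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", left_lemma), ("r", right_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        # Each grammatical tag becomes an <s n="..."/> child of the side.
        for tag in tags:
            ET.SubElement(node, "s", {"n": tag})
    return entry


# Placeholder lemmas: substitute a real Hindi word and its Marwari equivalent.
entry = bidix_entry("HIN_LEMMA", "MWR_LEMMA", ["n", "m"])
print(ET.tostring(entry, encoding="unicode"))
```

Any tag used this way would also need a matching definition in the dictionary's symbol definitions section.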
Reasons why Google and Apertium should sponsor it:
- Marwari is spoken by approximately 22 million people in India and its neighbouring countries. Despite its widespread use, major translation tools such as Google Translate do not include it.
- The project adds diversity to Apertium by incorporating Marwari.
- This project will make a significant contribution to the community, with the potential to be useful for building projects or conducting research in the growing field of low-resource languages.
- The release of the first open-source HIN-MWR translator will aid developers in creating additional language pairs related to Marwari.
How and who it will benefit in society
- The project will benefit native Marwari speakers as well as people traveling to the Indian state of Rajasthan, where Marwari is the most widely spoken language. It will also help tourists visiting Rajasthan, a popular tourist destination, communicate with locals.
- It will assist Natural Language Processing researchers in conducting research in Marwari.
- This project can be used by developers to create other language pairs that are closely related to Marwari.
- In the long run, this project aims to reduce the language barrier between people from different regions.
Work plan
Community bonding period (May 4 - May 28):
- Getting introduced to the organization and community of Apertium.
- Understanding the code/projects which would be needed as a reference for my project.
- Discussing the project ideas and taking suggestions from the community regarding the implementation of the project.
- Exploring and finding resources for Marwari.
Work period (May 29 - Aug 28):
Week 1 (29/05-04/06):
- Adding nouns and adjectives to bilingual and MWR monolingual dictionary.
- Learning about paradigms and how to implement them for Marwari.
Week 2 (05/06-11/06):
- Implementing paradigms in the Marwari dictionary.
- Getting familiar with the syntax for writing transfer rules.
- Learning about currently used transfer rules implemented for other similar language pairs.
Week 3 (12/06-18/06):
- Implementing transfer rules for nouns and adjectives for the chosen language pair.
Week 4 (19/06-25/06):
- Adding verbs and other parts of speech to the dictionaries.
Week 5 (26/06-02/07):
- Writing transfer rules for the verbs and other parts of speech added to the dictionaries in the previous week.
Week 6 (03/07-09/07):
- Running tests.
- Updating documentation.
- Preparing for the midterm evaluation.
Deliverable 1: Monolingual and Bilingual dictionary, basic transfer rules
Week 7 (14/07-23/07):
- Translating essays/paragraphs, aiming to achieve WER < 50%.
- Working on lexical selection rules.
Week 8 (24/07-30/07):
- Getting testvoc clean for adjectives.
- Aiming to achieve WER < 25%.
Week 9 (31/07-06/08):
- Expanding dictionaries further.
- Working on disambiguation rules for HIN-MWR.
Week 10 (07/08-13/08):
- Expanding bilingual dictionary.
- Lexical selection rules.
- Disambiguation rules.
- Transfer rules.
Weeks 11 & 12 (14/08-28/08):
- Running testvoc for HIN-MWR.
- Discussing documentation details with mentors and organization.
- Completing any pending tasks.
- Final discussion and release of the project and documentation.
Project completed
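The testvoc steps in the plan amount to checking that translator output contains no debug markers. A minimal sketch of such a check is given below. It assumes the conventional Apertium markers are left in the output: `*` before an unknown word, `@` before a word missing from the bilingual dictionary, and `#` before a form the generator could not produce. The function name `count_markers` is hypothetical, and this is not the project's actual testvoc script.

```python
from collections import Counter

# Conventional Apertium debug markers (assumed to be present in the output).
MARKERS = {
    "*": "unknown word",
    "@": "missing bidix entry",
    "#": "generation failure",
}


def count_markers(text):
    """Count whitespace-separated tokens that start with a debug marker."""
    counts = Counter()
    for token in text.split():
        label = MARKERS.get(token[0])
        if label is not None:
            counts[label] += 1
    return counts


# Made-up output string with placeholder tokens, for illustration only.
sample = "word1 *word2 @word3 word4 #word5 *word6"
for label, n in sorted(count_markers(sample).items()):
    print(f"{label}: {n}")
```

A pair is usually called testvoc clean when a check like this, run over every form the source dictionary can analyse, reports nothing.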
Skills
I am a senior Computer Science undergraduate at the prestigious Birla Institute of Technology and Science Pilani (BITS Pilani) in India. I interned at Ericsson, where I built an NLP-based ticket classifier using Python. As part of my Natural Language Processing coursework at my university, I created a POS tagger for Hindi-English code-mixed datasets using a Hidden Markov Model. I also interned at the Artificial Intelligence Institute of South Carolina, where I worked on the transformer architecture.
Through these projects and my university coursework, I have gained proficiency in programming languages and tools such as Python, C++, XML, Git, bash scripting, and HTML/CSS. In general, I enjoy problem-solving with various programming tools.
I am a native Hindi speaker with the ability to read and write Marwari.
I believe I am a good fit for this project because I have previously worked in Natural Language Processing for my projects and understand both languages, HIN and MWR.
Coding Challenge/Contributions
- Successfully set up the Apertium environment.
- Created a pull request fixing minor compilation errors in apertium-mar-eng: https://github.com/apertium/apertium-mar-eng/pull/1
- Working on creating a HIN-MWR translator : apertium-hin-mwr, apertium-mwr and apertium-hin.
- Added words to the monodix and bidix, added transfer rules in the t1x file, and added paradigms to the MWR monodix.
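Some outputs of the translation from HIN to MWR are shown in the screenshots hitenproposal1.png, hitenproposal2.png, hitenproposal3.png and hitenproposal4.png.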
Test Corpus
- HIN corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/hin_corpus.txt
- MWR corpus : https://github.com/hitenvidhani/apertium-hin-mwr/blob/master/test_corpus/mwr_corpus.txt
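The WER targets in the work plan can be measured by translating the HIN corpus and comparing each output sentence with the corresponding MWR reference. The sketch below shows the standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` and the sample strings are assumptions for illustration. It is a plain reimplementation, not Apertium's own evaluation tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens; in practice these would be one line from the MWR
# reference corpus and the translator's output for the matching HIN line.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a target such as WER < 25% means that on average fewer than one in four reference words needs to be edited to turn the output into the reference.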
Resources
- https://wikitravel.org/en/Rajasthani_phrasebook
- https://www.languageshome.com/English-Marwadi.htm
- https://hi.glosbe.com/mwr/hi
- https://hattai.page.tl/marwari-dictionary.htm
- https://www.marwaribaatein.com/marwari-language
- https://crazychhora.com/learn-marwadi/
Non summer of code plans
I have no plans other than GSoC for the summer of 2023. I can devote 30-40 hours per week to this project. As my university curriculum begins in August, I would like to work ~40 hours per week in the months of June and July and ~20 hours per week in August.