Difference between revisions of "User:Maharaj/GSoC2024Proposal"

Latest revision as of 04:13, 2 April 2024

Apertium, a rule-based machine translation system, is a core contributor to open source. I have been interested in the problem of machine translation since my Bachelor's degree, and discovering an open-source system that performs machine translation with a rule-based approach made me revisit the history of machine translation and learn more about it. With over 7,000 languages across the globe, the majority are under-represented in terms of data creation and research activity. A rule-based translation system provides an opportunity to write down the data and linguistic rules of such languages.

Which of the published tasks are you interested in? What do you plan to do?

Implement a transducer-based morphological analyser/generator for the Bodo language.

About Bodo language: Bodo (ISO 639-3: brx) is one of the Scheduled Languages of India and belongs to the Sino-Tibetan language family, one of the four language families widely spoken in India. According to the 2011 Census, it has 1,454,547 native speakers and a total of 1,482,929 speakers, accounting for 0.12% of the total population of India. Historically, Bodo has a rich oral tradition but no standard written script (some scholars suggest that an earlier script was lost). Over time, various scripts such as Assamese, Roman, and Bengali were used. Bodo was recognized as one of the 22 Scheduled Languages of India in 2003, with Devanagari as its standard script [1]. It is an under-represented language, primarily due to (i) lack of available datasets, (ii) lack of NLP researchers for Bodo, (iii) technical constraints for data building, (iv) lack of participatory research, (v) lack of funding, (vi) the relative youth of the language, and (vii) unavailability of dictionaries and glossaries. Bodo is a morphologically rich language.

My Proposal

Brief:

My primary objective is to build a morphological analyser for the Bodo language (currently not available in Apertium). The work also includes the creation of a bilingual dictionary (Bodo-English), handling disambiguation, and the release of a raw text corpus.
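To make the analyser/generator idea concrete, here is a minimal sketch of the two directions: analysis maps a surface form to lemma-plus-tag readings, and generation maps a reading back to a surface form. It uses English stand-in entries rather than real Bodo data, and the names `STEMS`, `SUFFIXES`, `analyse`, and `generate` are hypothetical. An actual Apertium analyser would be compiled into a finite-state transducer rather than looping over tables as this toy does.

```python
# Toy lexicon: stem -> part-of-speech tag (English stand-ins, NOT Bodo data).
STEMS = {"cat": "n", "dog": "n"}
# Toy suffix table: surface suffix -> morphological tag (assumed for illustration).
SUFFIXES = {"": "sg", "s": "pl"}


def analyse(surface):
    """Return every lemma+tag reading whose stem and suffix spell `surface`."""
    readings = []
    for stem, pos in STEMS.items():
        for suffix, tag in SUFFIXES.items():
            if stem + suffix == surface:
                readings.append(f"{stem}<{pos}><{tag}>")
    return readings


def generate(lexical):
    """Inverse direction: turn a lemma+tag string back into a surface form."""
    for stem, pos in STEMS.items():
        for suffix, tag in SUFFIXES.items():
            if f"{stem}<{pos}><{tag}>" == lexical:
                return stem + suffix
    return None  # unknown reading


print(analyse("cats"))         # ['cat<n><pl>']
print(generate("dog<n><pl>"))  # dogs
```

Because both functions read the same tables, every reading produced by `analyse` can be fed back to `generate`, which is the property a single transducer gives for free.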

Why Google and Apertium should sponsor it?

Building linguistic resources for a language has a large impact on the entire research ecosystem, opening up new problems to tackle. It also supports the preservation of the language. On the application side, this work can serve as a foundation for building spelling checkers (none is publicly available for Bodo), the bilingual dictionary can improve other MT systems, and the work will also contribute towards efforts such as GATITOS [2].

How and who it will benefit in society?

The proposed work will directly benefit the Bodo community and contribute to the availability of language technologies for Bodo. It will improve the language diversity and inclusivity of Apertium. It will also provide the research community with an open-source morphological analyser and corpus.

Work plan

Community Bonding Period (May 1 - 26):

List and discuss possible approaches to corpus collection, the linguistic features of Bodo, and how Apertium can support them, such as spelling change rules and the creation of new, consistent symbols.
Read about symbols and become familiar with the Apertium documentation on morphological analysers.
Find language resources: dictionary books, corpus selection, or curation.

Week 1 (May 27 - June 03):

Write noun and pronoun rules
Write test cases for nouns and pronouns

Week 2 (June 03 - June 10):

Handle number, gender, and case inflection

Week 3 (June 10 - June 17):

Write verb and tense rules. Add conjunctions.

Week 4 (June 17 - June 24):

Write verb and tense rules. Write more test cases. Documentation.
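The test cases planned for these weeks could take a shape like the following sketch: each case pairs a surface form with its expected readings, and a small harness reports mismatches. This is an illustration under assumptions, not the script from the coding challenge. `run_tests`, `toy_analyse`, and the English entries are hypothetical stand-ins; a real run would call the compiled analyser instead of a lookup table.

```python
def run_tests(analyse, cases):
    """Compare analyser output with expected readings; return the failures."""
    failures = []
    for surface, expected in cases.items():
        got = sorted(analyse(surface))
        if got != sorted(expected):
            failures.append((surface, sorted(expected), got))
    return failures


# Stand-in analyser backed by a lookup table (hypothetical, English toy data).
TOY_ANALYSES = {"cats": ["cat<n><pl>"], "cat": ["cat<n><sg>"]}


def toy_analyse(surface):
    return TOY_ANALYSES.get(surface, [])


CASES = {
    "cats": ["cat<n><pl>"],
    "cat": ["cat<n><sg>"],
    "dogs": ["dog<n><pl>"],  # missing from the toy table, so this case fails
}

for surface, expected, got in run_tests(toy_analyse, CASES):
    print(f"FAIL {surface}: expected {expected}, got {got}")
```

Keeping the cases in a plain mapping makes it easy to add one entry per rule as each part of speech is implemented.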

Deliverable #1

Affixes and rules for handling nominal inflections
Affixes and rules for verb inflections
Conjunctions

Week 5 (June 24 - July 01):

Adjectives, Adverbs

Week 6 (July 01 - July 08):

Interjections and Classifiers, Documentation.

July 8 - July 12 - Mid Term evaluations

Week 7 (July 08 - July 15):

Particles, and Intersections

Week 8 (July 15 - July 22):

Compounding and Reduplication
Creation of bilingual dictionaries.
Documentation.

Deliverable #2

Inflection rules for adjectives, adverbs.
Inflection rules for interjections and classifiers.
Initial bilingual dictionaries (~5k)

Week 9 (July 29 - August 05):

Expanding bilingual dictionaries

Week 10 (August 05 - August 12):

Handling disambiguation

Week 11 (August 12 - August 19):

Handling disambiguation
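As a rough illustration of what handling disambiguation involves, the sketch below applies one context rule to a sentence whose tokens carry several readings. The rule (drop verb readings directly after an unambiguous determiner), the simplified tags, and the English example are all assumptions chosen for illustration. They are not Bodo rules, and the proposal does not fix a particular disambiguation formalism.

```python
def disambiguate(sentence):
    """Apply one toy context rule to a list of (surface, readings) pairs."""
    result = []
    prev_readings = []
    for surface, readings in sentence:
        kept = readings
        # Toy rule: right after an unambiguous determiner, discard verb readings.
        if len(prev_readings) == 1 and "<det>" in prev_readings[0]:
            filtered = [r for r in readings if "<vblex>" not in r]
            if filtered:  # never delete the last remaining reading
                kept = filtered
        result.append((surface, kept))
        prev_readings = kept
    return result


sentence = [
    ("the", ["the<det>"]),
    ("book", ["book<n><sg>", "book<vblex><inf>"]),
]
print(disambiguate(sentence))
# [('the', ['the<det>']), ('book', ['book<n><sg>'])]
```

The guard that keeps at least one reading matters in practice: a rule that removes every reading would leave the token with no analysis at all.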

Week 12 (August 19 - August 26):

Testing and documentation

Final Deliverable

Morphological analyser for the Bodo language
Bilingual dictionary (brx-eng), ~10k lexemes
Publicly released Bodo corpus

Skills and Qualifications

I’m currently a Ph.D. student at the Indian Institute of Technology Hyderabad, India, working in the field of Natural Language Processing. During the first year of my Ph.D., I worked on enabling zero-shot machine translation for extremely low-resource languages by exploiting surface-level similarity (EMNLP Findings 2023 [3]). The availability of quality corpora often limits language technologies, and corpus availability is in turn limited by technology awareness, preliminary tools, and the participation of native speakers on online platforms. As a step towards addressing this issue, during my Master's studies I explored building a monolingual corpus (LREC 2022 [4]) for Bodo (one of the 2010 UNESCO endangered languages) using easy-to-use tools like Google Keep. Bodo is relatively new in terms of a standard writing script and has two tones that are not indicated in written form. To address this lack of tonal markers, I was part of the first project that tried to solve the problem by framing it as word sense disambiguation (NERC [5]). During my Bachelor's, I was introduced to research, gained first-hand experience in data cleaning and pre-processing, and explored neural machine translation from English to Bodo (IEEE ICCMC 2019 [6]).

Coding Challenge

I have completed the coding challenge, which is available at https://github.com/maharajbrahma/apertium-brx. It covers basic nouns, pronouns, verbs, conjunctions, adverbs, adjectives, and postpositions. I have also created a script for testing the different parts of speech.

Non-Summer-of-Code plans

Apart from my research activities at my institute, I don’t have any other commitments, and I can satisfy the time requirements.

References

[1] https://www.mha.gov.in/sites/default/files/EighthSchedule_19052017.pdf

[2] https://github.com/google-research/url-nlp/tree/main/gatitos

[3] Maharaj Brahma, Kaushal Maurya, and Maunendra Sankar Desarkar. "SelectNoise: Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages." In EMNLP (Findings) 2023, pp. 1615-1629. 2023. https://aclanthology.org/2023.findings-emnlp.109.pdf

[4] Sanjib Narzary, Maharaj Brahma, Mwnthai Narzary, Gwmsrang Muchahary, Pranav Kumar Singh, Apurbalal Senapati, Sukumar Nandi, and Bidisha Som. "Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep." In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 6563-6570. 2022. https://aclanthology.org/2022.lrec-1.705.pdf

[5] Mwnthai Narzary, Maharaj Brahma, Sanjib Narzary, Apurbalal Senapati, and Pranav Kumar Singh. "A Computational Approach for the Tonal Identification in Bodo Language." In NERC 2022. Springer, Singapore, 2023. https://doi.org/10.1007/978-981-99-2609-1_3

[6] Sanjib Narzary, Maharaj Brahma, Bobita Singha, Rangjali Brahma, Bonali Dibragede, Sunita Barman, Sukumar Nandi, and Bidisha Som. "Attention based English-Bodo neural machine translation system for tourism domain." In 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 335-343. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8819699

@@ Line 1: / Line 1: @@
+[[Category:GSoC_2024_student_proposals]]
-<b>Contact Information</b>
+=== Contact Information ===
-<hr>
 <b>Name:</b> Maharaj Brahma
@@ Line 14: / Line 15: @@
 <b>Website:</b> https://maharajbrahma.github.io/
+<b>Native Language:</b> Bodo
+=== Why is it that you are interested in Apertium? ===
+Apertium, as a rule-based Machine Translation system, is one core contributor to open-source. I have been interested in the problem of Machine Translation since my Bachelor's degree, and having found out that there is an open-source system meant for doing machine translation using a rule-based approach made me revisit the history of machine translation and learn more about it. With over 7000+ languages across the globe, the majority of the languages are under-represented in terms of data creation and research activities being carried out. Rule-based translation system provides opportunities to write data and linguistic rules of languages.
+=== Which of the published tasks are you interested in? What do you plan to do? ===
+Implement a transducer-based morphological analyser/generator for the Bodo language.
+<b>About Bodo language:</b> Bodo (ISO-639: brx) is one of the Scheduled Languages of India and belongs to the Sino-Tibetan language family, one out of four language families widely spoken in India. According to the 2011 Census, it has over a million speakers. There are 1,454,547 native speakers and a total of 1,482,929 Bodo speakers. It accounts for 0.12% of the total population of India. Historically, Bodo has a rich oral tradition but with no standard written script (some scholars suggest for lost script). With passage of time, various scripts such as Assamese, Roman and Bengali were used. Bodo was recognized as one of  the 22nd Scheduled Languages of India in 2003, with Devanagari as its writing standard script [1]. It is one of the under-represented languages, primarily due to various issues such as (i) lack of available datasets, (ii) lack of NLP researchers for Bodo, (iii) technical constraints for data building, (iv) lack of participatory research, (v) lack of funding, (vi) youngness of language, (vii) unavailability of dictionaries and glossaries. As a language, Bodo is a morphologically rich language.
+=== My Proposal ===
+==== Brief: ====
+My primary objective is to build a morphology analyzer for the Bodo language (currently not available at Apertium). These include the creation of a bilingual dictionary (Bodo-English), handling disambiguation, and the release of raw corpus text.
+==== Why Google and Apertium should sponsor it? ====
+Building linguistics resources for a language largely impacts the entire research ecosystem, providing opportunities for new problems to be tackled. Apart from this, it acts as a support system for the preservation of languages. In the application aspect, this work can act as a foundation for building spelling checkers (Not available publicly for Bodo), the bilingual dictionary can improve other MT systems, and it will also contribute towards efforts such as GATITOS (https://github.com/google-research/url-nlp/tree/main/gatitos).
+==== How and who it will benefit in society? ====
+The proposed work will directly benefit the Bodo community and contribute to the availability of language technologies of the language. It will improve the language diversity and inclusivity of Apertium. It will also provide the research community with the open-source morphology tool and corpus.
+=== Work plan ===
+==== Community Bonding Period (May 1 - 26): ====
+* List and discuss the possible approaches for corpus collection, linguistics features of Bodo, and its support in Apertium like spelling change rules and creation of new consistent symbols.
+* Reading about symbols and getting familiar with the Apertium documentation on morphology analyser.
+* Finding language resources - dictionary books, corpus selection, or curation.
+==== Week 1 (May 27 - June 03): ====
+* Write Noun and Pronoun rules
+* Write Test cases for Nouns and Pronoun
+==== Week 2 (June 03 - June 10): ====
+* Handle Number, Gender and case inflection
+==== Week 3 (June 10 - June 17): ====
+* Write Verb and Tense rules, Adding conjunctions.
+==== Week 4 (June 17 - June 24): ====
+* Write Verb, Tense rules. Writing more test cases. Documentation.
+==== '''Deliverable #1''' ====
+* Affixes and rules for handling nominal inflections
+* Affixes and rules for verb inflections
+* Conjunctions
+==== Week 5 (June 24 - July 01): ====
+* Adjectives, Adverbs
+==== Week 6 (July 01 - July 08): ====
+* Interjections and Classifiers, Documentation.
+<b><i>July 8 - July 12  - Mid Term evaluations</i></b>
+==== Week 7 (July 08 - July 15):====
+* Particles, and Intersections
+==== Week 8 (July 15 - July 22):====
+* Compounding and Reduplication
+* Creation of bilingual dictionaries.
+* Documentation.
+==== '''Deliverable #2''' ====
+* Inflection rules for adjectives, adverbs.
+* Inflection rules for interjections and classifiers.
+* Initial bilingual dictionaries (~5k)
+==== Week 9 (July 29 - August 05):====
+Expanding bilingual dictionaries
+==== Week 10 (August 05 - August 12):====
+Handling disambiguation
+==== Week 11 (August 12 - August 19):====
+Handling disambiguation
+==== Week 12 (August 19 - August 26):====
+Testing and documentation
+==== '''Final Deliverable''' ====
+* Morphology analyser of Bodo language
+* Bilingual dictionary (brx-eng) ~ 10k lexemes
+* Publicly released Bodo Corpus
+=== Skills and Qualifications ===
+I’m currently a Ph.D. student at the Indian Institute of Technology Hyderabad India, working in the field of Natural Language Processing. During the first year of my Ph.D. I worked on enabling zero-shot machine translation for extremely low-resource language by exploiting surface-level similarity (EMNLP Finding 2023 [3]). The availability of quality corpus often limits language technologies. The technology awareness,  preliminary tools, and participation for language inclusivity in online platforms by native speakers limit corpus availability in a step towards addressing this issue during my Master studies, I explored building a monolingual corpus (LREC 2022 [4]) for Bodo (One of the 2010 UNESCO endangered languages), using easy-to-use tools like Google Keep. Bodo as a language is relatively new in terms of standard writing script and has two tones with no indication of tones in written form. To address this issue of tonal markers in written form, I was part of the first project that tried to solve this problem by framing it as word sense disambiguation (NERC [5]). During my Bachelors, I was introduced to research where I had first-hand experience in data cleaning and pre-processing and explored neural machine translation from English (IEEE ICCMC 2019 [6]) to the Bodo language.
+=== Coding Challenge ===
+I have completed the coding challenge, which is available at https://github.com/maharajbrahma/apertium-brx. It covers basic Nouns, Pronouns, Verbs, Conjunctions, Adverbs, Adjectives and Post Positions. Also, the script for testing different types of parts of scripts has been created.
+=== Non-Summer-of-Code plans ===
+Apart from my research activities at my institute, I don’t have any other commitments and satisfies the time requirements.
+=== References ===
+[1] https://www.mha.gov.in/sites/default/files/EighthSchedule_19052017.pdf
+[2] https://github.com/google-research/url-nlp/tree/main/gatitos
+[3] Maharaj Brahma, Kaushal Maurya, Maunendra Sankar Desarkar: SelectNoise: Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages. EMNLP (Findings) 2023: 1615-1629. https://aclanthology.org/2023.findings-emnlp.109.pdf
+[4] Sanjib Narzary, Maharaj Brahma, Mwnthai Narzary, Gwmsrang Muchahary, Pranav Kumar Singh, Apurbalal Senapati, Sukumar Nandi, and Bidisha Som. "Generating Monolingual Dataset for Low Resource Language Bodo from old books using Google Keep." In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 6563-6570. 2022. https://aclanthology.org/2022.lrec-1.705.pdf
+[5] Mwnthai Narzary, Maharaj Brahma, Sanjib Narzary, Apurbalal Senapati, Singh, Pranav Kumar Singh (2023). A Computational Approach for the Tonal Identification in Bodo Language. In NERC 2022. Springer, Singapore. https://doi.org/10.1007/978-981-99-2609-1_3.
+[6] Sanjib Narzary, Maharaj Brahma, Bobita Singha, Rangjali Brahma, Bonali Dibragede, Sunita Barman, Sukumar Nandi, and Bidisha Som. "Attention based English-Bodo neural machine translation system for tourism domain." In 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 335-343. IEEE, 2019. https://ieeexplore.ieee.org/abstract/document/8819699