User:Natasha singh/GSoC2023Proposal


Latest revision as of 03:05, 4 April 2023

Contact Details

Name: Natasha Singh

E-mail address: natashasi475@gmail.com

IRC: natasha_singh

University: Indiana University - Bloomington, USA

Timezone: EDT (GMT-4)

GitHub: https://github.com/nsingh475


Why is it that you are interested in Apertium?

I am a first-year MS Computational Linguistics student at Indiana University - Bloomington. As a trilingual speaker of English, Hindi and Kumaoni (also spelled Kumauni, an Indo-Aryan language written in the Devanagari script), I am interested in contributing to the development of language resources and NLP. Many resources are available online for English and Hindi, but little content is published for a language like Kumaoni. Because Apertium is a rule-based machine translation platform, it is well suited to developing language resources and translation systems for less-resourced languages, which lack sufficient data to train a good ML- or DL-based NLP model.


Which of the published tasks are you interested in? What do you plan to do?

I am interested in working on the morphological analyzer task. Morphological analysis is an important step in developing any NLP project. Its results can be leveraged by many downstream tasks such as POS tagging, spell checking, information retrieval, named entity recognition and machine translation.

Recently, UNESCO has designated Kumaoni as a language in the unsafe category. Most native speakers are choosing Hindi or English over Kumaoni because these languages offer more resources and opportunities. This calls for consistent efforts to safeguard the language. I believe this project will give me an opportunity to contribute to the preservation and promotion of the language and culture of the Kumaoni community, which accounts for less than 0.2% of native speakers in India. This project can serve as a stepping stone toward extending various NLP applications to Kumaoni, which will in turn help facilitate communication and access to information for native speakers.


Work plan

  • Week 1: Writing morphology for each POS
  • Week 2: 10% token coverage (with 100 lexicon entries)
  • Week 3: 30% token coverage (with 500 lexicon entries)
  • Week 4: 50% token coverage
  • Week 5: 70% token coverage (finish closed classes)
  • Week 6: 80% token coverage (with 2000 lexicon entries)
  • Deliverable: Midterm Evaluation (Jul 10-14)
  • Week 7: 85% token coverage (with 5000 lexicon entries)
  • Week 8: 90% token coverage
  • Week 9: 92.5% token coverage (with 8000 lexicon entries)
  • Week 10: 95% token coverage
  • Week 11: 95% token coverage (with 10000 lexicon entries)
  • Week 12: Documentation and paper
  • Project completed
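
The token coverage targets above could be tracked with a small script. The following is a minimal sketch, assuming the analyzer output is in the Apertium stream format, where each token appears as ^surface/analysis$ and the analysis of an unknown token starts with *. The function name token_coverage and the tags in the sample string are illustrative assumptions, not taken from the project repository, and escaped delimiters inside tokens are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def token_coverage(analyzer_output: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analyzer_output)
    if not units:
        return 0.0
    # An unknown token is echoed back with a leading asterisk: ^foo/*foo$
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Hypothetical analyzer output: one known token, one unknown token.
# The tags <vblex><p3><sg><m> are placeholders chosen for illustration.
sample = "^जामो/जा<vblex><p3><sg><m>$ ^foo/*foo$"
print(f"{token_coverage(sample):.1%}")
```

Running the script on the analyzed corpus after each week gives a number to compare against that week's target.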


Coding Challenge

I have implemented a dummy morphological analyzer for the Kumaoni language: https://github.com/nsingh475/Kumaoni_MorphologicalAnalyzer

In this language, intransitive verbs agree with the subject in person, number and gender.

E.g., the verb जा means "to go".

  • For 1st person (SG/PL, any gender): जामोए
  • For 2nd person (SG/PL, any gender): जामछा
  • For 3rd person (SG, Female): जामैं
  • For 3rd person (SG, Male): जामो
  • For 3rd person (PL, any gender): जामईं

Similar behavior is shown by other intransitive verbs such as गा (to sing), रु (to cry), खा (to eat) and पि (to drink).
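
The agreement pattern for जा can be sketched as a suffix table. This is only an illustration: the split of each form into the stem जा plus a suffix is an assumption read off the five forms listed above, and the names PARADIGM, generate and analyze are hypothetical and do not come from the linked repository. Whether the other verbs take exactly these suffixes is not shown in the text, so the example is limited to जा.

```python
# Suffixes obtained by removing the stem जा from the five forms listed above.
# ASSUMPTION: this stem + suffix segmentation is illustrative only.
PARADIGM = {
    "मोए": "1st person, SG/PL, any gender",
    "मछा": "2nd person, SG/PL, any gender",
    "मैं": "3rd person, SG, female",
    "मो": "3rd person, SG, male",
    "मईं": "3rd person, PL, any gender",
}


def generate(stem):
    """Map every inflected form of the stem to its agreement label."""
    return {stem + suffix: label for suffix, label in PARADIGM.items()}


def analyze(form, stems):
    """Return (stem, label) pairs for every known stem that explains the form."""
    results = []
    for stem in stems:
        if form.startswith(stem):
            label = PARADIGM.get(form[len(stem):])
            if label is not None:
                results.append((stem, label))
    return results


for form, label in generate("जा").items():
    print(form, "=", label)
print(analyze("जामो", ["जा"]))
```

A real analyzer would encode the same table as a paradigm in a morphological dictionary rather than in Python; the dictionary lookup here only shows the mapping from surface form to agreement features.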