User:Natasha singh/GSoC2023Proposal

From Apertium
Revision as of 22:35, 2 April 2023 by Natasha singh

Contact Details

Name: Natasha Singh

E-mail address: natashasi475@gmail.com

IRC: natasha_singh

University: Indiana University - Bloomington, USA

Timezone: EDT (GMT-4)

GitHub: https://github.com/nsingh475


Why is it that you are interested in Apertium?

I am a first-year MS Computational Linguistics student at Indiana University - Bloomington. As a trilingual speaker of English, Hindi and Kumaoni/Kumauni (an Indo-Aryan language written in the Devanagari script), I am interested in contributing to the development of language resources and NLP. Many resources are available online for English and Hindi, but little content is published for a language like Kumaoni. Since Apertium is a rule-based machine translation platform, it is well suited to developing language resources and translation systems for less-resourced languages, which do not have enough data to train a good ML- or DL-based NLP model.


Which of the published tasks are you interested in? What do you plan to do?

I am interested in working on the morphological analyzer task. Morphological analysis is an important step in many NLP projects. Its results can be leveraged by downstream tasks such as POS tagging, spell checking, information retrieval, named entity recognition and machine translation.

Recently, UNESCO has classified Kumaoni as a language in the "unsafe" category. Most native speakers are choosing Hindi or English over Kumaoni because these languages offer more resources and opportunities. This calls for consistent efforts to safeguard the language. I believe this project will give me an opportunity to contribute to the preservation and promotion of the language and culture of the Kumaoni community, whose native speakers make up less than 0.2% of India's population. This project can serve as a stepping stone for extending various NLP applications to Kumaoni, which will in turn help facilitate communication and access to information for native speakers.


Work plan

  • Week 1: 10% token coverage (with 100 lexicon entries)
  • Week 2: 30% token coverage (with 500 lexicon entries)
  • Week 3: 50% token coverage
  • Week 4: 70% token coverage (with 2000 lexicon entries)
  • Deliverable #1: Closed classes completed
  • Week 5: 75% token coverage
  • Week 6: 80% token coverage
  • Week 7: 85% token coverage (with 5000 lexicon entries)
  • Week 8: 90% token coverage
  • Deliverable #2: Evaluation
  • Week 9: 92.5% token coverage (with 8000 lexicon entries)
  • Week 10: 95% token coverage
  • Week 11: 95% token coverage (with 10000 lexicon entries)
  • Week 12: Documentation: Paper
  • Project completed
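
The weekly targets above are stated as token coverage. As a rough illustration of how that figure could be tracked, the sketch below assumes the analyser output is in the Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token carries a leading asterisk in its analysis. The function name token_coverage, the sample string and its tags are made up for this example and are not part of the proposal; escaped characters in the stream are ignored for simplicity.

```python
import re

# One analysed token in Apertium stream format: ^surface/analysis1/analysis2$
# An unknown token is written as ^surface/*surface$.
# Simplification: escaped characters such as \/ or \$ are not handled.
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def token_coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for _surface, analyses in TOKEN_RE.findall(analysed_text):
        total += 1
        # A leading '*' marks a token the analyser did not recognise.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: two known tokens, two unknown ones.
sample = "^ghar/ghar<n>$ ^xyz/*xyz$ ^cha/cha<vblex>$ ^abc/*abc$"
print(f"{token_coverage(sample):.1%}")  # prints 50.0%
```

Running such a check on the same held-out corpus each week would make the coverage milestones directly comparable.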