Difference between revisions of "User:Daedalus/GSoC2024Proposal"
Revision as of 08:31, 30 March 2024
This page is incomplete, please ignore
Contact Information
Name: Chaitanya Gambalii
Location: India
University: Indian Institute of Technology (BHU) Varanasi
Email address: gschaitanya2003@gmail.com
IRC: daedealus03 (OFTC)
(previously daedalus in Google Code-In 2019)
@daedalus2003@matrix.org
Timezone: GMT+5:30
Github: gs-chaitanya
LinkedIn: gs-chaitanya
Background
I am currently pursuing a Bachelor of Technology degree in Electronics and Communication Engineering at the Indian Institute of Technology (BHU) Varanasi. I love learning new languages, and I have a keen interest in Natural Language Processing and Linguistics. I previously contributed to Apertium during Google Code-In 2019, where I was one of the six finalists (as Daedalus) selected from Apertium.
Skills
- Languages: Python, C/C++, JavaScript, bash
- Frameworks: NumPy, Pandas, PyTorch, TensorFlow, etc.
Why Am I Interested in Apertium
In a field dominated by machine-learning-based translation, Apertium is a breath of fresh air and ingenuity. Its rule-based approach is interesting and far more interpretable than the data-hungry black boxes that modern machine-learning systems tend to be. Apertium's open-source nature, combined with its support for low-resource languages, makes it a promising alternative to machine-learning-based translation methods that still fail to support many languages.
Which of the published tasks am I interested in? What do I plan to do?
I am interested in the task "Dictionary Induction from Parallel Corpora". The aim is to build a single program that constructs bidirectional dictionaries for a language pair from a pair of parallel corpora, i.e., the same content in two different languages.
Coding Challenge
The coding challenge outlined here required me to develop a script that "reads two parallel corpora, applies the appropriate monolingual taggers and some word-aligner, and then prints a list of paired words." I completed that task, and the source code can be accessed here. I built a Python script that runs the Apertium monolingual taggers on the source and target corpora and then uses eflomal for word alignment. I also experimented with other word aligners such as efmaral and fast_align. The script was reviewed and approved by Daniel Swanson (@dangswan).
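The last step of such a pipeline, turning aligner output into a list of paired words, can be sketched as follows. This is a minimal illustration and not the submitted script: it assumes the sentences are already tokenized on whitespace and that the aligner emits Pharaoh-format links (`i-j` index pairs, as fast_align does). The function name `paired_words` and the sample sentences are made up for the example.

```python
from collections import Counter


def paired_words(src_sentences, tgt_sentences, alignments):
    """Count (source word, target word) pairs from Pharaoh-format alignments.

    Each alignment line looks like "0-0 1-2", where "i-j" links the i-th
    source token to the j-th target token (both 0-indexed).
    """
    counts = Counter()
    for src, tgt, align in zip(src_sentences, tgt_sentences, alignments):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        for link in align.split():
            i, j = (int(x) for x in link.split("-"))
            # Skip links that point outside either sentence.
            if i < len(src_tokens) and j < len(tgt_tokens):
                counts[(src_tokens[i], tgt_tokens[j])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    src = ["the house is small", "the book is small"]
    tgt = ["la casa es pequeña", "el libro es pequeño"]
    aligns = ["0-0 1-1 2-2 3-3", "0-0 1-1 2-2 3-3"]
    for (s, t), n in paired_words(src, tgt, aligns).most_common():
        print(f"{s}\t{t}\t{n}")
```

In a real run the tokens would be the lemma and tag output of the monolingual taggers rather than raw surface forms, so that the counts accumulate per lemma.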
Previous Contributions
- Worked on the Apertium-English Dictionary [1]
- Disambiguated 500 tokens of text in Apertium-Eng [2]
- Worked on logo design for UD-Annotatrix [3]
- Documented usage of the Apertium-separable module (Wiki Username: Daedalus) [4]
- Minor contributions to various wiki pages [5]
Proposal
Brief of Deliverables
The aim is to create a general program that can construct bidirectional dictionaries for a language pair, given a pair of parallel corpora - i.e., the same content in two different languages. ReTraTos was developed for this very purpose, but it is not very user-friendly at the moment and works in multiple steps. The objective is to build a modern alternative that does this job for the user in a single step and is much easier to use.
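As a rough illustration of what "constructing a bidirectional dictionary" from alignment counts could involve, the sketch below keeps only word pairs that are each other's most frequent translation and prints them as Apertium bidix-style `<e><p><l>…</l><r>…</r></p></e>` entries. The mutual-best heuristic, the `min_count` threshold, and the function names are assumptions made for this example. They are not how ReTraTos works and not a settled design for the project. Real bidix entries also carry `<s n="…"/>` tag symbols and use `<b/>` for spaces in multiword entries, both of which are omitted here.

```python
from collections import Counter
from xml.sax.saxutils import escape


def mutual_best_pairs(counts, min_count=2):
    """Return (src, tgt) pairs that are each other's most frequent translation.

    counts maps (src, tgt) to the number of times the two were aligned.
    Pairs seen fewer than min_count times are ignored as likely noise.
    """
    best_tgt = {}  # src word -> (best tgt word, count)
    best_src = {}  # tgt word -> (best src word, count)
    for (s, t), n in counts.items():
        if n < min_count:
            continue
        if n > best_tgt.get(s, (None, 0))[1]:
            best_tgt[s] = (t, n)
        if n > best_src.get(t, (None, 0))[1]:
            best_src[t] = (s, n)
    return sorted(
        (s, t) for s, (t, _) in best_tgt.items() if best_src[t][0] == s
    )


def bidix_entry(src, tgt):
    """Format one pair as a bare bidix-style entry (no tag symbols)."""
    return f"<e><p><l>{escape(src)}</l><r>{escape(tgt)}</r></p></e>"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    counts = Counter({
        ("house", "casa"): 3,
        ("house", "la"): 1,
        ("the", "la"): 2,
        ("the", "casa"): 1,
    })
    for s, t in mutual_best_pairs(counts):
        print(bidix_entry(s, t))
```

Requiring agreement in both directions trades recall for precision, which seems a reasonable default when the output is meant to be reviewed by hand before it goes into a language pair.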
Mentors/Experienced members in Contact
Daniel Swanson
Previously worked with Jonathan Washington and Tino Didriksen in Google Code-In 2019
Why should Google and Apertium sponsor it?
The internet can serve as a huge repository of translated corpora. A tool that can build and update dictionaries using parallel corpora can greatly accelerate the development of new language pairs and improve the capabilities of existing ones. It largely eliminates the need to develop bidirectional dictionaries manually.
Work Plan
Phase | Dates | Description of Work | Deliverable |
---|---|---|---|
Community bonding period: Familiarizing myself with the Apertium Environment | May 1st - May 26th | Read carefully through all Apertium docs related to language pair development | |
Yet to Be Completed | Lorem Ipsum | | |
Non-Summer of Code Plans
My final exams for the current semester will be completed by the first week of May 2024, and I have no plans other than GSoC for the summer of 2024. I can devote 20-30 hours per week to this project, or more as required.