User:Daedalus/GSoC2024Proposal


Contact Information

Name: Chaitanya Gambali

Location: India

University: Indian Institute of Technology (BHU) Varanasi

Email address: gschaitanya2003@gmail.com

IRC: daedealus03 (OFTC)
(previously daedalus in Google Code-In 2019)
@daedalus2003@matrix.org

Github: gs-chaitanya

LinkedIn: gs-chaitanya

Timezone: UTC+5:30

Background

I am currently pursuing a Bachelor of Technology degree in Electronics and Communication Engineering at the Indian Institute of Technology (BHU) Varanasi. I love learning new languages, and I have a keen interest in Natural Language Processing and linguistics. I previously contributed to Apertium during Google Code-In 2019, where I was one of the six finalists selected from Apertium (under the username Daedalus).

Skills

  • Languages: Python, C/C++, JavaScript, bash
  • Frameworks: NumPy, Pandas, PyTorch, TensorFlow, etc.

Why Am I Interested in Apertium

In a world of machine-learning-based translation approaches, Apertium is a breath of fresh air and ingenuity. A rule-based approach to translation is very interesting and more interpretable than the data-hungry black boxes that modern machine-learning systems are. Apertium's open-source nature, combined with its support for low-resource languages, makes it a promising alternative to data-driven translation methods that still fail to support many languages.

Which of the published tasks am I interested in? What do I plan to do?

I am interested in the task "Dictionary Induction from Parallel Corpora". The aim is to build a single program that constructs bidirectional dictionaries for a language pair from a pair of parallel corpora, i.e., the same content in two different languages.

Coding Challenge

The coding challenge outlined here required me to develop a script that "reads two parallel corpora, applies the appropriate monolingual taggers and some word-aligner, and then prints a list of paired words." I completed that task, and the source code can be accessed here. I built a Python script that applies the Apertium monolingual taggers to the source and target corpora and then uses eflomal for word alignment. I also experimented with other word aligners such as efmaral and fast_align. The script was reviewed and approved by Daniel Swanson (@dangswan).
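
To illustrate the last step of that pipeline, here is a minimal sketch (not the submitted script) of turning word-alignment links into a list of paired words. It assumes the aligner writes one line per sentence pair in the `i-j` link format that fast_align produces, and that each token is already a tagged lemma such as `house<n>`; the function names and the sample sentence are hypothetical.

```python
from typing import List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse one line of 'i-j' links, e.g. '0-0 1-2 2-1'."""
    links = []
    for item in line.split():
        src_idx, tgt_idx = item.split("-")
        links.append((int(src_idx), int(tgt_idx)))
    return links


def paired_words(src_tokens: List[str],
                 tgt_tokens: List[str],
                 alignment_line: str) -> List[Tuple[str, str]]:
    """Return (source token, target token) pairs for every alignment link."""
    pairs = []
    for i, j in parse_alignment(alignment_line):
        # Skip links that point outside the sentence (defensive check).
        if i < len(src_tokens) and j < len(tgt_tokens):
            pairs.append((src_tokens[i], tgt_tokens[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical tagged sentence pair (English / Spanish).
    src = ["the<det>", "white<adj>", "house<n>"]
    tgt = ["la<det>", "casa<n>", "blanco<adj>"]
    for s, t in paired_words(src, tgt, "0-0 1-2 2-1"):
        print(f"{s}\t{t}")
```

Keeping the alignment parsing separate from the pairing makes it easier to swap in a different aligner later, as long as its output can be converted to the same link format.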

Previous Contributions

  • Worked on the Apertium-English Dictionary [1]
  • Disambiguated 500 tokens of text in Apertium-Eng [2]
  • Worked on logo design for UD-Annotatrix [3]
  • Documented usage of the Apertium-separable module (Wiki Username: Daedalus) [4]
  • Minor contributions to various wiki pages [5]


Proposal

Brief of Deliverables

The aim is to create a general program that can construct bidirectional dictionaries for a language pair, given a pair of parallel corpora - i.e., the same content in two different languages. ReTraTos was developed for this very purpose, but it is not very user-friendly at the moment and works in multiple steps. The objective is to build a modern alternative that does this job for the user in a single step and is much easier to use.

Mentors/Experienced members in Contact

Daniel Swanson
Previously worked with Jonathan Washington and Tino Didriksen in Google Code-In 2019

Why should Google and Apertium sponsor it?

The internet can serve as a huge repository of translated corpora. A tool that can build and update dictionaries using parallel corpora can greatly accelerate the development of language pairs and improve the capabilities of existing language pairs. It eliminates the need for manually developing bidirectional dictionaries to a great extent.

Work Plan

Timeline

Community bonding period (May 1st - May 26th)
  • Description of work: Read carefully through all Apertium documentation related to language pair development.
  • Deliverable: Familiarity with the Apertium environment.

Week 1 to Week 3 (May 27th - June 16th)
  • Description of work: Add the POS tagger and the word aligner to the script. Experiment with various word aligners to see which gives the best results, starting with GIZA++. Other candidates include efmaral, eflomal, fast_align, etc.
  • Deliverable: A script that produces an aligned and tagged version of the parallel corpora.

Week 4 to Week 6 (June 17th - July 7th)
  • Description of work: Extend functionality to perform dictionary induction. Essentially, rewrite the Retratos_lex.pl script and integrate it with the previous script.
  • Deliverable: A complete script that performs dictionary induction in one go.

Week 7 to Week 9 (July 8th - July 28th)
  • Description of work: Perform comprehensive testing of the script. Convert it into an executable binary if required.
  • Deliverable: The final script, and the binary if required.

Week 10 to Week 12 (July 29th - August 18th)
  • Description of work: Buffer period. Incorporate any further changes if required.
  • Deliverable: Final script.
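
As a rough illustration of the dictionary-induction step planned for Weeks 4 to 6, the sketch below counts aligned word pairs and keeps those seen at least `min_count` times, printing them as Apertium-style bilingual dictionary entries. This is a simple frequency threshold chosen for illustration, not the method used by Retratos_lex.pl; the function names, the `min_count` parameter, the `lemma<tag>` token shape, and the choice to keep only the first tag are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from xml.sax.saxutils import escape


def split_lemma_tags(token: str) -> Tuple[str, List[str]]:
    """Split 'house<n><sg>' into ('house', ['n', 'sg'])."""
    lemma, _, rest = token.partition("<")
    tags = [t for t in rest.rstrip(">").split("><") if t]
    return lemma, tags


def to_bidix_entry(src: str, tgt: str) -> str:
    """Format one pair as an <e> entry, keeping only the first (POS) tag."""
    def side(token: str) -> str:
        lemma, tags = split_lemma_tags(token)
        return escape(lemma) + "".join(f'<s n="{t}"/>' for t in tags[:1])
    return f"<e><p><l>{side(src)}</l><r>{side(tgt)}</r></p></e>"


def induce_entries(pairs: Iterable[Tuple[str, str]],
                   min_count: int = 2) -> List[str]:
    """Keep pairs aligned at least min_count times, most frequent first."""
    counts = Counter(pairs)
    kept = [(pair, n) for pair, n in counts.items() if n >= min_count]
    # Sort by descending frequency, then alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [to_bidix_entry(s, t) for (s, t), _ in kept]


if __name__ == "__main__":
    # Hypothetical aligned pairs collected over a whole corpus.
    aligned = [("house<n>", "casa<n>")] * 3 + [("house<n>", "blanco<adj>")]
    for entry in induce_entries(aligned, min_count=2):
        print(entry)
```

A count threshold like this filters out one-off alignment errors but also drops rare words, so the real tool would likely need a more careful filter and handling for multiword lemmas.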

Non-Summer of Code Plans

My final exams for the current semester will be completed by the first week of May 2024, and I have no plans other than GSoC for the summer of 2024. I can devote 20-30 hours per week to this project, or more if required.