User:Chebrolutejasvi/GSoC2020Proposal
Contact Information
Name: Tejasvi Chebrolu (chebrolutejasvi on the wiki)
Location: Hyderabad, India
University: International Institute of Information Technology
E-Mail: tejasvi.chebrolu@research.iiit.ac.in
IRC: chebrolutejasvi
Time zone: IST (UTC+5:30)
Github: https://github.com/tejasvicsr1
Why am I interested in Apertium?
Apertium is an open-source organisation dedicated to machine translation. I grew up in several different places and was exposed to different languages as a child. This left me fascinated with language translation, and I want to help make communication easier for everyone through machine translation.
Apertium focuses on low-resource languages. India has 22 officially recognised languages and many more unrecognised ones, yet while I was growing up there was no good-quality machine translation service for them. There are hardly any resources for most Indian languages, and the work Apertium does helps to counter this.
Apertium is a rule-based system. As a student of Computational Linguistics, I take multiple linguistics courses. As a student and an undergraduate researcher, I am interested in rule-based systems, and Apertium provides an excellent platform to further that interest.
Which of the published tasks am I interested in? What do I plan to do?
I am going to work on “Adopt an unreleased language pair: Hindi - Telugu”. I want to get the pair released in both directions, and I expect the word error rate (WER) to be around 25%. This will mean updating both monolingual dictionaries along with the bilingual dictionary. At the same time, I will write the transfer rules needed to release the pair.
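To make the 25% WER target concrete, here is a minimal sketch of how word error rate can be computed: the word-level edit distance between a reference translation and the system output, divided by the number of reference words. The function name `word_error_rate` and the placeholder sentences are illustrative assumptions, and this is not the evaluation tooling that would be used for the actual release.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only (hypothetical name), not an Apertium tool.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Placeholder tokens stand in for a post-edited reference and MT output.
    reference = "w1 w2 w3 w4"
    hypothesis = "w1 x2 w3 w4"
    print(f"WER = {word_error_rate(reference, hypothesis):.0%}")
```

Under this definition, a WER of 25% means that roughly one word in four of the reference has to be substituted, inserted, or deleted to turn the system output into the reference.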
Why should Google and Apertium sponsor it?
As of 2019, Hindi has 341 million speakers while Telugu has 82 million speakers. In spite of these huge numbers, very few resources can translate effectively between the two languages. Creating some basic transfer rules between Hindi (an Indo-Aryan language) and Telugu (a Dravidian language) would further the development of translation systems between these two language families.
Densely populated regions such as Telangana are home to speakers of Dakhini, a language considered to be a mixture of Hindi and Telugu. A good-quality translator would also support research on languages like Dakhini (which has very few speakers), because Apertium would make conversion between Hindi and Telugu easy.
Apertium has very few language pairs in which both languages are Indian. There is only one such pair in trunk, none in staging, none in the nursery, and six in the incubator. A pair consisting of a Dravidian language and an Indo-Aryan language would help other pairs as well, thanks to the rules that would be created for it.
Who will benefit from this?
A good Hindi - Telugu translator would have a large impact on society. It would help in translating official documents (Telugu is not an official language but Hindi is). India has a huge population, and the translator would make communication easier. It would also provide a good online bilingual dictionary. Finally, it would improve translation between Dravidian and Indo-Aryan languages, which is currently infrequent and inaccurate.
Work Plan
Current Status of the Pair
There is no pre-existing Hindi-Telugu pair in Apertium right now.
Hindi Monolingual Dictionary:
1) The monolingual dictionary already contains a decent number of words, along with paradigms.
2) Constraint grammar exists.
Telugu Monolingual Dictionary:
1) There are hardly any words in the monolingual dictionary. Only the letters of the alphabet have been added.
2) There is no Constraint grammar.
Resources to enhance dictionaries
Hindi - Telugu Dictionary (~30,000 words)[1]
Hindi Monolingual Corpus (~36,000 sentences)[2]
Telugu Monolingual Corpus (~32,000 sentences)[3]
Detailed Plan
PHASE | DURATION | TASKS |
---|---|---|
Post-Application Period | April 1st - May 3rd | |
Community Bonding Period | May 4th - May 31st | |
Week One | June 1st - June 7th | |
Week Two | June 8th - June 14th | |
Week Three | June 15th - June 21st | |
Week Four | June 22nd - June 28th | |
DELIVERABLE 1: | | |
Week Five | June 29th - July 5th | |
Week Six | July 6th - July 12th | |
Week Seven | July 13th - July 19th | |
Week Eight | July 20th - July 26th | |
DELIVERABLE 2: | | |
Week Nine | July 27th - August 2nd | |
Week Ten | August 3rd - August 9th | |
Week Eleven | August 10th - August 16th | |
Week Twelve | August 17th - August 23rd | |
Week Thirteen | August 24th - August 30th | |
FINAL EVALUATION OBJECTIVES: | | |
Coding Challenge
Installed Apertium: link to screenshot.[4]
Completed the HOWTO.
Completed the MT course.
Since there are currently no words at all in the Telugu monolingual dictionary, no work could be done on the story yet. It will be completed in the post-application period.
Skills
I am a first-year undergraduate student at the International Institute of Information Technology, Hyderabad, where I am studying Computational Linguistics. The course requires a strong understanding of Computer Science along with Linguistics. I have taken courses in Linguistics, Semantics, Data Structures and Algorithms, and Software Systems.
I am proficient in a range of languages and technologies, including C++, Python, XML, Bash scripting, and HTML. I have created websites and web apps, as well as simple games, for my courses. As part of the Linguistics courses, I had to create transfer rules for an English-Hindi pair. I have also built Brill’s POS tagger for languages like Hindi and Telugu. Currently, I am working on a system to help solve Arithmetic Word Problems in Hindi.
As mentioned before, I am fluent in multiple languages (English, Hindi, Telugu, Odiya, Gujarati). I also have a decent understanding of French.
Since most of my projects were part of the course curriculum, they are not available on my GitHub profile, but I can send the files if needed.
Non-Summer-Of-Code plans for the Summer
My college summer vacation coincides with the GSoC period, so I have no other commitments and can spend around 40 hours a week on the project. Because of the COVID-19 lockdown, our end-semester examinations could be postponed, so I have planned a reduced workload for the first two weeks. At this point a postponement seems unlikely because classes are being held online. As a precaution, however, I have kept the workload heavier before and after those two weeks to ensure there are no hiccups in the project.