Chebrolutejasvi/GSOC 2020 proposal: Hindi-Telugu


Contact Information

Name: Tejasvi Chebrolu (chebrolutejasvi on the Wiki)

Location: Hyderabad, India

University: International Institute of Information Technology, Hyderabad

E-Mail: tejasvi.chebrolu@research.iiit.ac.in

IRC: chebrolutejasvi

Timezone: UTC+5:30 (IST)

Github: https://github.com/tejasvicsr1


Why is it that I am interested in Apertium?

Apertium is an open-source organisation dedicated to machine translation. I grew up in different places and was exposed to different languages as a child. This left me fascinated with language translation, and I want to contribute to making communication easier for everyone through machine translation.

Apertium focuses on low-resource languages. India has 22 officially recognised languages and many more unrecognised ones, yet while I was growing up there was no good-quality machine translation service for them. There are hardly any resources for most Indian languages, and the work Apertium does helps to counter this.

Apertium is a rule-based system. As a student of Computational Linguistics, I take multiple linguistics courses. As a student and an undergraduate researcher, I am interested in rule-based systems, and Apertium provides an excellent platform to further that interest.


Which of the published tasks am I interested in? What do I plan to do?

I am going to work on “Adopt an unreleased language pair: Hindi - Telugu”. I want to get the pair released in both directions, with an expected word error rate (WER) of around 25%. This means updating both monolingual dictionaries along with the bilingual dictionary. At the same time, I will write the transfer rules needed to release the pair.
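Since the WER target is the main quality measure in this proposal, here is a minimal sketch of how it could be computed: the word-level edit distance between a reference translation and the system output, divided by the number of words in the reference. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this is not Apertium's own evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper: tokens are split on whitespace, which is a
    simplification for both Hindi and Telugu text.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, one wrong word in a four-word reference sentence gives a WER of 25%, which is the level targeted for the final evaluation.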


Why should Google and Apertium sponsor it?

As of 2019, Hindi has 341 million speakers while Telugu has 82 million speakers. In spite of these huge numbers, there are very few resources that can translate effectively between the two languages. Creating basic transfer rules between Hindi (an Indo-Aryan language) and Telugu (a Dravidian language) would further the development of translation systems between these two language families.

Dakhini, a language considered to be a mixture of Hindi and Telugu, is spoken in densely populated places like Telangana. A good-quality translator would help further research on languages like Dakhini (which have very few speakers), because Apertium would make conversion between Hindi and Telugu easy.

Apertium has very few pairs in which both languages are Indian. There is only one such pair in trunk, none in staging, none in nursery, and six in the incubator. Creating a pair that consists of a Dravidian language and an Indo-Aryan language would help other languages as well, because of the rules that would be created.


Who will benefit from this?

Creating a good translator for Hindi - Telugu would have a large impact on society. It would make official documents easier to produce in both languages (Hindi is an official language of the Union government, while Telugu is not). India has a huge population, and the translator would make communication easier. It would also provide a good online bilingual dictionary. Finally, it would help translation between Dravidian and Indo-Aryan languages, which is currently infrequent and inaccurate.


Work Plan

Current Status of the Pair

There is no pre-existing Hindi-Telugu pair in Apertium right now.


Hindi Monolingual Dictionary:

1) The monolingual dictionary already contains a decent number of words, along with paradigms.

2) A constraint grammar exists.

Telugu Monolingual Dictionary:

1) There are hardly any words in the monolingual dictionary. Only the letters of the alphabet have been added.

2) There is no constraint grammar.

Resources to enhance dictionaries

Hindi - Telugu Dictionary (~30,000 words)[1]

Hindi Monolingual Corpus (~36,000 sentences)[2]

Telugu Monolingual Corpus (~32,000 sentences)[3]

Detailed Plan

PHASE DURATION TASKS
Post-Application Period April 1st - May 3rd
  • Bootstrap the hin-tel pair.
  • Add basic words to the Telugu monolingual dictionary.
  • Complete the rest of the coding challenge.
  • Get familiar with Apertium tools.
  • Find more resources.
  • Read about HFST.
Community Bonding Period May 4th - May 31st
  • Read the Apertium Documentation entirely.
  • Discuss with mentors the broad plan and iron out exact details.
  • Start creating transfer rules.
  • Make frequency lists.
Week One June 1st - June 7th
  • Add nouns and verbs to the Telugu monolingual dictionary.
  • Start working on constraint grammar.
  • Define paradigms for Telugu.


Week Two June 8th - June 14th
  • Add pronouns and adjectives to the dictionary.
  • Add conjunctions, prepositions, adverbs etc.
  • Create transfer rules.
Week Three June 15th - June 21st
  • Add to the bilingual dictionary.
  • Start creating disambiguation rules.
  • Add to the Telugu monolingual dictionary.


Week Four June 22nd - June 28th
  • Fix the Hindi monolingual dictionary for any errors.
  • Add words to the Hindi dictionary.
  • Add to the bilingual dictionary.
DELIVERABLE 1:
  • Reach 3500 words in the bilingual dictionary.
  • Reach 4000 words in the Telugu monolingual dictionary.


Week Five June 29th - July 5th
  • Add to the bilingual dictionary.
  • Create transfer rules.
  • Add disambiguation rules.
Week Six July 6th - July 12th
  • Add compound words.
  • Add disambiguation rules.
  • Add to the constraint grammar.
Week Seven July 13th - July 19th
  • Add multi-words to the bilingual dictionary.
  • Add more transfer rules.
Week Eight July 20th - July 26th
  • Test on data present in books.
  • Add transfer rules.
  • Add disambiguation rules.
DELIVERABLE 2:
  • Reach 7500 words in the bilingual dictionary.
  • Complete 90% of transfer rules.
  • Reach a WER of around 40%, so that the translation between the languages is understandable.
Week Nine July 27th - August 2nd
  • Expand the bilingual dictionary.
  • Create more disambiguation rules.
Week Ten August 3rd - August 9th
  • Add more transfer rules.
  • Finish the constraint grammar.
Week Eleven August 10th - August 16th
  • Test the system with natural language examples.
  • Update the rules based on the results.
Week Twelve August 17th - August 23rd
  • Testvoc the hin-tel pair.
  • Add more rules, if needed.
Week Thirteen August 24th - August 30th
  • Add documentation.
  • Evaluation of results.
  • Fix any bugs, if found.
FINAL EVALUATION OBJECTIVES:
  • Achieve a WER of around 25%.
  • Reach at least 10,000 words in the bilingual dictionary.
  • If there is time left over, convert the bilingual dictionary into IPA notation for easy use in the future.
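
The "Make frequency lists" task in the plan above could be sketched roughly as follows: count how often each word form occurs in a monolingual corpus, so that the most frequent forms are added to the dictionaries first. The function name `frequency_list` and the whitespace tokenisation are assumptions made for illustration, not part of any Apertium tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_list(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (word, count) pairs, most frequent first.

    Hypothetical helper: words are split on whitespace only, so
    punctuation attached to a word is counted as part of that word.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return counts.most_common()
```

In practice `lines` could be an open file handle over one of the monolingual corpora listed under the resources, and the top entries of the result would guide which words to add first.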


Coding Challenge

Install Apertium: Link to screenshot.[4]

Completed the HOWTO.

Completed the MT course.

Since there were no words at all in the Telugu monolingual dictionary, no work could be done on the story so far. It will be completed in the post-application period.


Skills

I am a first-year undergraduate student at the International Institute of Information Technology, Hyderabad, where I am studying Computational Linguistics. The course requires a strong understanding of Computer Science along with Linguistics. I have done courses in Linguistics, Semantics, Data Structures and Algorithms, and Software Systems.

I am proficient in several languages and technologies, including C++, Python, XML, Bash scripting, and HTML. I have created websites and web apps, as well as simple games, for my courses. As part of the Linguistics courses, I had to create transfer rules for an English-Hindi pair. I have also built Brill’s POS tagger for languages like Hindi and Telugu. Currently, I am working on a system to help solve Arithmetic Word Problems in Hindi.

As mentioned before, I am fluent in multiple languages (English, Hindi, Telugu, Odiya, Gujarati). I also have a decent understanding of French.

Since most of my projects were part of the course curriculum, they are not available on my GitHub profile, but I can send the files if needed.

Non-Summer-Of-Code plans for the Summer

My college summer vacation falls during the GSoC period, so I have no other commitments and can spend around 40 hours a week on the project. Because of the COVID-19 lockdown, our end-semester examinations could be postponed, so I have planned a reduced workload for the first two weeks. At this point a postponement seems unlikely because classes are being held online. As a precaution, however, I have kept the workload heavier before and after that period to ensure there are no hiccups in the project.