User:Sambit/GSoC proposal 2017: Odia and English

From Apertium

Name

Sambit Mallick

Contact information

IRC nick : sambit

E-mail : sambit95@gmail.com / sambit.mallick@iitg.ac.in

SourceForge : sambit95

Location : India

Time Zone : UTC/GMT +5:30

Why am I interested in machine translation?

I've always wondered how Google Translate works. After finding Apertium through GSoC, I read many well-written articles on the wiki about how it works. From them I learned about the different uses of machine translation, the types of rule-based MT, and how they work. This is what developed my interest in MT.

Why is it that you are interested in Apertium?

Given how my interest in MT developed, I can't think of a better platform than Apertium to work on. It is open source and a shallow-transfer machine translation system. Apart from these, the mentors are very friendly and supportive.

Which of the published tasks am I interested in? What do I plan to do?

I'm interested in adopting an unreleased language pair (Odia-English). As there is no Odia-English pair in Apertium, I have to work on the Odia monodix first.

Reasons why Google and Apertium should sponsor it?

There is no reliable Odia-English translator available on the Internet.

How and who will it benefit in society?

Odia is the predominant language of the Indian state of Odisha and one of the many official languages of India. It is the sixth Indian language to be designated a Classical Language in India, on the basis of having a long literary history and not having borrowed extensively from other languages. Out of 40 million native speakers, many don't know English. As no reliable translation is available, even in Google Translate, things are difficult for people who don't understand English or struggle to learn it. English is the international language and is widely used everywhere, e.g. on social media and the Internet. Through Apertium these problems can be solved using machine translation. This will benefit a large community, and non-native speakers will get to know Odia literature.

Work Plan

Coding challenge

  • I've already installed the prerequisites for Ubuntu.
  • Bootstrapped a new language pair (odi.eng) with the existing eng monodix. // I have to redo this, as the language code for Oriya/Odia is "ori" [1]
  • Added some words to the odi monodix to work on the story.
  • Currently working on transfer rules and reading the PDFs to get more familiar with Apertium.
  • Getting some errors, but trying to resolve them by asking the mentors on IRC.

Thanks to Unhammer, TinoDidriksen and spectie, I've come this far!

[1] The language is called either Odia or Oriya, but its official language code is 'ori'. Hence the language pair will be ori-eng.

Post Application

I'll try to work on the story, but I can't give it much time because of end-semester exams at the end of April.

Community Bonding Period

  • Get to know the mentors and discuss the plans properly.
  • Learn how to build a very large monodix easily and effectively.

Notes

Odia and Hindi are similar but also have their differences. Most of the similarities lie in the grammar and vocabulary, and apertium-eng-hin already exists in the nursery, so I expect I can reuse some rules from there. To build the Odia/Oriya monodix I'll use http://wortschatz.uni-leipzig.de/en/download/ as a resource. As it's difficult to find a parallel corpus for Odia-English, I'll try to find short stories and articles from which I can extract bidix entries. For the time being I'm using http://learn101.org/oriya.php to learn the grammar, but I plan to read some books on it as well. Apart from these, I'm using Apertium wiki articles for the project.
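One way to decide which words to add to the monodix first is to rank the words of a monolingual corpus by frequency. The following is only a minimal sketch in Python: the function name `frequency_list` and the three sample sentences are made up for illustration, and it assumes the corpus has already been downloaded as plain text, one sentence per line.

```python
import re
from collections import Counter

# The Odia script occupies the Unicode block U+0B00-U+0B7F. A run of such
# characters (letters and vowel signs; Odia digits also match) is treated
# as one word here. This is a rough tokenizer, not a morphological one.
ODIA_WORD = re.compile(r"[\u0B00-\u0B7F]+")


def frequency_list(lines, top_n=10):
    """Return the top_n most frequent Odia-script tokens in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(ODIA_WORD.findall(line))
    return counts.most_common(top_n)


# Hypothetical sample standing in for a downloaded corpus file.
sample = [
    "ମୁଁ ଘରକୁ ଯାଉଛି",
    "ସେ ଘରକୁ ଆସିଲା",
    "ମୁଁ ବହି ପଢୁଛି",
]

for word, count in frequency_list(sample, top_n=3):
    print(word, count)
```

The counts are over surface forms, so inflected forms of the same lemma are counted separately; the list is only a guide to which lemmas and paradigms to add first.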

Week Plan

Week | Task | Comment
1 | Implementation of the Odia monodix. | As it's a new language pair, implementing the Odia monodix will take a lot of time.
2 | Continue to work on the monodix. |
3 | Continue to work on the monodix. |
4 | Continue to work on the monodix and start working on an Odia-English bilingual dictionary. |
Deliverable #1 | A monolingual dictionary containing at least 3000 words. [1] | Adding proper words to the Odia monodix is difficult, as I found while working on the post-application task, but I'll try my best to add more.
5 | Continue adding more words to the monolingual dictionary. Continue adding words to the bilingual dictionary. | NA
6 | Continue adding more words to the bilingual dictionary. | NA
7 | Implementation of disambiguation rules for Odia. | NA
8 | Implementation of transfer rules for Odia->English. | NA
Deliverable #2 | A monolingual dictionary containing at least 7000 words and around 10000 words in the bilingual dictionary. [1] | NA
9 | Complete the disambiguation and transfer rule implementation and the design of the constraint grammar. | NA
10 | testvoc | NA
11 | testvoc | NA
12 | Wrap up testvoc, clean up, evaluate results and complete the documentation. | NA
Deliverable #3 | Completion of the project. | NA

[1] As I have to implement the pair from scratch with few available resources, I can't estimate week by week the total word counts for the Oriya monodix and bidix. Also, it will be difficult to reach the 10000 mark if I can't find a parallel corpus for Odia-English.

Updates

There are articles on Wikipedia that are available in English, Hindi and Oriya! These will help in building both the monolingual and bilingual dictionaries.
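As a rough illustration of how such article pairs could feed the bilingual dictionary, the sketch below turns (Odia title, English title) pairs into bidix-style `<e>` entries. It makes several assumptions: the helper `bidix_entry` and the two title pairs are hypothetical, the pairs are assumed to have been collected already, titles are single words (multiword lemmas would need `<b/>` in place of spaces), and the `np` tag is only a guess that would have to be checked by hand for each entry.

```python
import xml.etree.ElementTree as ET


def bidix_entry(left, right, tag="np"):
    """Build one <e> element pairing an Odia lemma with an English lemma."""
    e = ET.Element("e")
    p = ET.SubElement(e, "p")
    for side, lemma in (("l", left), ("r", right)):
        node = ET.SubElement(p, side)
        node.text = lemma
        # Same part-of-speech symbol on both sides of the entry.
        ET.SubElement(node, "s", {"n": tag})
    return e


# Illustrative title pairs only; real pairs would come from the articles.
title_pairs = [("ଭାରତ", "India"), ("ଓଡ଼ିଶା", "Odisha")]

for ori, eng in title_pairs:
    print(ET.tostring(bidix_entry(ori, eng), encoding="unicode"))
```

Entries produced this way are candidates only and would still need manual review before going into the bidix.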

Skills & Qualifications

Currently, I'm a pre-final year student of Electronics and Communication Engineering at IIT Guwahati. Though I'm new to MT, I code and have an interest in Machine Learning. I am comfortable with C and Python. Besides, for this project I'm familiar with Linux, XML, Odia (my mother tongue) and English. I have no previous experience with open source, although I've spent quite a bit of time understanding Apertium.

My non-Summer-of-Code plans for the Summer

I have no plans for the summer other than GSoC and preparing for placements. Hence I'll be able to give 30+ hours a week to the project.