User:Agneet42/proposal

From Apertium
Revision as of 19:22, 2 April 2017 by Agneet42 (talk | contribs)


Contact Info

Name: Agneet Chatterjee

E-mail: agneet257@gmail.com

IRC: agneet42

Location: India

Timezone: UTC+05:30

Why is it you are interested in machine translation?

"Because language plays such a fundamental part in connecting each of us as thinking creatures with the world around us, the subtle nuances of language (which are different even in similar tongues, say the Latin-derived Spanish and Portuguese) actually shape how we think about the world. Learning something of how somebody else speaks from a foreign country actually helps you to understand their mindset a little."

I am interested in machine translation for two main reasons. First, I believe that in this age of information exchange, one of the biggest challenges is sharing and understanding knowledge across different languages. Machine translation addresses exactly this challenge, and its unifying purpose is what draws me to it. Second, I have a deep-rooted interest, coupled with experience, in the field of natural language processing, and I hope to make a difference in machine translation.

Why is it that you are interested in the Apertium project?

Apertium is a free/open-source machine translation platform, which means that developers from all over the world can join and work on new language pairs to improve translation. Apertium is built on Unix "pipelines", which makes diagnosis and debugging quick and lets me insert additional modules between existing ones, for example using HFST (Helsinki Finite-State Technology) for morphological analysis. Furthermore, Apertium uses rule-based machine translation (RBMT), which requires no bilingual texts. This makes it possible to create translation systems for languages that have no texts in common, or even no digitized data whatsoever. RBMT is also largely domain independent: rules are usually written in a domain-independent manner, so the vast majority of them will "just work" in every domain, and only a few specific cases per domain may need rules of their own.

Which of the published tasks are you interested in?

Adopting the Hindi<->Bengali language pair.

Why should Google and Apertium sponsor it?

Firstly, Hindi and Bengali are the 4th and 7th most spoken languages in the world, with ~295 and ~200 million speakers respectively. Moreover, the speakers of these languages are spread all across the globe. A Hindi-Bengali translator will not only aid these speakers but also facilitate business between the regions where the two languages are spoken.

Currently, there is no single go-to platform for machine translation between these two languages. The only option is Google Translate, which has its own limitations:

  1. It is not available offline and is therefore less accessible.
  2. It is not open source, so not everybody can contribute.

Apertium avoids both of these issues, which makes it a suitable development ground for this (or any other) language pair. Furthermore, a Hindi-Bengali translator will make it easier to build pairs for languages similar to Bengali, such as Hindi-Assamese and Hindi-Oriya.

How and who will it benefit in society?

The monolingual dictionaries can be used as stemmers for any Hindi or Bengali search engine. They could also be used as spell checkers. Their effect on other applications, such as anaphora resolution and question answering, can also be explored. The Hindi-Bengali translator will also help in the accurate translation of the manuscripts that are widely present in both languages, making the culture of each community available to the other.

Literature Review

Hindi and Bengali both descend from the Old Indo-Aryan languages and are similar in structure. They have a lot in common, even though they differ in how words are used and where they are positioned in corresponding sentences. Hindi pronouns can be broadly categorized into seven types: Personal, Demonstrative, Indefinite, Relative-Correlative, Possessive, Interrogative and Reflexive. Some Hindi pronouns serve as Personal, Demonstrative and Relative-Correlative pronouns alike, whereas Bengali has a different pronoun for each of these uses. As the list of such Hindi pronouns is small and their uses are limited, it is possible to differentiate each use and find its Bengali translation using a set of linguistic rules.

Current scenario

Presently, there exist a Bengali and a Hindi monolingual dictionary and a Bengali-Hindi bidix. The coverage of the Bengali dictionary needs to be expanded. Its verb section has difficulty treating multi-word verbs, and negative forms are not well recognised. This is because several non-finite forms of the verb, such as infinitives and participles, demand a negative particle before the verb, while finite forms require the particle to follow the verb, in some cases as an enclitic. The bilingual dictionary has very low coverage and misses many common and important nouns.
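The negative-particle placement described above can be illustrated with a minimal Python sketch. It is not part of the existing dictionaries: the function name `place_negation`, the form labels and the romanized particle "na" are assumptions made for the example, and the enclitic case is deliberately left out.

```python
# Hypothetical form labels for non-finite verbs; a real analyser
# would supply its own tag names.
NON_FINITE = {"infinitive", "participle"}


def place_negation(verb, form, particle="na"):
    """Order the negative particle relative to the verb according to its form."""
    if form in NON_FINITE:
        # Non-finite forms: the particle comes before the verb.
        return f"{particle} {verb}"
    # Finite forms: the particle follows the verb (enclitic case not handled).
    return f"{verb} {particle}"
```

With romanized placeholder forms, `place_negation("korte", "infinitive")` puts the particle first, while `place_negation("kori", "finite")` puts it last. An analyser for the negative forms would need to recognise both orders.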

The goal of this project is to expand both the monolingual dictionaries and the bilingual dictionary, and to add selection (structural and lexical) rules and constraint grammar rules as the need for them arises.

Solution/s

  1. Hindi nominal suffixes such as को (ko), का (kA), से (se) and पे (pe) are also used with Hindi pronouns. Different uses of these Hindi pronominal suffixes have different Bengali translations. The most frequent corresponding Bengali pronominal suffixes are (ke), (ra), (theke), (e), etc. The suffixes that have more than one translation can be disambiguated with the help of rules.
  2. Add a constraint grammar (CG) that deals effectively with tenses/prepositions/verbs. The CG should also handle other part-of-speech ambiguities.
  3. Train the tagger to resolve inflection issues.
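The suffix disambiguation in item 1 could be sketched in Python as a default table plus context-specific overrides. This is only an illustration under stated assumptions: the defaults pair the Hindi and Bengali suffixes in the order they are listed above, and the `instrumental` context tag with its `diye` translation is a hypothetical example of a rule, not one taken from the existing language pair.

```python
# Default Hindi -> Bengali suffix pairs (romanized), assumed to follow
# the order in which the two lists are given in the text.
DEFAULT_SUFFIX = {
    "ko": "ke",
    "kA": "ra",
    "se": "theke",
    "pe": "e",
}

# Hypothetical context rules: (hindi_suffix, context_tag) -> bengali_suffix.
# The "instrumental" tag and the "diye" choice are illustrative only.
OVERRIDES = {
    ("se", "instrumental"): "diye",
}


def translate_suffix(hindi_suffix, context=None):
    """Return a Bengali suffix, preferring a context rule over the default."""
    if (hindi_suffix, context) in OVERRIDES:
        return OVERRIDES[(hindi_suffix, context)]
    # Unknown suffixes raise KeyError so that gaps in the table stay visible.
    return DEFAULT_SUFFIX[hindi_suffix]
```

Here `translate_suffix("se")` falls back to the most frequent translation, while `translate_suffix("se", "instrumental")` picks the override. In Apertium itself such rules would more naturally live in the transfer or lexical selection files. The sketch only shows the lookup order.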