Apertium-kan-mar

From Apertium
Revision as of 18:06, 12 August 2018 by Invo98 (talk | contribs)

Description

The goal of this project was to develop a rule-based translation system for the Kannada-Marathi language pair in Apertium.

Kannada

The Kannada monolingual dictionary was developed from scratch, as Apertium had no pre-existing work on Kannada. The Kannada-Marathi bilingual dictionary was also developed from scratch, with the help of the existing Marathi monolingual dictionary.



My commits can be accessed at the following link: commits. The GitHub repositories that my GSoC 2018 project depends on are listed below. My GitHub account name is MissingBytes.

GitHub repositories
Kannada monolingual dictionary
Kannada-Marathi bilingual dictionary
Marathi monolingual dictionary

I built the Kannada monolingual dictionary and the Kannada-Marathi bilingual dictionary from scratch. I have not made any changes to the Marathi monolingual dictionary.

The work I did can be downloaded here in tar.gz format.

The work I did can be downloaded here in .zip format.

Summary

A finite-state transducer (FST) for Kannada and a bilingual dictionary for Kannada-Marathi were developed in this project. A morphological analyzer decomposes an inflected word into its base form and its grammatical information. Generation is the exact reverse of analysis, i.e. obtaining the inflected word from its base form and grammatical information. Morphological analysis and generation are essential parts of rule-based machine translation, an application of Natural Language Processing (NLP).
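The relationship between analysis and generation can be illustrated with a minimal sketch. It uses a toy lookup table in place of a real finite-state transducer, and the English entries and tag names are purely illustrative assumptions, not taken from the Kannada dictionary developed in this project. The output strings follow the Apertium stream format, where an unknown word is marked with `*`.

```python
# Toy lexicon: surface form -> (lemma, tags).
# These entries are hypothetical examples, not the project's Kannada data.
LEXICON = {
    "cat": ("cat", ("n", "sg")),
    "cats": ("cat", ("n", "pl")),
    "walked": ("walk", ("vblex", "past")),
}

# Generation is the inverse mapping: (lemma, tags) -> surface form.
GENERATOR = {analysis: surface for surface, analysis in LEXICON.items()}


def analyse(surface):
    """Return (lemma, tags) for a surface form, or None if it is unknown."""
    return LEXICON.get(surface)


def generate(lemma, tags):
    """Return the surface form for a lemma and tag sequence, or None."""
    return GENERATOR.get((lemma, tuple(tags)))


def format_analysis(surface):
    """Render one lexical unit in Apertium stream format."""
    analysis = analyse(surface)
    if analysis is None:
        # Unknown words are marked with a leading asterisk.
        return f"^{surface}/*{surface}$"
    lemma, tags = analysis
    return f"^{surface}/{lemma}" + "".join(f"<{tag}>" for tag in tags) + "$"


if __name__ == "__main__":
    print(format_analysis("cats"))
    print(format_analysis("dog"))
    print(generate("walk", ["vblex", "past"]))
```

A real analyser compiles paradigms into a transducer rather than listing every surface form, which matters for an agglutinative language where the number of inflected forms per lemma is very large.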

Coverage

Coverage is the percentage of words in a given text that the translation system can analyse: for the monolingual dictionary (MonoDix), words that receive a part-of-speech tag; for the bilingual dictionary (BiDix), words that are mapped to a translation. A translation system relies on its dictionaries for morphological analysis. Morphological analysis of Kannada was difficult because the language is highly agglutinative and subject to morphological constraints. Wikimedia dumps were used to rank words by frequency and to calculate coverage.

Corpus Coverage
WikiMedia Corpus 85.70%
cuni 78.94%
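The coverage figure described above can be sketched as a short calculation over analyser output. This is a naive, token-based sketch and an assumption about the method, since the page does not state which script produced the numbers in the table. It assumes the input is text in Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word has an analysis starting with `*`. The function names are hypothetical, and escaped characters inside lexical units are not handled.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


def unknown_by_frequency(analysed_text):
    """Return unknown surface forms sorted from most to least frequent."""
    counts = Counter(
        surface
        for surface, analyses in LEXICAL_UNIT.findall(analysed_text)
        if analyses.startswith("*")
    )
    return counts.most_common()


if __name__ == "__main__":
    sample = (
        "^cats/cat<n><pl>$ ^zzz/*zzz$ "
        "^walked/walk<vblex><past>$ ^zzz/*zzz$"
    )
    print(f"{coverage(sample):.2f}%")
    print(unknown_by_frequency(sample))
```

Sorting unknown words by frequency shows which missing entries would raise coverage the most if added to the dictionary.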