Apertium-kan-mar
Description
The goal of this project was to develop a rule-based machine translation system for the Kannada-Marathi language pair in Apertium.
Work
The Kannada monolingual dictionary was developed from scratch, as Apertium had no pre-existing work on Kannada. The Kannada-Marathi bilingual dictionary was also developed from scratch, with the help of the existing Marathi monolingual dictionary.
My commits can be accessed at the following link: commits or directly in the Apertium repository, here.
The table below lists the GitHub repositories that my GSoC 2018 project depends on.
Apertium GitHub repositories |
---|
Kannada monolingual dictionary |
Kannada-Marathi bilingual dictionary |
Marathi monolingual dictionary |
The linked repositories also describe the installation procedure.
My GitHub account name is MissingBytes.
I developed the Kannada monolingual dictionary and the Kannada-Marathi bilingual dictionary from scratch. I have not made any changes to the Marathi monolingual dictionary.
The work I did can be downloaded here in tar.gz format or here in .zip format.
Summary
A finite-state transducer (FST) for Kannada and a bilingual dictionary for Kannada-Marathi were developed in this project. A morphological analyser is a tool that decomposes an inflected word into its base form and its grammatical information. Generation is the exact reverse of analysis, i.e. obtaining the inflected word from its base form and grammatical information. Morphological analysis and generation are essential parts of rule-based machine translation, an application of Natural Language Processing (NLP).
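The relationship between analysis and generation can be illustrated with a toy lookup table. This is only a sketch: the English entries and the names `ANALYSES`, `analyse` and `generate` are hypothetical and are not taken from the Kannada dictionary, which is compiled into a transducer rather than written as a Python dictionary. The `*` and `#` prefixes follow the Apertium convention for unknown words and for forms that cannot be generated.

```python
# Hypothetical toy lexicon: surface form -> lexical form (lemma plus tags).
# A real analyser is a compiled finite-state transducer, not a dict.
ANALYSES = {
    "cat": "cat<n><sg>",
    "cats": "cat<n><pl>",
    "walked": "walk<vblex><past>",
}

# Generation is the inverse mapping: lexical form -> surface form.
GENERATIONS = {lexical: surface for surface, lexical in ANALYSES.items()}


def analyse(surface):
    """Return the lexical form, or '*' + surface if the word is unknown."""
    return ANALYSES.get(surface, "*" + surface)


def generate(lexical):
    """Return the surface form, or '#' + lexical if it cannot be generated."""
    return GENERATIONS.get(lexical, "#" + lexical)


if __name__ == "__main__":
    for word in ["cats", "walked", "blorp"]:
        print(word, "->", analyse(word))
    print("cat<n><pl>", "->", generate("cat<n><pl>"))
```

Analysing a word and then generating from the result returns the original word, which is the sense in which generation reverses analysis.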
Coverage
Coverage is the percentage of words in a given text that the translation system can analyse (that is, assign a part-of-speech tag to, for the monodix, or map to a target word, for the bidix). A translation system needs to perform morphological analysis using its dictionaries. The morphological analysis of Kannada was difficult because the language is highly agglutinative and has many morphological constraints. Using Wikimedia dumps, we sorted the words by frequency; the dumps also helped in calculating coverage.
The coverage of the Kannada analyser:
Number of stems: 22408
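As a rough illustration of the definition above, coverage can be estimated from analyser output in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word is marked with `*`. This is a minimal sketch and not the script used in this project: the function name `coverage` and the sample text are made up for illustration.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (coverage fraction, Counter of unknown surface forms)."""
    total = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An unknown word has a single analysis that starts with '*'.
        if analyses.startswith("*"):
            unknown[surface] += 1
    known = total - sum(unknown.values())
    return (known / total if total else 0.0), unknown


# Made-up sample output; real input would come from the analyser.
sample = "^the/the<det><def><sp>$ ^cat/cat<n><sg>$ ^blorp/*blorp$ ^blorp/*blorp$"

if __name__ == "__main__":
    ratio, missing = coverage(sample)
    print(f"coverage: {ratio:.2%}")
    # The most frequent unknown words are the best candidates to add next.
    print(missing.most_common(5))
```

Counting unknown words by frequency mirrors the frequency-sorted word list: adding the most frequent missing stems first raises coverage fastest.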
Corpus | Coverage |
---|---|
WikiMedia Corpus | 85.70% |
cuni | 78.94% |
A draft of a paper on the Kannada FST can be viewed here.
The coverage of the Kannada-Marathi bidix:
Number of stems: 4411
Corpus | Coverage |
---|---|
WikiMedia Corpus | 76.36% |
cuni | 70.62% |
The word error rate was calculated with a Perl script, following the instructions given here. The test text was the Universal Declaration of Human Rights (UDHR), which has been translated by hand into several languages, including both Kannada and Marathi.
The links to UDHR-Kan and UDHR-Mar.
Kannada-Marathi | Error rate |
---|---|
Word error rate (WER) | 96.96% |
Position-independent word error rate (PER) | 88.21% |
This error rate is very high. A likely reason is that this system translates literally, whereas the hand-made translation of the UDHR need not be literal.
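For readers unfamiliar with the two metrics, the sketch below shows one common way to define them. It is not the Perl script used for the figures above, and the function names are chosen here for illustration. WER is the word-level edit distance divided by the reference length; PER ignores word order and compares the two texts as bags of words.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance relative to the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate (one common bag-of-words definition)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat on the mat sat".split()
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"PER: {per(reference, hypothesis):.2%}")
```

In this example the hypothesis contains the right words in a different order, so PER is zero while WER is not. Both metrics penalise any wording that differs from the single reference, which is consistent with a literal translation scoring poorly against a freer human translation.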
Future work
There is a long way to go before the Kannada-Marathi translation system is complete.
1. Raise the coverage of both the monodix and the bidix above 90%.
2. The current .twol file is empty; all the morphographemic rules need to be added to it.
3. The word order of Kannada and Marathi (both Subject-Object-Verb) is almost the same, with some exceptions when relative clauses appear. There are currently few transfer rules in the bidix. More transfer rules need to be added to the .t1x, .t2x and .t3x files.