Apertium-kan-mar

Latest revision as of 14:42, 13 August 2018

==Description==

The goal of this project was to develop a rule-based translation system for the Kannada-Marathi language pair in Apertium.

==Work==

The Kannada monolingual dictionary was developed from scratch, as Apertium had no pre-existing work on Kannada. The Kannada-Marathi bilingual dictionary was also developed from scratch, with the help of the Marathi monolingual dictionary.

My commits can be accessed at the following link: [https://apertium.projectjj.com/gsoc2018/invo.html commits] or directly in the Apertium repository, [https://github.com/apertium/apertium-kan/commits/master here].

This table shows the dependent GitHub repositories of my GSoC 2018 project.

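{| class="wikitable"
|-
! Apertium GitHub repositories
|-
| [https://github.com/apertium/apertium-kan Kannada monolingual dictionary]
|-
| Kannada-Marathi bilingual dictionary
|-
| [https://github.com/apertium/apertium-mar Marathi monolingual dictionary]
|}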

The link also covers the installation procedure.

My GitHub account name is [https://github.com/MissingBytes/ MissingBytes].<br/>
I worked on the Kannada monolingual dictionary and the Kannada-Marathi bilingual dictionary from scratch. I haven't made any changes to the Marathi monolingual dictionary.

The work I did can be downloaded in [https://apertium.projectjj.com/gsoc2018/invo.tar.gz tar.gz] or [https://apertium.projectjj.com/gsoc2018/invo.zip .zip] format.

==Summary==

A finite-state transducer (FST) for Kannada and a bilingual dictionary for Kannada-Marathi were developed in this project. A morphological analyser is a tool that decomposes an inflected word into its base form and its grammatical information. Generation is the exact reverse of analysis, i.e. obtaining the inflected word from its base form and grammatical information. Morphological analysis and generation are an essential part of rule-based machine translation, an application of natural language processing (NLP).
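The relationship between analysis and generation can be illustrated with a minimal Python sketch. The three-entry lexicon and the tag names (<code>n</code>, <code>sg</code>, <code>pl</code>, <code>nom</code>, <code>loc</code>) are assumptions made up for this illustration and are not taken from the actual apertium-kan dictionary, which is a compiled transducer rather than a lookup table.

```python
# Toy lexicon: surface form -> lexical form (lemma plus grammatical tags).
# Entries and tag names are illustrative assumptions, not apertium-kan data.
ANALYSES = {
    "mane": "mane<n><sg><nom>",       # 'house'
    "manegaḷu": "mane<n><pl><nom>",   # 'houses'
    "maneyalli": "mane<n><sg><loc>",  # 'in the house'
}

# Generation is the same mapping read in the opposite direction.
GENERATIONS = {lexical: surface for surface, lexical in ANALYSES.items()}


def analyse(surface):
    """Return the lexical form, or the word prefixed with '*' if unknown."""
    return ANALYSES.get(surface, "*" + surface)


def generate(lexical):
    """Return the surface form, or the input prefixed with '#' if it cannot be generated."""
    return GENERATIONS.get(lexical, "#" + lexical)
```

For example, <code>analyse("manegaḷu")</code> gives <code>mane<n><pl><nom></code>, and passing that result to <code>generate</code> gives back <code>manegaḷu</code>.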

==Coverage==

Coverage is the percentage of words in a given text that the translation system could analyse (assign a part-of-speech tag, for the monolingual dictionary) or map to a translation (for the bilingual dictionary). A translation system needs the dictionaries to perform morphological analysis. The morphological analysis of Kannada was difficult because the language is highly agglutinative and has many morphological constraints. The Wikimedia dumps let us sort the words by frequency and also helped in calculating coverage.
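A coverage figure of this kind can be computed roughly as in the following Python sketch. It assumes the text has already been run through the analyser and is in the Apertium stream format, where each word appears as <code>^surface/analysis$</code> and an unknown word gets an analysis starting with <code>*</code>. The function name and the simplified regular expression (which ignores escaped characters) are assumptions for illustration; this is not the script that produced the numbers on this page.

```python
import re
from collections import Counter

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (coverage in percent, Counter of unknown surface forms)."""
    total = 0
    known = 0
    unknown = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            unknown[surface] += 1  # the analyser marks unknown words with '*'
        else:
            known += 1
    percent = 100.0 * known / total if total else 0.0
    return percent, unknown
```

Calling <code>unknown.most_common()</code> on the returned counter lists the missing words by frequency, which shows which stems would raise coverage most if added to the dictionary.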

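The coverage of the Kannada analyser:<br/>
Number of stems: 22408
{| class="wikitable"
|-
! Corpus
! Coverage
|-
| WikiMedia Corpus
| 85.70%
|-
| cuni
| 78.94%
|}

A draft of a paper based on the FST for Kannada can be viewed [https://docs.google.com/document/d/137f3SQPVImFlWH15bFU9YPd1NEOU9mvVx_SV-nrlPOo/edit?usp=sharing here].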

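The coverage of the Kan-Mar bidix:<br/>
Number of stems: 4411
{| class="wikitable"
|-
! Corpus
! Coverage
|-
| [https://dumps.wikimedia.org/knwiki/latest/ WikiMedia Corpus]
| 76.36%
|-
| [http://ufal.mff.cuni.cz/~majlis/w2c/download.html cuni]
| 70.62%
|}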


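Word error rate was calculated following the instructions given [http://wiki.apertium.org/wiki/Evaluation here], using a Perl script. The test text was the Universal Declaration of Human Rights (UDHR), which is available in several languages, translated by hand.<br/>
The Kannada and Marathi texts are at [https://www.unicode.org/udhr/d/udhr_kan.html UDHR-Kan] and [https://www.unicode.org/udhr/d/udhr_mar.html UDHR-Mar].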

{| class="wikitable"
|-
! Kannada-Marathi
! Error rate
|-
| Word error rate (WER)
| 96.96%
|-
| Position-independent word error rate (PER)
| 88.21%
|}

This error rate is very high. A likely reason is that the system translates literally, whereas the hand-made UDHR translations need not be literal.
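For reference, the two metrics can be sketched in Python as below. This uses a common textbook formulation (word-level edit distance for WER, bag-of-words matching for PER) and assumes whitespace tokenisation and a non-empty reference; it is not the Perl script used for the figures above, whose exact counting may differ.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate in percent: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, it is never higher than WER under this formulation. A translation that contains the right words in a different order scores well on PER but poorly on WER.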

==Future work==

There is a long way to go before the Kannada-Marathi translation system is complete.

1. Raise the coverage of both the monolingual dictionary and the bilingual dictionary above 90%.

2. The current .twol file is empty; all the morphographemic rules need to be added to it.

3. The word order in Kannada and Marathi (Subject-Object-Verb) is almost the same, with some exceptions when relative clauses appear. The pair currently has only a few transfer rules. More transfer rules need to be added to the .t1x, .t2x and .t3x files.