Difference between revisions of "Turkic MT Improvements GSoC2019 report"

==Disambiguation==
 
Apertium uses Constraint Grammar (CG) to choose the correct lemma and morphological analysis for each word, so that it can be translated correctly into the target language.
 
 
==Lexical Selection==
 
Lexical selection determines which translation of a given lemma is chosen in a given context.
 
   
 

Revision as of 15:27, 25 August 2019

The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.

Commits

My commits can be found below, for each repository:

Tur-Uzb, Tur, Uzb, Uig-Tur, Uig, Tur-Tat, Tat, Tur-Kir, Kir

Transfer

Transfer rules were written for tur->uig and uzb->tur, using regression tests. They can be found here: Uighur and Uzbek.
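
A regression-test workflow like the one mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual tooling: `run_regression_tests`, `stand_in_translate` and the placeholder sentences are all hypothetical, and in practice the `translate` callable would wrap an installed translation pipeline.

```python
from typing import Callable, Iterable, List, Tuple


def run_regression_tests(
    translate: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (source, expected, actual) for every case that fails."""
    failures = []
    for source, expected in cases:
        actual = translate(source)
        # Compare ignoring leading/trailing whitespace only.
        if actual.strip() != expected.strip():
            failures.append((source, expected, actual))
    return failures


# Hypothetical stand-in for a real translator, used only for this demo.
def stand_in_translate(text: str) -> str:
    table = {
        "source sentence 1": "target sentence 1",
        "source sentence 2": "wrong output",
    }
    # Words the stand-in does not know are returned with a leading asterisk.
    return table.get(text, "*" + text)


cases = [
    ("source sentence 1", "target sentence 1"),
    ("source sentence 2", "target sentence 2"),
]

failures = run_regression_tests(stand_in_translate, cases)
for source, expected, actual in failures:
    print(f"FAIL: {source!r}: expected {expected!r}, got {actual!r}")
```

Re-running such a check after each new transfer rule shows whether the rule fixed its target sentence without breaking earlier ones.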


Corpora and Coverage

Pair      Wiki words   Wiki coverage   Bible words   Bible coverage
Tur-Uig   53505239     82.3%           178233        93.0%
Uzb-Tur   12730161     80.8%           184447        83.5%
Kir-Tur   11435418     82.5%           184808        92.0%
Tat-Tur   5792382      86.4%           178220        91.4%
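
Coverage figures like those above are commonly computed as the share of corpus tokens that receive at least one analysis from the morphological analyser. The sketch below assumes that definition and assumes the input is analyser output in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and unknown words appear as `^surface/*surface$`. The function name and the sample analyses are illustrative only, and escaped slashes inside surface forms are not handled.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Fraction of lexical units that have at least one known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        # The first field is the surface form; the rest are analyses.
        analyses = unit.split("/")[1:]
        # An analysis starting with '*' marks an unknown word.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / len(units)


# Illustrative sample: three known units and one unknown unit.
sample = (
    "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ "
    "^geldi/gel<v><iv><past><p3><sg>$ ^./.<sent>$"
)
print(f"{naive_coverage(sample):.1%}")  # prints 75.0%
```

This counts punctuation as tokens; whether the reported figures do the same is not stated in the report.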


Future Plans

The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, so the analyser also needs further work. More lexical selection, disambiguation and transfer rules are needed to achieve better translation quality on all pairs.