Turkic MT Improvements GSoC2019 report
The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.
My commits can be found below, for each repository:
Corpora and Coverage
|Tur-Uig||53505239 words, 82.3% cov||178233 words, 93.0% cov|
|Uzb-Tur||12730161 words, 80.8% cov||184447 words, 83.5% cov|
|Kir-Tur||11435418 words, 82.5% cov||184808 words, 92.0% cov|
|Tat-Tur||5792382 words, 86.4% cov||178220 words, 91.4% cov|
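The coverage figures above are presumably naive coverage, i.e. the share of corpus tokens for which the morphological analyser returns at least one analysis. A minimal sketch of that calculation follows. It assumes the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and unknown words are marked with a leading asterisk. The `coverage` function and the sample sentence are illustrative only, not the script used for this report, and escaped characters in the stream are ignored.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # Unknown words are marked with a leading asterisk, e.g. ^foo/*foo$
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known tokens and one unknown token.
sample = "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ ^geldi/gel<v><iv><past><p3><sg>$ ^./.<sent>$"
print(f"{coverage(sample):.1%}")  # prints 75.0%
```

On a real corpus the input would be the analyser output for the whole text, and the unknown tokens can be collected and sorted by frequency to decide which dictionary entries to add first.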
All dictionaries were improved in the first stage of the project, with help from the mentors on the Kipchak languages.
To select the correct lemma and morphological analysis for each word, so that it can be translated correctly into the target language, Apertium uses Constraint Grammar (CG). As part of the project, CG rules were added where necessary. Uzbek and Turkish in particular received extensive attention in this regard.
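To illustrate what such a rule does, here is a toy Python sketch of a single context-based removal. It is not real CG syntax and is not taken from the grammars written for this project. It loosely mirrors a CG rule of the form `REMOVE v IF (1C n)`, which drops a verb reading when the next word is unambiguously a noun. The tags and the example sentence are simplified assumptions.

```python
def remove_reading(sentence, target, next_tag):
    """Drop `target` from a token's readings when the next token is
    unambiguously `next_tag`. As in CG, the last remaining reading
    of a token is never removed."""
    out = []
    for i, (surface, readings) in enumerate(sentence):
        kept = set(readings)
        has_next = i + 1 < len(sentence)
        if (
            has_next
            and sentence[i + 1][1] == {next_tag}  # next token has only this reading
            and target in kept
            and len(kept) > 1
        ):
            kept.discard(target)
        out.append((surface, kept))
    return out


# Hypothetical input: Turkish "yüz" can be a noun, a numeral or a verb.
sentence = [("yüz", {"n", "num", "v"}), ("kitap", {"n"})]
disambiguated = remove_reading(sentence, "v", "n")
print(sorted(disambiguated[0][1]))  # the verb reading is gone
```

A real grammar applies many such rules in sequence, so a second rule would still be needed here to choose between the noun and numeral readings.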
The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, and the handling of both needs further work. More lexical selection, disambiguation and transfer rules are needed to achieve higher translation quality for all pairs.