Turkic MT Improvements GSoC2019 report
The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.
Commits
My commits can be found below, in each repository:
Tur-Uzb Tur Uzb Uig-Tur Uig Tur-Tat Tat Tur-Kir Kir
Transfer
Transfer rules were written for tur->uig and uzb->tur, using Regression Tests. They can be found here: Uighur and Uzbek.
Corpora and Coverage
Pair | Wiki | Bible |
---|---|---|
Tur-Uig | 53505239 words, 82.3% cov | 178233 words, 93.0% cov |
Uzb-Tur | 12730161 words, 80.8% cov | 184447 words, 83.5% cov |
Kir-Tur | 11435418 words, 82.5% cov | 184808 words, 92.0% cov |
Tat-Tur | 5792382 words, 86.4% cov | 178220 words, 91.4% cov |
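
The coverage figures above are the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation is given below, assuming the analyser output is in Apertium stream format, where an unknown word appears as `^word/*word$`. The function name `naive_coverage`, the sample sentence and its tags are illustrative only, and the regular expression ignores escaped characters, so this is a simplification rather than the script used for this report.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return (known, total) token counts for analyser output."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An unknown word is emitted as ^word/*word$,
        # so its analysis part starts with '*'.
        if not match.group(2).startswith("*"):
            known += 1
    return known, total


# Hypothetical analyser output; the tags are for illustration only.
sample = "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ ^ve/ve<cnjcoo>$ ^kitap/kitap<n><nom>$"
known, total = naive_coverage(sample)
coverage = 100.0 * known / total
print(f"{total} words, {coverage:.1f}% cov")
```

Here three of the four sample tokens have an analysis, so the sketch reports them in the same "words, % cov" form as the table.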
Future Plans
The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, and it also needs further improvement. More lexical selection, disambiguation and transfer rules are needed to achieve better translation quality on all pairs.