Turkic MT Improvements GSoC2019 report

The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.

Commits

My commits to each repository can be found below:

Tur-Uzb Tur Uzb Uig-Tur Uig Tur-Tat Tat Tur-Kir Kir

Transfer

Transfer rules were written for tur->uig and uzb->tur, using regression tests. They can be found here: Uighur and Uzbek.

Corpora and Coverage

Pair	Wiki words	Wiki coverage	Bible words	Bible coverage
Tur-Uig	53505239	82.3%	178233	93.0%
Uzb-Tur	12730161	80.8%	184447	83.5%
Kir-Tur	11435418	82.5%	184808	92.0%
Tat-Tur	5792382	86.4%	178220	91.4%
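
For readers unfamiliar with the coverage metric, the following is a minimal sketch of how naive coverage can be computed from morphological analyser output. It assumes the Apertium stream format, in which each lexical unit looks like ^surface/analysis$ and an unknown word is marked with a leading asterisk in the analysis. The function name, the sample string, and its tags are illustrative assumptions, not taken from this project, and the sketch does not handle escaped characters. The figures in the table above may have been produced with a different script and different tokenisation.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the share of lexical units that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # The analyser marks unknown words with a leading '*' in the analysis.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known units and one unknown unit.
sample = "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ ^geldi/gel<v><iv><past><p3><sg>$ ^./.<sent>$"
print(f"{naive_coverage(sample):.1%}")  # prints 75.0%
```

In practice the input would be the analyser's output over a whole corpus rather than a short string. Whether punctuation and numerals are counted as tokens affects the result, so numbers are only comparable when they are computed the same way.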

Future Plans

The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, and the handling of both needs further work. More lexical selection, disambiguation, and transfer rules are needed to achieve better translation quality on all pairs.
