Turkic MT Improvements GSoC2019 report
Revision as of 15:27, 25 August 2019
The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.
Commits
My commits can be found below, in each repository:
Tur-Uzb, Tur, Uzb, Uig-Tur, Uig, Tur-Tat, Tat, Tur-Kir, Kir
Transfer
Transfer rules were written for tur->uig and uzb->tur, using Regression Tests. They can be found here: Uighur and Uzbek.
Corpora and Coverage
Language pair | Wiki corpus | Bible corpus |
---|---|---|
Tur-Uig | 53505239 words, 82.3% coverage | 178233 words, 93.0% coverage |
Uzb-Tur | 12730161 words, 80.8% coverage | 184447 words, 83.5% coverage |
Kir-Tur | 11435418 words, 82.5% coverage | 184808 words, 92.0% coverage |
Tat-Tur | 5792382 words, 86.4% coverage | 178220 words, 91.4% coverage |
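The report does not say how these coverage figures were computed. As a rough illustration only, the sketch below assumes "naive coverage": the share of tokens in analyser output that receive at least one analysis, where unknown words are marked with a leading `*` in the Apertium stream format. The function name `naive_coverage`, the sample sentence and its tags are hypothetical, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification (assumption): escaped characters such as \/ or \$ are ignored.
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that have at least one analysis.

    A unit whose analysis part starts with '*' is treated as unknown.
    """
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: three known units and one unknown word.
sample = (
    "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ "
    "^geldi/gel<v><iv><past><p3><sg>$ ^./.<sent>$"
)
print(f"{naive_coverage(sample):.1%}")  # prints 75.0%
```

Note that this sketch counts punctuation as tokens; whether the figures in the table do the same is not stated in the report.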
Future Plans
The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, and the handling of both needs further work. More lexical selection, disambiguation and transfer rules are needed to achieve better translation quality on all pairs.