Turkic MT Improvements GSoC2019 report
The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur and tat->tur. The first phase mostly involved extending the bidix and lexc entries of each language pair. After that, the focus shifted to tur->uig and uzb->tur, for which CG and transfer rules were written to improve translation quality.
Commits
My commits on each pair can be found here.
Transfer
Transfer rules were written for tur->uig and uzb->tur, guided by regression tests. They can be found here: Uighur and Uzbek. These rules target what the system could not already translate. To make those translations possible, missing suffixation patterns, lexical items and disambiguation rules were also added to the relevant dictionaries alongside the transfer rules.
Corpora and Coverage
Pair | Wikipedia | Bible |
---|---|---|
Tur-Uig | 53505239 words, 82.3% cov | 178233 words, 93.0% cov |
Uzb-Tur | 12730161 words, 80.8% cov | 184447 words, 83.5% cov |
Kir-Tur | 11435418 words, 82.5% cov | 184808 words, 92.0% cov |
Tat-Tur | 5792382 words, 86.4% cov | 178220 words, 91.4% cov |
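The coverage figures above are the share of corpus tokens for which the analyser returns at least one analysis. A minimal sketch of that calculation follows, assuming the corpus has already been run through the analyser and is in Apertium stream format, where an unknown token's analysis starts with `*`. The function name `coverage` and the sample tags are illustrative only, and the script actually used for this report may count tokens differently.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that received an analysis."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown tokens are emitted as ^word/*word$
        if not analyses.startswith("*"):
            known += 1
    return 100.0 * known / total if total else 0.0


# Hypothetical analyser output: two known tokens and one unknown token.
sample = "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ ^geldi/gel<v><iv><ifi><p3><sg>$"
print(f"{coverage(sample):.1f}% coverage")
```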
Dictionaries
All dictionaries were improved in the first stage of the project, with help from the mentors on the Kipchak languages. The most frequent unknown tokens in each language's corpora (mostly Wikipedia entries, the Bible and the Quran) were added. Around 800-1000 entries were added to each language pair's bidix.
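Finding the most frequent unknown tokens can be sketched as a frequency count over analysed text. This assumes Apertium stream format with unknown tokens marked by `*`. The helper name `top_unknowns` and the choice to lowercase tokens before counting are assumptions made for illustration, not a description of the workflow actually used.

```python
import re
from collections import Counter

# Captures the surface form of lexical units whose analysis starts with '*'.
UNKNOWN_UNIT = re.compile(r"\^([^/^$]*)/\*[^$]*\$")


def top_unknowns(analysed_text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent unknown surface forms with their counts."""
    counts = Counter(form.lower() for form in UNKNOWN_UNIT.findall(analysed_text))
    return counts.most_common(n)


# Hypothetical analyser output with three unknown tokens, two of them the same word.
sample_unknowns = "^foo/*foo$ ^bar/bar<n>$ ^Foo/*Foo$ ^baz/*baz$"
for form, count in top_unknowns(sample_unknowns):
    print(count, form)
```

The resulting list gives a priority order for adding entries to the monolingual and bilingual dictionaries.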
Disambiguation
Apertium uses Constraint Grammar (CG) to select the correct lemma and morphological analysis of each word, so that it can be translated correctly into the target language. As part of the project, CG rules were added where necessary. Uzbek and Turkish in particular received extensive attention in this regard. To improve translation, the same or similar CG rules were written for these pairs wherever a rule did not clash with the intrinsic patterns of a language.
WER
---Uzbek---
Test file: 'mattauzbtr.txt'
Reference file: 'mattaturk.txt'
Statistics about input files
Number of words in reference: 565
Number of words in test: 579
Number of unknown words (marked with a star) in test: 124
Percentage of unknown words: 21.42 %
Results when removing unknown-word marks (stars)
Edit distance: 177
Word error rate (WER): 31.33 %
Number of position-independent correct words: 408
Position-independent word error rate (PER): 30.27 %
Results when unknown-word marks (stars) are not removed
Edit distance: 188
Word error rate (WER): 33.27 %
Number of position-independent correct words: 397
Position-independent word error rate (PER): 32.21 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 11
Percentage of unknown words that were free rides: 8.87 %
---Uighur---
Test file: 'matta1turuig.txt'
Reference file: 'matta1uygur.txt'
Statistics about input files
Number of words in reference: 565
Number of words in test: 572
Number of unknown words (marked with a star) in test: 22
Percentage of unknown words: 3.85 %
Results when removing unknown-word marks (stars)
Edit distance: 270
Word error rate (WER): 47.79 %
Number of position-independent correct words: 308
Position-independent word error rate (PER): 46.73 %
Results when unknown-word marks (stars) are not removed
Edit distance: 270
Word error rate (WER): 47.79 %
Number of position-independent correct words: 308
Position-independent word error rate (PER): 46.73 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
---Kyrgyz---
Test file: 'mattakirtr.txt'
Reference file: 'mattaturkkir.txt'
Statistics about input files
Number of words in reference: 569
Number of words in test: 669
Number of unknown words (marked with a star) in test: 63
Percentage of unknown words: 9.42 %
Results when removing unknown-word marks (stars)
Edit distance: 389
Word error rate (WER): 68.37 %
Number of position-independent correct words: 286
Position-independent word error rate (PER): 67.31 %
Results when unknown-word marks (stars) are not removed
Edit distance: 389
Word error rate (WER): 68.37 %
Number of position-independent correct words: 286
Position-independent word error rate (PER): 67.31 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
---Tatar---
Test file: 'mattatattr.txt'
Reference file: 'mattaturktatar.txt'
Statistics about input files
Number of words in reference: 573
Number of words in test: 587
Number of unknown words (marked with a star) in test: 66
Percentage of unknown words: 11.24 %
Results when removing unknown-word marks (stars)
Edit distance: 218
Word error rate (WER): 38.05 %
Number of position-independent correct words: 375
Position-independent word error rate (PER): 37.00 %
Results when unknown-word marks (stars) are not removed
Edit distance: 218
Word error rate (WER): 38.05 %
Number of position-independent correct words: 375
Position-independent word error rate (PER): 37.00 %
Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
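The WER figures reported above are consistent with the word-level edit distance divided by the number of words in the reference, for example 177 / 565 ≈ 31.33 % for Uzbek. The sketch below shows that calculation on tokenised sentences. It is an illustrative reimplementation, not the evaluation tool that produced the output above, and the function names `edit_distance` and `wer` are chosen here for the example.

```python
def edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference word
                current[j - 1] + 1,                   # insert a hypothesis word
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Toy example: one substitution and one deletion against a four-word reference.
ref = "a b c d".split()
hyp = "a x c".split()
print(f"WER: {wer(ref, hyp):.2f} %")

# Reproducing the Uzbek figure from the reported edit distance and reference length.
print(f"Uzbek WER: {100.0 * 177 / 565:.2f} %")
```

The PER figures are likewise consistent with (words in test - position-independent correct words) / words in reference, for example (579 - 408) / 565 ≈ 30.27 % for Uzbek.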
Future Plans
The Uzbek lexicon still needs to be improved. Analysis of Uzbek, which also needs further work, can be problematic because of the language's unusual alphabet and its non-standard forms. More lexical selection, disambiguation and transfer rules are needed to achieve higher translation quality on all pairs.