User:Oğuz/GSoC 2019 progress


Progress on the 2019 GSoC project "Turkic MT Improvements".


Week           uig Cov.  uig WER  uig BLEU  uzb Cov.  uzb WER  uzb BLEU  tat Cov.  tat WER  tat BLEU  kir Cov.  kir WER  kir BLEU  On Track?
July 8th-14th


First Evaluation

Coverages

Pair      Wiki                             Bible
Tur-Uig   53505239 words, 82.3% cov        178233 words, 93.0% cov
Uzb-Tur   12730161 words, 80.8% cov        184447 words, 81.1% cov
Kir-Tur   11435418 words, 82.8% cov        184808 words, 93.4% cov
Tat-Tur   --                               178220 words, 91.4% cov
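The coverage figures above are presumably the share of corpus tokens that the morphological analyser recognises. A minimal sketch of that calculation follows, assuming analyser output in the Apertium stream format, where an unknown word appears as ^word/*word$. The function name `coverage`, the sample sentence and its tags are made up for illustration, escaped characters are not handled, and the script actually used for the numbers above may differ.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplification: escaped "/" or "$" inside a surface form is not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> tuple[int, int, float]:
    """Return (total tokens, known tokens, coverage in percent)."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An analysis beginning with '*' marks the word as unknown.
        if not match.group(2).startswith("*"):
            known += 1
    pct = 100.0 * known / total if total else 0.0
    return total, known, pct


# Hypothetical analyser output: two known words and one unknown word.
sample = "^Bu/bu<det><dem>$ ^kitap/kitap<n><nom>$ ^xyzzy/*xyzzy$"
total, known, pct = coverage(sample)
print(f"{total} words, {pct:.1f}% cov")
```

The printed line uses the same "N words, P% cov" layout as the table.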

WER results

1st evaluation WER results:


Uzbek


Test file: 'istanbultr.txt'
Reference file 'turistanbul.txt'

Statistics about input files


Number of words in reference: 206
Number of words in test: 208
Number of unknown words (marked with a star) in test: 28
Percentage of unknown words: 13.46 %

Results when removing unknown-word marks (stars)


Edit distance: 78
Word error rate (WER): 37.86 %
Number of position-independent correct words: 132
Position-independent word error rate (PER): 36.89 %
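For reference, WER is the word-level edit distance divided by the number of words in the reference (78 / 206 = 37.86 %). The reported PER values are consistent with (words in test - position-independent correct words) / words in reference, e.g. (208 - 132) / 206 = 36.89 %. The sketch below implements both. It is not the evaluation script's code: the function names and the toy sentences are illustrative, and the use of max() in the PER numerator is an assumption (the test text is longer than the reference in all four results on this page, so they cannot confirm it).

```python
from collections import Counter


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref: list[str], hyp: list[str]) -> float:
    """Position-independent error rate in percent (assumed formula)."""
    # Words counted as correct regardless of position: multiset overlap.
    correct = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - correct) / len(ref)


# Toy example: two words swapped and one extra word in the test text.
ref = "bu bir küçük test".split()
hyp = "bu küçük bir test cümlesi".split()
print(edit_distance(ref, hyp), wer(ref, hyp), per(ref, hyp))
```

In the toy example the swapped words count towards WER but not towards PER, which is why PER is never higher than WER in the results on this page.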

Results when unknown-word marks (stars) are not removed


Edit distance: 76
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 134
Position-independent word error rate (PER): 35.92 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -2
Percentage of unknown words that were free rides: -7.14 %
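The free-ride figures can be reproduced from the numbers already reported: in every result on this page, the count equals the edit distance with marks kept minus the edit distance with marks removed (76 - 78 = -2 here), and the percentage is that count divided by the number of unknown words (-2 / 28 = -7.14 %). The sketch below checks this for all four languages. The direction of the subtraction is inferred from the reported values, not taken from the evaluation script, and the function name is illustrative.

```python
def free_rides(dist_marks_removed: int, dist_marks_kept: int,
               unknown_words: int) -> tuple[int, float]:
    """Return (free-ride count, percent of unknown words), as inferred."""
    rides = dist_marks_kept - dist_marks_removed
    pct = 100.0 * rides / unknown_words if unknown_words else 0.0
    return rides, pct


# (distance with marks removed, distance with marks kept, unknown words),
# copied from the results on this page.
reported = {
    "Uzbek": (78, 76, 28),
    "Kyrgyz": (113, 108, 55),
    "Tatar": (103, 102, 36),
    "Uyghur": (61, 61, 20),
}

for language, figures in reported.items():
    rides, pct = free_rides(*figures)
    print(f"{language}: {rides} free rides, {pct:.2f} %")
```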


Kyrgyz


Test file: 'kazantr.txt'
Reference file 'kazanturkce.txt'

Statistics about input files


Number of words in reference: 223
Number of words in test: 227
Number of unknown words (marked with a star) in test: 55
Percentage of unknown words: 24.23 %

Results when removing unknown-word marks (stars)


Edit distance: 113
Word error rate (WER): 50.67 %
Number of position-independent correct words: 119
Position-independent word error rate (PER): 48.43 %

Results when unknown-word marks (stars) are not removed


Edit distance: 108
Word Error Rate (WER): 48.43 %
Number of position-independent correct words: 124
Position-independent word error rate (PER): 46.19 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -5
Percentage of unknown words that were free rides: -9.09 %


Tatar


Test file: 'kazantr.txt'
Reference file 'kazantur.txt'

Statistics about input files


Number of words in reference: 195
Number of words in test: 210
Number of unknown words (marked with a star) in test: 36
Percentage of unknown words: 17.14 %

Results when removing unknown-word marks (stars)


Edit distance: 103
Word error rate (WER): 52.82 %
Number of position-independent correct words: 112
Position-independent word error rate (PER): 50.26 %

Results when unknown-word marks (stars) are not removed


Edit distance: 102
Word Error Rate (WER): 52.31 %
Number of position-independent correct words: 113
Position-independent word error rate (PER): 49.74 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -1
Percentage of unknown words that were free rides: -2.78 %


Uyghur


Test file: 'cumhuriyet-1.txt'
Reference file 'cumhuriyetturkce.txt'

Statistics about input files


Number of words in reference: 354
Number of words in test: 359
Number of unknown words (marked with a star) in test: 20
Percentage of unknown words: 5.57 %

Results when removing unknown-word marks (stars)


Edit distance: 61
Word error rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Results when unknown-word marks (stars) are not removed


Edit distance: 61
Word Error Rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %