Difference between revisions of "User:Oğuz/GSoC 2019 progress"

From Apertium
(Created page with "Progress on 2019 GSoC Project "Turkic MT improvement". ---- == WER results == ''1st evaluation WER resulsts: '' '''Uzbek''' '''Kyrgyz''' '''Tatar''' '''Uyghur'''")
 
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Progress on 2019 GSoC Project "Turkic MT improvement".
Progress on 2019 GSoC Project [http://wiki.apertium.org/wiki/User:Oğuz/GSoC_2019 Turkic MT Improvements].




{| class="wikitable"
----
|-
! Week
! uig Cov.
! uig WER
! uig BLEU
! uzb Cov.
! uzb WER
! uzb BLEU
! tat Cov.
! tat WER
! tat BLEU
! kir Cov.
! kir WER
! kir BLEU
! On Track?
|-
| July 8th-14th
|
|
|
|
|
|
|
|
|
|
|
|
|}




== WER results ==
== First Evaluation ==
=== Coverages ===


{| class="wikitable"
''1st evaluation WER resulsts:
|-
! L
! Wiki
! Bible
|-
| Tur-Uig
| 53505239 words, 82.3% cov
| 178233 words, 93.0% cov
|-
| Uzb-Tur
| 12730161 words, 80.8% cov
| 184447 words, 81.1% cov
|-
| Kir-Tur
| 11435418 words, 82.8% cov
| 184808 words, 93.4% cov
|-
| Tat-Tur
| --
| 178220 words, 91.4% cov
|}

=== WER results ===


''1st evaluation WER results:
''
''



'''Uzbek'''
'''Uzbek'''


Test file: 'istanbultr.txt'
Reference file 'turistanbul.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 206
Number of words in test: 208
Number of unknown words (marked with a star) in test: 28
Percentage of unknown words: 13.46 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 78
Word error rate (WER): 37.86 %
Number of position-independent correct words: 132
Position-independent word error rate (PER): 36.89 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 76
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 134
Position-independent word error rate (PER): 35.92 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -2
Percentage of unknown words that were free rides: -7.14 %





'''Kyrgyz'''
'''Kyrgyz'''


Test file: 'kazantr.txt'
Reference file 'kazanturkce.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 223
Number of words in test: 227
Number of unknown words (marked with a star) in test: 55
Percentage of unknown words: 24.23 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 113
Word error rate (WER): 50.67 %
Number of position-independent correct words: 119
Position-independent word error rate (PER): 48.43 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 108
Word Error Rate (WER): 48.43 %
Number of position-independent correct words: 124
Position-independent word error rate (PER): 46.19 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -5
Percentage of unknown words that were free rides: -9.09 %




'''Tatar'''
'''Tatar'''


Test file: 'kazantr.txt'
Reference file 'kazantur.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 195
Number of words in test: 210
Number of unknown words (marked with a star) in test: 36
Percentage of unknown words: 17.14 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 103
Word error rate (WER): 52.82 %
Number of position-independent correct words: 112
Position-independent word error rate (PER): 50.26 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 102
Word Error Rate (WER): 52.31 %
Number of position-independent correct words: 113
Position-independent word error rate (PER): 49.74 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -1
Percentage of unknown words that were free rides: -2.78 %




'''Uyghur'''
'''Uyghur'''


Test file: 'cumhuriyet-1.txt'
Reference file 'cumhuriyetturkce.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 354
Number of words in test: 359
Number of unknown words (marked with a star) in test: 20
Percentage of unknown words: 5.57 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 61
Word error rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 61
Word Error Rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %

Latest revision as of 16:05, 12 July 2019

Progress on 2019 GSoC Project Turkic MT Improvements.


{| class="wikitable"
|-
! Week
! uig Cov.
! uig WER
! uig BLEU
! uzb Cov.
! uzb WER
! uzb BLEU
! tat Cov.
! tat WER
! tat BLEU
! kir Cov.
! kir WER
! kir BLEU
! On Track?
|-
| July 8th-14th
|
|
|
|
|
|
|
|
|
|
|
|
|}


First Evaluation

Coverages

{| class="wikitable"
|-
! L
! Wiki
! Bible
|-
| Tur-Uig
| 53505239 words, 82.3% cov
| 178233 words, 93.0% cov
|-
| Uzb-Tur
| 12730161 words, 80.8% cov
| 184447 words, 81.1% cov
|-
| Kir-Tur
| 11435418 words, 82.8% cov
| 184808 words, 93.4% cov
|-
| Tat-Tur
| --
| 178220 words, 91.4% cov
|}
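
The coverage figures above are the share of corpus tokens that received an analysis. A minimal sketch of that calculation follows. It assumes that unknown tokens are marked with a leading star, as in the WER output on this page; the function name `coverage` and the sample sentence are hypothetical and are not taken from the corpora or from any Apertium tool.

```python
def coverage(tokens):
    """Return the fraction of tokens that are not marked as unknown.

    Assumption: an unknown token is one that starts with '*'.
    """
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if not tok.startswith("*"))
    return known / len(tokens)


# Hypothetical tokenised output with one unknown word out of four.
sample = ["bu", "*kitob", "juda", "yaxshi"]
print(f"{coverage(sample):.1%}")  # prints 75.0%
```

On a real corpus the token list would come from the analyser output rather than a hand-written sample.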

WER results

1st evaluation WER results:


Uzbek


Test file: 'istanbultr.txt'
Reference file 'turistanbul.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 206
Number of words in test: 208
Number of unknown words (marked with a star) in test: 28
Percentage of unknown words: 13.46 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 78
Word error rate (WER): 37.86 %
Number of position-independent correct words: 132
Position-independent word error rate (PER): 36.89 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 76
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 134
Position-independent word error rate (PER): 35.92 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -2
Percentage of unknown words that were free rides: -7.14 %
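
The percentages in the Uzbek report can be recomputed from the raw counts, which is a quick sanity check on the figures. The sketch below assumes that WER is the edit distance divided by the number of reference words, and that PER adds a penalty for surplus test words. Both formulas, and the free-ride relation, are inferred from the numbers on this page rather than taken from the evaluation tool's source. The function names are hypothetical.

```python
def rates(ref_words, test_words, edit_distance, correct, unknown):
    """Recompute WER, PER and unknown-word percentage from raw counts.

    Assumed formulas (inferred from the reported figures):
      WER = edit_distance / ref_words
      PER = (ref_words - correct + max(0, test_words - ref_words)) / ref_words
    """
    wer = 100 * edit_distance / ref_words
    per = 100 * (ref_words - correct + max(0, test_words - ref_words)) / ref_words
    unk = 100 * unknown / test_words
    return round(wer, 2), round(per, 2), round(unk, 2)


def free_rides(distance_stars_kept, distance_stars_removed, unknown):
    """Assumed relation: free rides = difference of the two edit distances."""
    count = distance_stars_kept - distance_stars_removed
    return count, round(100 * count / unknown, 2)


# Uzbek counts as reported above.
print(rates(206, 208, 78, 132, 28))  # (37.86, 36.89, 13.46)
print(rates(206, 208, 76, 134, 28))  # (36.89, 35.92, 13.46)
print(free_rides(76, 78, 28))        # (-2, -7.14)
```

The same formulas reproduce the Kyrgyz, Tatar and Uyghur percentages when given their counts.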


Kyrgyz


Test file: 'kazantr.txt'
Reference file 'kazanturkce.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 223
Number of words in test: 227
Number of unknown words (marked with a star) in test: 55
Percentage of unknown words: 24.23 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 113
Word error rate (WER): 50.67 %
Number of position-independent correct words: 119
Position-independent word error rate (PER): 48.43 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 108
Word Error Rate (WER): 48.43 %
Number of position-independent correct words: 124
Position-independent word error rate (PER): 46.19 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -5
Percentage of unknown words that were free rides: -9.09 %


Tatar


Test file: 'kazantr.txt'
Reference file 'kazantur.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 195
Number of words in test: 210
Number of unknown words (marked with a star) in test: 36
Percentage of unknown words: 17.14 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 103
Word error rate (WER): 52.82 %
Number of position-independent correct words: 112
Position-independent word error rate (PER): 50.26 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 102
Word Error Rate (WER): 52.31 %
Number of position-independent correct words: 113
Position-independent word error rate (PER): 49.74 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: -1
Percentage of unknown words that were free rides: -2.78 %


Uyghur


Test file: 'cumhuriyet-1.txt'
Reference file 'cumhuriyetturkce.txt'

Statistics about input files
-------------------------------------------------------
Number of words in reference: 354
Number of words in test: 359
Number of unknown words (marked with a star) in test: 20
Percentage of unknown words: 5.57 %

Results when removing unknown-word marks (stars)
-------------------------------------------------------
Edit distance: 61
Word error rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Results when unknown-word marks (stars) are not removed
-------------------------------------------------------
Edit distance: 61
Word Error Rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Statistics about the translation of unknown words
-------------------------------------------------------
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
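
The "Edit distance" lines in the reports above count word-level insertions, deletions and substitutions between the test and reference texts. A minimal sketch of that measure, using the standard Levenshtein dynamic programme over whitespace-separated words, is given here for orientation. It is not the evaluation script that produced these reports, and the names `word_edit_distance` and `wer` are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two texts, counted in words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return 100 * word_edit_distance(reference, hypothesis) / n_ref


# Toy strings: one substitution (b -> x) and one deletion (d).
print(word_edit_distance("a b c d", "a x c"))  # prints 2
print(wer("a b c d", "a x c"))                 # prints 50.0
```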