Difference between revisions of "User:Oğuz/GSoC 2019 progress"

From Apertium
(Created page with "Progress on 2019 GSoC Project "Turkic MT improvement". ---- == WER results == ''1st evaluation WER resulsts: '' '''Uzbek''' '''Kyrgyz''' '''Tatar''' '''Uyghur'''")
 
Revision as of 11:32, 28 June 2019

Progress on 2019 GSoC Project "Turkic MT improvement".




WER results

1st evaluation WER results:

Uzbek


Test file: 'istanbultr.txt'
Reference file 'turistanbul.txt'

Statistics about input files


Number of words in reference: 206
Number of words in test: 208
Number of unknown words (marked with a star) in test: 28
Percentage of unknown words: 13.46 %

Results when removing unknown-word marks (stars)


Edit distance: 78
Word error rate (WER): 37.86 %
Number of position-independent correct words: 132
Position-independent word error rate (PER): 36.89 %

Results when unknown-word marks (stars) are not removed


Edit distance: 76
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 134
Position-independent word error rate (PER): 35.92 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -2
Percentage of unknown words that were free rides: -7.14 %
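
The percentages in the Uzbek report above can be recomputed from its raw counts, which is a quick check when comparing runs. The sketch below is not the evaluation script's own code. The function names are hypothetical, and the PER and free-ride formulas are assumptions chosen because they reproduce the figures in all four reports on this page.

```python
def pct(numerator, denominator):
    """Percentage rounded to two decimals, as printed in the reports."""
    return round(100.0 * numerator / denominator, 2)


def derived_metrics(n_ref, n_test, n_unknown, edit_distance, n_correct):
    """Recompute the percentage figures of one report section from raw counts."""
    return {
        "unknown_pct": pct(n_unknown, n_test),
        "wer": pct(edit_distance, n_ref),
        # Assumed PER formula: consistent with the reports here, where the
        # test text is always at least as long as the reference.
        "per": pct(max(n_ref, n_test) - n_correct, n_ref),
    }


def free_rides(ed_marks_kept, ed_marks_removed, n_unknown):
    """Assumed relation: the reported count equals the difference between
    the two edit distances, which is why it can come out negative."""
    count = ed_marks_kept - ed_marks_removed
    return count, pct(count, n_unknown)


# Uzbek figures, section "removing unknown-word marks (stars)"
uzbek = derived_metrics(n_ref=206, n_test=208, n_unknown=28,
                        edit_distance=78, n_correct=132)
uzbek_free = free_rides(ed_marks_kept=76, ed_marks_removed=78, n_unknown=28)
print(uzbek, uzbek_free)
```

Both error rates are normalised by the number of words in the reference, not in the test output.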


Kyrgyz


Test file: 'kazantr.txt'
Reference file 'kazanturkce.txt'

Statistics about input files


Number of words in reference: 223
Number of words in test: 227
Number of unknown words (marked with a star) in test: 55
Percentage of unknown words: 24.23 %

Results when removing unknown-word marks (stars)


Edit distance: 113
Word error rate (WER): 50.67 %
Number of position-independent correct words: 119
Position-independent word error rate (PER): 48.43 %

Results when unknown-word marks (stars) are not removed


Edit distance: 108
Word Error Rate (WER): 48.43 %
Number of position-independent correct words: 124
Position-independent word error rate (PER): 46.19 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -5
Percentage of unknown words that were free rides: -9.09 %


Tatar


Test file: 'kazantr.txt'
Reference file 'kazantur.txt'

Statistics about input files


Number of words in reference: 195
Number of words in test: 210
Number of unknown words (marked with a star) in test: 36
Percentage of unknown words: 17.14 %

Results when removing unknown-word marks (stars)


Edit distance: 103
Word error rate (WER): 52.82 %
Number of position-independent correct words: 112
Position-independent word error rate (PER): 50.26 %

Results when unknown-word marks (stars) are not removed


Edit distance: 102
Word Error Rate (WER): 52.31 %
Number of position-independent correct words: 113
Position-independent word error rate (PER): 49.74 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: -1
Percentage of unknown words that were free rides: -2.78 %


Uyghur


Test file: 'cumhuriyet-1.txt'
Reference file 'cumhuriyetturkce.txt'

Statistics about input files


Number of words in reference: 354
Number of words in test: 359
Number of unknown words (marked with a star) in test: 20
Percentage of unknown words: 5.57 %

Results when removing unknown-word marks (stars)


Edit distance: 61
Word error rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Results when unknown-word marks (stars) are not removed


Edit distance: 61
Word Error Rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Statistics about the translation of unknown words


Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
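
For readers unfamiliar with the metric, the edit distance in these reports is counted over words, and WER divides it by the reference length. The following is a minimal sketch of that computation, assuming whitespace tokenisation and a leading `*` as the unknown-word mark. The function names and the toy sentences are hypothetical, and the evaluation script that produced the reports may tokenise differently.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, strip_stars=True):
    """WER in percent; optionally remove unknown-word marks first."""
    ref = reference.split()
    hyp = hypothesis.split()
    if strip_stars:
        # Assumption: an unknown word is written with a leading star.
        hyp = [word.lstrip("*") for word in hyp]
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy example, not taken from the test files above
print(wer("bu bir kitap", "bu bir *kitap"))
print(wer("bu bir kitap", "bu bir *kitap", strip_stars=False))
```

With the mark stripped, the unknown word happens to match the reference. With the mark kept, it counts as one substitution.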