Difference between revisions of "User:Oğuz/GSoC 2019 progress"
Latest revision as of 16:05, 12 July 2019
Progress on the 2019 GSoC project Turkic MT Improvements (http://wiki.apertium.org/wiki/User:Oğuz/GSoC_2019).
Week | uig Cov. | uig WER | uig BLEU | uzb Cov. | uzb WER | uzb BLEU | tat Cov. | tat WER | tat BLEU | kir Cov. | kir WER | kir BLEU | On Track? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
July 8th-14th | | | | | | | | | | | | | |
First Evaluation
Coverages
Language pair | Wiki | Bible |
---|---|---|
Tur-Uig | 53505239 words, 82.3% cov | 178233 words, 93.0% cov |
Uzb-Tur | 12730161 words, 80.8% cov | 184447 words, 81.1% cov |
Kir-Tur | 11435418 words, 82.8% cov | 184808 words, 93.4% cov |
Tat-Tur | -- | 178220 words, 91.4% cov |
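
The page does not say how the coverage figures were measured. As a rough illustration only, the sketch below treats coverage as the share of tokens that are not flagged as unknown, assuming unknown words carry a leading star as in the WER output further down. The function name `naive_coverage` and the sample sentence are hypothetical.

```python
def naive_coverage(tokens):
    """Return the percentage of tokens that are not marked unknown.

    Assumption: an unknown word is prefixed with '*', mirroring the
    star convention mentioned in the WER statistics on this page.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if not token.startswith("*"))
    return 100.0 * known / len(tokens)


# Hypothetical sample: 6 tokens, 2 of them starred as unknown.
sample = "bu *foo bir test *bar cümlesi".split()
print(f"{len(sample)} words, {naive_coverage(sample):.1f}% cov")
```

The printed line follows the "words, % cov" layout used in the table above.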
WER results
1st evaluation WER results:
Uzbek
Test file: 'istanbultr.txt'
Reference file 'turistanbul.txt'

Statistics about input files
Number of words in reference: 206
Number of words in test: 208
Number of unknown words (marked with a star) in test: 28
Percentage of unknown words: 13.46 %

Results when removing unknown-word marks (stars)
Edit distance: 78
Word error rate (WER): 37.86 %
Number of position-independent correct words: 132
Position-independent word error rate (PER): 36.89 %

Results when unknown-word marks (stars) are not removed
Edit distance: 76
Word Error Rate (WER): 36.89 %
Number of position-independent correct words: 134
Position-independent word error rate (PER): 35.92 %

Statistics about the translation of unknown words
Number of unknown words which were free rides: -2
Percentage of unknown words that were free rides: -7.14 %

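
The percentages in the Uzbek report can be reproduced from its raw counts. The sketch below is a plain arithmetic check, not the evaluation tool's code: the relations it uses (WER as edit distance over reference length, PER as test words minus position-independent correct words over reference length, free rides as the difference between the two edit distances) are inferred from the numbers on this page and may not be the tool's exact definitions. The function name `derive_rates` is hypothetical.

```python
def derive_rates(ref_words, test_words, unknown,
                 dist_stars_removed, dist_stars_kept, correct_stars_removed):
    """Recompute the reported percentages from the raw counts.

    Assumption: these formulas are inferred from the figures on the page
    (all four test files are longer than their references).
    """
    free_rides = dist_stars_kept - dist_stars_removed
    return {
        "unknown_pct": round(100 * unknown / test_words, 2),
        "wer": round(100 * dist_stars_removed / ref_words, 2),
        "per": round(100 * (test_words - correct_stars_removed) / ref_words, 2),
        "free_rides": free_rides,
        "free_rides_pct": round(100 * free_rides / unknown, 2),
    }


# Counts taken from the Uzbek report.
uzbek = derive_rates(ref_words=206, test_words=208, unknown=28,
                     dist_stars_removed=78, dist_stars_kept=76,
                     correct_stars_removed=132)
print(uzbek)
```

The same relations also reproduce the Kyrgyz, Tatar and Uyghur percentages reported on this page.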
Kyrgyz
Test file: 'kazantr.txt'
Reference file 'kazanturkce.txt'

Statistics about input files
Number of words in reference: 223
Number of words in test: 227
Number of unknown words (marked with a star) in test: 55
Percentage of unknown words: 24.23 %

Results when removing unknown-word marks (stars)
Edit distance: 113
Word error rate (WER): 50.67 %
Number of position-independent correct words: 119
Position-independent word error rate (PER): 48.43 %

Results when unknown-word marks (stars) are not removed
Edit distance: 108
Word Error Rate (WER): 48.43 %
Number of position-independent correct words: 124
Position-independent word error rate (PER): 46.19 %

Statistics about the translation of unknown words
Number of unknown words which were free rides: -5
Percentage of unknown words that were free rides: -9.09 %

Tatar
Test file: 'kazantr.txt'
Reference file 'kazantur.txt'

Statistics about input files
Number of words in reference: 195
Number of words in test: 210
Number of unknown words (marked with a star) in test: 36
Percentage of unknown words: 17.14 %

Results when removing unknown-word marks (stars)
Edit distance: 103
Word error rate (WER): 52.82 %
Number of position-independent correct words: 112
Position-independent word error rate (PER): 50.26 %

Results when unknown-word marks (stars) are not removed
Edit distance: 102
Word Error Rate (WER): 52.31 %
Number of position-independent correct words: 113
Position-independent word error rate (PER): 49.74 %

Statistics about the translation of unknown words
Number of unknown words which were free rides: -1
Percentage of unknown words that were free rides: -2.78 %

Uyghur
Test file: 'cumhuriyet-1.txt'
Reference file 'cumhuriyetturkce.txt'

Statistics about input files
Number of words in reference: 354
Number of words in test: 359
Number of unknown words (marked with a star) in test: 20
Percentage of unknown words: 5.57 %

Results when removing unknown-word marks (stars)
Edit distance: 61
Word error rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Results when unknown-word marks (stars) are not removed
Edit distance: 61
Word Error Rate (WER): 17.23 %
Number of position-independent correct words: 299
Position-independent word error rate (PER): 16.95 %

Statistics about the translation of unknown words
Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
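
For readers unfamiliar with the metric, WER is a word-level edit distance divided by the number of reference words. The following is a minimal sketch of that calculation, assuming whitespace tokenisation and a leading star on unknown words; it is an illustration and not the script that produced the reports above. The names `edit_distance` and `wer` and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, test, strip_stars=True):
    """WER in percent; optionally drop the '*' unknown-word marks first."""
    ref = reference.split()
    hyp = test.split()
    if strip_stars:
        hyp = [word.lstrip("*") for word in hyp]
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical sentences: one substitution and one insertion over 4 words.
print(f"{wer('a b c d', 'a x c d e'):.2f} %")
```

Calling `wer` with `strip_stars=False` keeps the stars, so a starred word only matches the reference if the reference contains the same starred form; this corresponds to the two result sections shown in each report.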