Difference between revisions of "Turkic MT Improvements GSoC2019 report"

Revision as of 14:25, 26 August 2019

The aim of this project was to improve the following Apertium language pairs: tur->uig, uzb->tur, kir->tur, tat->tur.

Commits

My commits can be found here.

Transfer

Transfer rules were written for tur->uig and uzb->tur, using regression tests. They can be found here: Uighur and Uzbek.

Corpora and Coverage

Pair	Wiki words	Wiki coverage	Bible words	Bible coverage
Tur-Uig	53505239	82.3%	178233	93.0%
Uzb-Tur	12730161	80.8%	184447	83.5%
Kir-Tur	11435418	82.5%	184808	92.0%
Tat-Tur	5792382	86.4%	178220	91.4%
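
The report does not say how these coverage figures were computed. A common definition is the share of tokens that receive at least one analysis from the morphological analyser. The following is a minimal Python sketch under that assumption: it reads text in the Apertium stream format, where an unknown token appears as ^word/*word$. The function name, the sample string and the tags in it are illustrative only, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: no escaped '^', '/' or '$' inside the units.
LEXICAL_UNIT = re.compile(r"\^([^/$]+)/([^$]*)\$")


def coverage(analysed: str) -> float:
    """Return the percentage of tokens that have at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed):
        total += 1
        # The analyser marks unknown words with a leading '*'.
        if not match.group(2).startswith("*"):
            known += 1
    return 100.0 * known / total if total else 0.0


# Hypothetical analyser output: two known tokens and one unknown token.
sample = "^ev/ev<n><nom>$ ^xyzzy/*xyzzy$ ^geldi/gel<v><iv><past><p3><sg>$"
print(f"{coverage(sample):.1f}% coverage")
```

Run over a whole analysed corpus, the same count gives a single percentage per corpus, comparable in kind to the values in the table.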

Dictionaries

All dictionaries were improved in the first stage of the project; the mentors helped with the Kipchak languages.

Disambiguation

Apertium uses Constraint Grammar (CG) to select the correct lemma and morphological analysis of each word in context, so that it can be translated correctly into the target language. As part of the project, CG rules were added where necessary. Uzbek and Turkish in particular received extensive attention in this regard.

WER

---Uzbek---
Test file: 'mattauzbtr.txt'
Reference file 'mattaturk.txt'

Statistics about input files

Number of words in reference: 565
Number of words in test: 579
Number of unknown words (marked with a star) in test: 124
Percentage of unknown words: 21.42 %

Results when removing unknown-word marks (stars)

Edit distance: 177
Word error rate (WER): 31.33 %
Number of position-independent correct words: 408
Position-independent word error rate (PER): 30.27 %

Results when unknown-word marks (stars) are not removed

Edit distance: 188
Word Error Rate (WER): 33.27 %
Number of position-independent correct words: 397
Position-independent word error rate (PER): 32.21 %

Statistics about the translation of unknown words

Number of unknown words which were free rides: 11
Percentage of unknown words that were free rides: 8.87 %

---Kirgiz---
Test file: 'mattakirtr.txt'
Reference file 'mattaturkkir.txt'

Statistics about input files

Number of words in reference: 569
Number of words in test: 669
Number of unknown words (marked with a star) in test: 63
Percentage of unknown words: 9.42 %

Results when removing unknown-word marks (stars)

Edit distance: 389
Word error rate (WER): 68.37 %
Number of position-independent correct words: 286
Position-independent word error rate (PER): 67.31 %

Results when unknown-word marks (stars) are not removed

Edit distance: 389
Word Error Rate (WER): 68.37 %
Number of position-independent correct words: 286
Position-independent word error rate (PER): 67.31 %

Statistics about the translation of unknown words

Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %

---Tatar---
Test file: 'mattatattr.txt'
Reference file 'mattaturktatar.txt'

Statistics about input files

Number of words in reference: 573
Number of words in test: 587
Number of unknown words (marked with a star) in test: 66
Percentage of unknown words: 11.24 %

Results when removing unknown-word marks (stars)

Edit distance: 218
Word error rate (WER): 38.05 %
Number of position-independent correct words: 375
Position-independent word error rate (PER): 37.00 %

Results when unknown-word marks (stars) are not removed

Edit distance: 218
Word Error Rate (WER): 38.05 %
Number of position-independent correct words: 375
Position-independent word error rate (PER): 37.00 %

Statistics about the translation of unknown words

Number of unknown words which were free rides: 0
Percentage of unknown words that were free rides: 0.00 %
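
The WER and PER values above can be checked against the reported counts. The following is a minimal Python sketch, not the evaluation script that produced the output; the function names are illustrative. WER is taken as the word-level edit distance divided by the number of reference words. The PER formula is an assumption inferred from the reported numbers: for Uzbek, (579 - 408) / 565 = 30.27 %.

```python
from collections import Counter


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(distance: int, n_ref: int) -> float:
    """Word error rate in percent: edit distance over reference length."""
    return 100.0 * distance / n_ref


def position_independent_correct(ref: list, hyp: list) -> int:
    """Number of words shared by both sides, ignoring word order."""
    return sum((Counter(ref) & Counter(hyp)).values())


def per(n_ref: int, n_test: int, n_correct: int) -> float:
    """Position-independent error rate in percent.

    Assumption: this formula is inferred from the figures in the report.
    """
    return 100.0 * (max(n_ref, n_test) - n_correct) / n_ref


# Counts reported for Uzbek with unknown-word marks removed.
print(f"WER {wer(177, 565):.2f} %, PER {per(565, 579, 408):.2f} %")
```

With the Kirgiz and Tatar counts the same two functions reproduce the reported 68.37 % / 67.31 % and 38.05 % / 37.00 %.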

Future Plans

The Uzbek lexicon still needs to be improved. Analysis of Uzbek can be problematic because of the language's unusual alphabet and its non-standard forms, so the analysis also needs further work. More lexical selection, disambiguation and transfer rules are needed to achieve better translation quality on all pairs.
