Uighur and Turkish/GSoC2018 report

Revision as of 18:10, 11 August 2018 by Oğuz

This project applied Apertium to develop a machine translation (MT) system between Uyghur and Turkish, two Turkic languages. The work consisted mainly of building a bilingual dictionary (bidix), writing transfer and disambiguation rules, and enriching the Uyghur morphological analyzer.


Commits

My commits can be found here.


Corpora and Coverage

Our main corpora were RFA, Tanritor, TRT Uyghurche and the Uyghur Wikipedia, but we also worked on some Uyghur blogs, a collection of Uyghur stories and the Uyghur translation of the Bible in order to cover different domains. Coverage on Wikipedia and the blogs was relatively lower because of non-standard forms and passages in Arabic and Farsi, including text in other alphabets.


Corpus Coverage
News xx.x%
Bible 94.1%
Wikipedia 88.2%
Blogs xx.x%
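
The coverage figures above are naive coverage: the share of tokens for which the analyzer returns at least one analysis. The following is a minimal sketch of that calculation, assuming the analyzer output is in the Apertium stream format, where an unknown word is emitted as ^word/*word$. The function name and the sample analyses are invented for illustration and are not actual output of the Uyghur analyzer.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    total = 0
    known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # An unknown word is emitted as ^word/*word$, so its analysis starts with '*'.
        if not match.group(2).startswith("*"):
            known += 1
    return known / total if total else 0.0


# Illustrative input only (hypothetical tags, one unknown token out of four).
sample = "^men/men<prn>$ ^kitab/kitab<n><nom>$ ^xyz/*xyz$ ^./.<sent>$"
print(f"{naive_coverage(sample):.1%}")  # prints 75.0%
```

Counting tokens rather than types means frequent words weigh more, which is usually what matters for a translator run over running text.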


Transfer

There are about 50 transfer rules, most of them needed to handle Uyghur's relatively richer tense inventory. We also needed transfer rules for expressions that can optionally be expressed synthetically in Uyghur but are analytic in Turkish. For example, Uyghur bolidighan and bolmaqchi both correspond to Turkish olacak, and Uyghur can mark comparison with the suffix -rAK, which Turkish expresses with the adverb daha.
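
As a rough illustration of the -rAK case, the sketch below rewrites a synthetic comparative as daha plus the plain adjective. It is not the project's rule: real Apertium transfer rules are written in an XML rule format, and the tag names <adj>, <comp> and <adv> used here are assumptions about the tag set. The input is assumed to be a lexical unit whose lemma has already been replaced by its Turkish equivalent from the bidix.

```python
import re

# Hypothetical tags: a comparative adjective is assumed to look like ^lemma<adj><comp>$.
COMPARATIVE = re.compile(r"\^([^<$]+)<adj><comp>\$")


def expand_comparative(stream: str) -> str:
    """Rewrite a synthetic comparative adjective as 'daha' + plain adjective."""
    # \1 is the adjective lemma; '^' and '$' are literal in the replacement string.
    return COMPARATIVE.sub(r"^daha<adv>$ ^\1<adj>$", stream)


print(expand_comparative("^büyük<adj><comp>$"))  # prints ^daha<adv>$ ^büyük<adj>$
```

Lexical units without the comparative tag pass through unchanged, so the function can be applied to a whole sentence.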

Disambiguation

Many word forms receive several analyses. To translate a form correctly into the target language, the system must identify the right lemma and morphological analysis, so MT systems include a disambiguation component. Disambiguation in this system is currently carried out with Constraint Grammar (CG): 68 rules remove wrong analyses and select the correct ones using contextual morphological information. Ideally, CG would be used together with, or replaced by, a machine-learned POS tagger, which requires a tagged corpus. The tagged corpus will be developed in the near future.
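
To show the idea behind a contextual removal rule, here is a toy sketch in Python. It is not CG syntax and not one of the 68 rules; the Cohort class, the rule and the readings for bu at are invented for illustration. The rule drops verb readings of a word that directly follows an unambiguous determiner, and never deletes the last remaining reading.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    surface: str
    readings: list  # each reading is [lemma, tag, tag, ...]


def remove_verb_after_determiner(cohorts):
    """Toy rule: remove verb readings right after an unambiguous determiner."""
    for prev, cur in zip(cohorts, cohorts[1:]):
        # The context must be unambiguous: every reading of the previous word is a determiner.
        prev_is_det = bool(prev.readings) and all("det" in r[1:] for r in prev.readings)
        if not prev_is_det:
            continue
        kept = [r for r in cur.readings if "v" not in r[1:]]
        if kept:  # keep at least one reading per word
            cur.readings = kept
    return cohorts


# Illustrative readings only, not actual analyzer output.
sentence = [
    Cohort("bu", [["bu", "det", "dem"]]),
    Cohort("at", [["at", "n", "nom"], ["at", "v", "tv", "imp"]]),
]
for cohort in remove_verb_after_determiner(sentence):
    print(cohort.surface, cohort.readings)
```

Requiring an unambiguous context and refusing to remove the last reading keep the rule conservative: it may leave ambiguity in place, but it cannot leave a word without an analysis.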

Lexical Selection

Lexical selection is used when the system needs to choose among multiple possible translations. The lexical selection component uses rules to choose which translation to prefer based on contextual information.
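
The sketch below illustrates rule-based lexical selection under simplifying assumptions: context is reduced to the set of lemmas in the sentence, and the first candidate in the dictionary is the default. The function, the rule and the example (Uyghur at translated as Turkish ad or at depending on whether the verb min- appears) are invented for illustration and are not taken from the project's rule file.

```python
def select_translation(candidates, context_lemmas, rules):
    """Return the first candidate licensed by a context rule, else the default (first) candidate."""
    for trigger_lemma, preferred in rules:
        if trigger_lemma in context_lemmas and preferred in candidates:
            return preferred
    return candidates[0]


# Hypothetical data: two Turkish candidates for one source word, and one context rule.
candidates = ["ad", "at"]
rules = [("min", "at")]  # if the lemma "min" occurs in the sentence, prefer "at"

print(select_translation(candidates, {"men", "min"}, rules))  # prints at
print(select_translation(candidates, {"men", "yaz"}, rules))  # prints ad
```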


Sources

For vocabulary, I used the online Yulghun dictionary and E. N. Necip's Uyghurche-Türkche Lughet. For grammar reference, I used Rıdvan Öztürk's Yeni Uygur Türkçesi Grameri.


Future Plans

For more satisfactory translation and analysis, more disambiguation and lexical selection (lexsel) rules must be added. Morphological analysis can be extended to cover the wide range of non-standard forms found in modern Uyghur. With some work on coverage and transfer rules, translation in the Turkish-to-Uyghur direction (Tur->Uig) can also be made possible.