Difference between revisions of "Uighur and Turkish/GSoC2018 report"

Revision as of 21:30, 11 August 2018

This project applied Apertium to build a machine translation (MT) system between Uyghur and Turkish, two Turkic languages. The work consisted mainly of building a bilingual dictionary (bidix), writing transfer and disambiguation rules, and enriching the Uyghur morphological analyzer.

Commits

My commits can be found here.

Corpora and Coverage

Our main corpora consisted of RFA, Tanritor, TRT Uyghurche and Uyghur Wikipedia, but we also worked on some Uyghur blogs, a collection of Uyghur stories and the Uyghur translation of the Bible in order to cover different domains. Coverage of Wikipedia and the blogs was relatively low because of nonstandard forms, Arabic and Farsi text, and differing alphabets.

Corpus	Words	Coverage
News	3447048	94.0%
Bible	1527061	94.1%
Wikipedia	1589113	88.2%
Blogs	4055981	87.0%
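The figures above can be read as naive coverage: the share of tokens for which the analyzer returns at least one analysis. The sketch below is illustrative only and is not the script used for this report. It assumes the analyzer output is in the Apertium stream format, where each token appears as `^surface/analyses$` and an unknown token has an analysis beginning with `*`. The function names, the sample string and its tags are made up for the example. The second function pools the table's per-corpus figures into one word-weighted average.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (escaped characters inside units are ignored in this simplified sketch).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown tokens are marked with a leading '*' in the analysis part.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


# Word counts and coverage percentages copied from the table above.
CORPORA = {
    "News": (3447048, 94.0),
    "Bible": (1527061, 94.1),
    "Wikipedia": (1589113, 88.2),
    "Blogs": (4055981, 87.0),
}


def pooled_coverage(corpora):
    """Word-weighted average coverage over all corpora, as a fraction."""
    total_words = sum(words for words, _pct in corpora.values())
    covered_words = sum(words * pct / 100 for words, pct in corpora.values())
    return covered_words / total_words


if __name__ == "__main__":
    # Hypothetical analyzer output: one known token, one unknown token.
    sample = "^kitab/kitab<n><nom>$ ^xyz/*xyz$"
    print(f"sample coverage: {naive_coverage(sample):.1%}")
    print(f"pooled coverage: {pooled_coverage(CORPORA):.1%}")
```

Because the pooled figure is weighted by corpus size, the large blog corpus pulls it below the News and Bible figures.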

Transfer

There are about 50 transfer rules, most of them needed to cover Uyghur's richer tense inventory. We also needed transfer rules for expressions that are optionally expressed synthetically in Uyghur but analytically in Turkish. To give some examples:

bolidighan and bolmaqchi in Uyghur are both equivalents of Turkish olacak.

Uyghur can use the suffix -rAK for comparisons, which Turkish expresses with the adverb daha.

Disambiguation

Apertium uses Constraint Grammar (CG) to select the correct lemma and morphological analysis of each ambiguous word, so that the word is translated correctly into the target language. Uyghur currently has about 45 CG rules for disambiguation.

Lexical Selection

Lexical selection determines which translation of a given lemma is chosen in which context. The uig-tur pair currently has 35 lexical selection (lexsel) rules.

Sources

For vocabulary, I used the online Yulghun dictionary and the Uyghurche-Türkche Lughet by E. N. Necip. For grammar reference, I used Rıdvan Öztürk's Yeni Uygur Türkçesi Grameri.

Future Plans

More disambiguation and lexsel rules must be added to make translation and analysis more satisfactory. Morphological analysis can be extended to cover the wide range of non-standard forms found in modern Uyghur. With some work on coverage and transfer rules, Tur->Uig translation can be made possible.

