Crimean Tatar and Turkish/GSoC Report

The following is the submission report for the Google Summer of Code 2017 project, RBMT for Crimean Tatar and Turkish.

My commits can be accessed at the following link: [1]. The pair is in the trunk folder in SVN [2] and can be checked out, together with its prerequisites, using the following commands:

svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-crh/
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-tur/
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-crh-tur/

Summary

Apertium works particularly well for languages with similar grammatical structures, and the Romance and Turkic families have been very active areas for language-pair developers. Turkish and Crimean Tatar belong to different branches of the Turkic family (Oghuz and Kipchak respectively), but they share many features of phonetics, morphology and even syntax, mostly due to Ottoman influence on the Crimean Tatar language.

Coverage

By coverage we mean the share of the input text that the system recognizes and attempts to analyze and translate into the target language. It is an important metric, and it depends on the necessary words and morphology being present in the dictionaries. The system required the development of a Crimean Tatar-Turkish lexicon, and the lack of an existing Turkish-Crimean Tatar dictionary was one obstacle for the project. We used resources such as Wiktionary and crossed a Russian-Qırımtatar dictionary [3] with a Russian-Turkish one to create an initial bilingual dictionary. After that, more words were added by identifying cognates with Turkish, corpora were examined to determine the meanings of unknown words, and Persian, Arabic and Russian vocabulary was used to good effect, reaching high coverage on all the corpora.

Corpus	Coverage
Krymr 2014	92.3%
Krymr 2015	93.3%
Wikipedia	90.1%
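
As an illustration of the metric only, naive coverage can be sketched as the fraction of tokens for which the analyzer returns at least one analysis. The sketch below assumes a plain set of known surface forms stands in for the morphological analyzer; the function name and the toy data are hypothetical and are not part of the project's tooling.

```python
def coverage(tokens, known_forms):
    """Return the fraction of tokens that are recognised.

    `known_forms` is a stand-in for a real morphological analyser:
    a token counts as covered if its lowercased form is in the set.
    """
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok.lower() in known_forms)
    return known / len(tokens)


# Hypothetical toy data; "blorp" plays the role of an unknown word.
known_forms = {"men", "kitap", "ev"}
tokens = "Men kitap ev blorp".split()
print(f"{coverage(tokens, known_forms):.1%}")  # prints 75.0%
```

In practice the figures in the table would come from running the actual analyzer over each corpus rather than from a word list.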

Analyzers for both Crimean Tatar and Turkish were available in Apertium. Any entry in the bilingual dictionary (bidix) missing from either analyzer was added to the analyzers as well.
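
The dictionary-crossing step described above can be sketched as a pivot through the shared Russian headwords. This is a minimal sketch under assumptions: both dictionaries are taken to be simple mappings from a Russian word to a list of translations, the function name is hypothetical, and the output is only a list of candidate pairs that would still need manual checking before going into the bidix.

```python
from collections import defaultdict


def cross_dictionaries(rus_to_crh, rus_to_tur):
    """Pair Crimean Tatar and Turkish words that share a Russian headword."""
    candidates = defaultdict(set)
    for rus_word, crh_words in rus_to_crh.items():
        # Russian headwords missing from the second dictionary yield nothing.
        tur_words = rus_to_tur.get(rus_word, ())
        for crh in crh_words:
            candidates[crh].update(tur_words)
    # Drop Crimean Tatar words for which no Turkish candidate was found.
    return {crh: sorted(tur) for crh, tur in candidates.items() if tur}


# Toy input data for illustration only.
rus_to_crh = {"книга": ["kitap"], "дом": ["ev"], "вода": ["suv"]}
rus_to_tur = {"книга": ["kitap"], "вода": ["su"]}
print(cross_dictionaries(rus_to_crh, rus_to_tur))
# {'kitap': ['kitap'], 'suv': ['su']}
```

Pivoting through a third language tends to over-generate when the pivot word is polysemous, which is one reason the candidates are treated as a starting point only.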

Transfer

There are 51 structural transfer rules that take Crimean Tatar constructions and turn them into their Turkish equivalents. Many of these rules cover constructions that are analytic in Qırımtatar and synthetic in Turkish; the simplest examples are forms like yapa bile, "he/she/it can do it", which translates to yapabilir. Copulae like edi, eken and ekende are also often written together with the verb in Turkish, unlike their Crimean Tatar counterparts.
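
To make the analytic-to-synthetic idea concrete, here is a toy sketch of what such a rule does to a sequence of analyzed units. It is not how Apertium transfer rules are written (those are declared in the pair's transfer files), and the tag names (`v`, `vaux`, `gna`, `abil`, `aor`, `p3`, `sg`), the auxiliary lemma `bil` and the function name are all assumptions made for illustration.

```python
def merge_ability(units):
    """Merge 'main verb + auxiliary bil-' into one verb with an ability tag.

    Each unit is a (lemma, tags) pair. The merged unit keeps the lemma of
    the main verb and takes its inflection from the auxiliary.
    """
    out = []
    i = 0
    while i < len(units):
        lemma, tags = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        is_verb = tags[:1] == ["v"]
        is_aux = nxt is not None and nxt[0] == "bil" and nxt[1][:1] == ["vaux"]
        if is_verb and is_aux:
            out.append((lemma, ["v", "abil"] + nxt[1][1:]))
            i += 2  # both source units were consumed
        else:
            out.append((lemma, tags))
            i += 1
    return out


# Hypothetical analysis of "yapa bile" as two units.
source = [("yap", ["v", "gna"]), ("bil", ["vaux", "aor", "p3", "sg"])]
print(merge_ability(source))
# [('yap', ['v', 'abil', 'aor', 'p3', 'sg'])]
```

A Turkish generator given the single merged unit would then be expected to produce one word, as in yapabilir.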

Disambiguation and Lexical Selection

Disambiguation

Many word forms receive several different analyses. To identify the correct lemma and morphology, so that the form is translated correctly into the target language, MT systems have disambiguation components. Disambiguation in this system is currently carried out using Constraint Grammar (CG): 68 rules remove the wrong analyses and select the correct ones using contextual morphological information. Ideally this would be combined with, or replaced by, a machine-learned POS tagger, which requires a tagged corpus. The tagged corpus will be developed in the near future.
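
The effect of a single removal rule can be sketched in a few lines. This is a simplified imitation of the idea, not CG syntax and not one of the 68 rules: the rule, the tag names and the function name are hypothetical. Like CG, the sketch never removes the last remaining reading of a word.

```python
def remove_rule(sentence, target_tag, prev_tag):
    """Remove readings carrying `target_tag` when the previous word has a
    reading carrying `prev_tag`.

    `sentence` is a list of (surface_form, readings) pairs, where each
    reading is a (lemma, set_of_tags) pair.
    """
    result = []
    for i, (form, readings) in enumerate(sentence):
        prev_readings = sentence[i - 1][1] if i > 0 else []
        context_matches = any(prev_tag in tags for _, tags in prev_readings)
        if context_matches:
            kept = [r for r in readings if target_tag not in r[1]]
            if kept:  # never delete the only remaining reading
                readings = kept
        result.append((form, readings))
    return result


# "at" is ambiguous between a noun ("horse") and a verb ("throw").
sentence = [
    ("bir", [("bir", {"det"})]),
    ("at", [("at", {"n"}), ("at", {"v", "imp"})]),
]
# Hypothetical rule: drop verb readings directly after a determiner.
print(remove_rule(sentence, target_tag="v", prev_tag="det")[1])
# ('at', [('at', {'n'})])
```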

Lexical Selection

Lexical selection is used when the system needs to choose among multiple possible translations. The lexical selection component uses rules to choose which translation to prefer based on contextual information.
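
A rule of this kind can be sketched as a lookup with a contextual override. The sketch assumes a rule is a triple of source lemma, trigger lemma in the context, and preferred translation, and that the first listed translation is the default. The lemmas are placeholders rather than real dictionary entries, and the function name is hypothetical.

```python
def select_translation(lemma, context_lemmas, translations, rules):
    """Pick one translation of `lemma` using simple contextual rules.

    `translations` maps a source lemma to its possible translations.
    `rules` is a list of (source_lemma, trigger_lemma, choice) triples:
    if the trigger occurs in the context, `choice` is preferred.
    """
    options = translations[lemma]
    for rule_lemma, trigger, choice in rules:
        if rule_lemma == lemma and trigger in context_lemmas and choice in options:
            return choice
    return options[0]  # default: first listed translation


# Placeholder entries, for illustration only.
translations = {"src": ["default_tr", "special_tr"]}
rules = [("src", "trigger", "special_tr")]

print(select_translation("src", {"other"}, translations, rules))    # default_tr
print(select_translation("src", {"trigger"}, translations, rules))  # special_tr
```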
