Crimean Tatar and Turkish/GSoC Report
The following is the submission report for the Google Summer of Code 2017 project, RBMT for Crimean Tatar and Turkish.
My commits can be accessed at the following link: [1]. The pair is in the trunk folder in SVN [2] and can be checked out using the following commands:
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-crh/
svn co https://svn.code.sf.net/p/apertium/svn/languages/apertium-tur/
svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-crh-tur/
Summary
Apertium particularly shines when used for languages with similar grammatical structures, and the Romance and Turkic languages have been a very active area for language pair developers. Turkish and Crimean Tatar, though from different branches of the Turkic family (Oghuz and Kipchak respectively), have many similarities in phonetics, morphology and even syntax, mostly due to Ottoman influence on the Crimean Tatar language.
Coverage
The lack of a Turkish-Crimean Tatar dictionary was one obstacle for the project. We used resources such as Wiktionary and crossed a Russian-Qırımtatar dictionary [3] with a Russian-Turkish one to create an initial bilingual dictionary. After that, more words were added through their Turkish cognates, corpora were examined to determine the meanings of unknown words, and Persian, Arabic and Russian vocabulary was used to good effect to reach high coverage on all the corpora.
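The dictionary-crossing step described above can be illustrated with a minimal sketch. It assumes both source dictionaries have already been loaded as mappings from a Russian headword to a list of translations. The function name `cross_dictionaries` and the toy entries are hypothetical and are not taken from the project's actual tooling. Crossing through a pivot language over-generates, because an ambiguous Russian word links unrelated senses, so the output should be treated as candidate pairs to be checked by hand.

```python
from collections import defaultdict


def cross_dictionaries(rus_crh, rus_tur):
    """Pair each Crimean Tatar word with every Turkish word that shares
    a Russian headword (hypothetical helper, for illustration only)."""
    pairs = defaultdict(set)
    for rus_word, crh_words in rus_crh.items():
        # Russian headwords missing from the second dictionary yield nothing.
        tur_words = rus_tur.get(rus_word, ())
        for crh_word in crh_words:
            pairs[crh_word].update(tur_words)
    # Keep only Crimean Tatar words that received at least one candidate.
    return {crh: sorted(tur) for crh, tur in pairs.items() if tur}


# Toy data, invented for this example.
rus_crh = {"вода": ["suv"], "дом": ["ev"], "книга": ["kitap"]}
rus_tur = {"вода": ["su"], "дом": ["ev", "konut"], "хлеб": ["ekmek"]}

candidates = cross_dictionaries(rus_crh, rus_tur)
print(candidates)  # {'suv': ['su'], 'ev': ['ev', 'konut']}
```

Here "kitap" gets no candidate because "книга" is absent from the toy Russian-Turkish dictionary, which is the kind of gap that the later cognate and corpus work would fill.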
Corpus | Coverage |
---|---|
Krymr 2014 | 92.3% |
Krymr 2015 | 93.3% |
Wikipedia | 90.1% |
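For readers unfamiliar with the metric, a naive coverage figure is the share of corpus tokens that receive at least one analysis. The sketch below is one way such a number could be computed from analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word carries an analysis beginning with `*`. It is an assumption-laden illustration, not necessarily the script used to produce the table above. The function name `coverage` and the sample string are hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the fraction of tokens with at least one analysis
    (simplified sketch; escaped characters are not handled)."""
    total = 0
    known = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown words are marked with a leading asterisk.
        if not analyses.startswith("*"):
            known += 1
    return known / total if total else 0.0


# Invented sample: three known tokens and one unknown token.
sample = "^ev/ev<n><nom>$ ^qwerty/*qwerty$ ^kitap/kitap<n><nom>$ ^suv/suv<n><nom>$"
print(f"{coverage(sample):.1%}")  # 75.0%
```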
Analyzers for both Crimean Tatar and Turkish were available in Apertium. Any entry in the bilingual dictionary (bidix) missing from either analyzer was added to the analyzers as well.