Talk:Turkic languages

From Apertium


  • attributive attr = things that act like adjectives
  • predicative pred
  • substantive subst = things that act like nouns
  • adverbial advl = things that act like adverbs (??)


  • noun (default 'subst') + DECL-NOUN
    noun->adj = n.attr + DECL-ADJ !! No <comp>arison levels though
  • adj (default 'attr') + DECL-ADJ
    adj->noun = adj.subst + DECL-NOUN
  • num (default 'attr') + DECL-NUM
    num->noun = num.subst + DECL-NOUN
  • prn (default 'subst') + DECL-NOUN
  • det (default 'attr') + NO-DECL
  • v
    v->noun = v.ger + DECL-NOUN
    v->adj = v.glp + DECL-ADJ
    v->adv = v.prc + DECL-ADV
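
The defaults and conversions in the list above can be written down as a small lookup table. The following is only a sketch in Python: the names `POS_DEFAULTS`, `CONVERSIONS` and `tags_for` are invented for illustration and are not part of any Apertium tool; the tag and declension-class names are copied from the list. Verbs have no default function in the list, so they only appear as a source of conversions.

```python
# Hypothetical lookup tables transcribing the list above; not Apertium code.
# Each part of speech maps to (default syntactic function, declension class).
POS_DEFAULTS = {
    "n":   ("subst", "DECL-NOUN"),
    "adj": ("attr",  "DECL-ADJ"),
    "num": ("attr",  "DECL-NUM"),
    "prn": ("subst", "DECL-NOUN"),
    "det": ("attr",  "NO-DECL"),
}

# Conversions: (source POS, target use) -> (function tag, declension class).
CONVERSIONS = {
    ("n", "adj"): ("attr",  "DECL-ADJ"),
    ("adj", "n"): ("subst", "DECL-NOUN"),
    ("num", "n"): ("subst", "DECL-NOUN"),
    ("v", "n"):   ("ger",   "DECL-NOUN"),
    ("v", "adj"): ("glp",   "DECL-ADJ"),
    ("v", "adv"): ("prc",   "DECL-ADV"),
}

def tags_for(pos, target=None):
    """Return (tag string, declension class) for a stem of category `pos`.

    Without `target`, the default function stays implicit and only the
    POS tag is emitted; with `target`, the function tag is made explicit.
    """
    if target is None or target == pos:
        _, decl = POS_DEFAULTS[pos]
        return f"<{pos}>", decl
    function, decl = CONVERSIONS[(pos, target)]
    return f"<{pos}><{function}>", decl
```

For instance, `tags_for("adj", "n")` yields the `<adj><subst>` pattern used for güzel further down this page.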

Types of non-finite verbal forms:

  • Adverbial participle: <gnL> (e.g. <gnc> "Conditional adverbial participle")
  • Verbal adjectives: <gpL> (e.g. <gpi> "Imperfect verbal adjective")
  • Gerunds: <gerN> (e.g. <ger1> "Past/present gerund")
  • Participles: <prcN> (e.g. <prc1> "Realis participle")

What about 'cop' and 'pred'

  • The copula is i- (p.79)
    • -(y) (pres)
    • -(y)DI (past)
    • -(y)mIş (evid)
    • -(y)sA (cond)
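
To make the archiphoneme notation above concrete, here is a rough Python sketch that attaches the past, evidential and conditional copula to a Turkish stem. It assumes the usual reading of the notation ((y) is a buffer consonant after vowels, D is d/t, I is a high vowel, A is a low vowel); the function name `attach_copula` is made up, the present form is left out because the list gives only its buffer consonant, and input is assumed to be lowercase with at least one vowel.

```python
# Sketch only: realises the archiphonemes in the copula suffixes listed above
# using standard Turkish vowel harmony and consonant assimilation.
# Assumes lowercase input that contains at least one vowel.
VOWELS = "aeıioöuü"
VOICELESS = "çfhkpsşt"

COPULA = {"past": "(y)DI", "evid": "(y)mIş", "cond": "(y)sA"}

# Realisation of I (four-way harmony), keyed by the stem's last vowel.
HIGH = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
        "o": "u", "u": "u", "ö": "ü", "ü": "ü"}
# Realisation of A (two-way harmony), keyed by the stem's last vowel.
LOW = {"a": "a", "ı": "a", "o": "a", "u": "a",
       "e": "e", "i": "e", "ö": "e", "ü": "e"}

def attach_copula(stem, tense):
    """Return `stem` followed by the harmonised copula suffix for `tense`."""
    suffix = COPULA[tense]
    last_vowel = [c for c in stem if c in VOWELS][-1]
    # The buffer (y) surfaces only after a vowel-final stem.
    suffix = suffix.replace("(y)", "y" if stem[-1] in VOWELS else "")
    # D is t after a voiceless consonant, d elsewhere.
    suffix = suffix.replace("D", "t" if stem[-1] in VOICELESS else "d")
    suffix = suffix.replace("I", HIGH[last_vowel]).replace("A", LOW[last_vowel])
    return stem + suffix
```

A real analyser would handle this in the morphophonology (e.g. twol rules) rather than in Python; the sketch only spells out what the capital letters stand for.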

Notes on resolving morphological ambiguity

Arguments against just having a different tag:

 e.g. güzel<n>/güzel<adj>
  • We lose the tag denoting the principal function of the stem
  • We can't tell the CG to choose the principal function
  • We can't tell the difference between `real' N/A ambiguity and "derivation" ambiguity

Arguments against just piling one tag on top of another:

 e.g. güzel<adj>/güzel<adj><n>
  • Having two POS in a word makes things confusing
  • Having two POS tags in a word makes it difficult to write CG rules

Arguments against having a "zero derivation":

 e.g. güzel<adj>/güzel<adj><D_n><n>
  • It's ugly and stupid
  • Having two POS tags in a word makes it difficult to write CG rules


güzel            'beautiful'          güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
güzelim          'my beauty'          güzel<adj><subst><px1sg>
güzel konuştu    'she spoke well'     güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
güzel bir köpek  'a beautiful dog'    güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
küçük            'small'              küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
küçük kızlar     'little girls'       küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
küçükler         'little one(s)'      küçük<adj><subst><pl>/küçük<n>+i<cop><pres><p3><pl>
kötü             'bad'                kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
kötü araba       '(a) bad car'        kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
kötü yüzmek      'to swim badly'      kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
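
One way to see what the proposed tagging buys us is to parse the analyses in the table and read off the function of each reading. The Python below is a sketch under assumptions: readings are separated by `/`, sub-readings by `+`, and the first tag is the POS; the helper names and the small `POS_DEFAULT_FUNCTION` table are hypothetical and only cover the categories that occur in the table.

```python
import re

TAG = re.compile(r"<([^<>]+)>")

# Function tags proposed at the top of this page.
FUNCTION_TAGS = ("attr", "pred", "subst", "advl")
# Hypothetical defaults, limited to the POS tags used in the table above.
POS_DEFAULT_FUNCTION = {"adj": "attr", "n": "subst"}

def parse_reading(reading):
    """Split one reading such as 'küçük<n>+i<cop><pres>' into a list of
    (lemma, [tags]) pairs, one pair per '+'-joined part."""
    parts = []
    for part in reading.split("+"):
        lemma = part.split("<", 1)[0]
        parts.append((lemma, TAG.findall(part)))
    return parts

def parse_analyses(cell):
    """Split a '/'-separated list of readings into parsed readings."""
    return [parse_reading(r) for r in cell.split("/")]

def syntactic_function(reading):
    """Function of the first part: an explicit tag if present,
    otherwise the default for its POS (None if unknown)."""
    _, tags = reading[0]
    for tag in tags[1:]:
        if tag in FUNCTION_TAGS:
            return tag
    return POS_DEFAULT_FUNCTION.get(tags[0])
```

With this layout the stem keeps a single POS tag, and a disambiguation rule can select or remove readings by function tag alone.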


şimde            'now'                şimde<adv>
şimdelerde       'nowadays'           şimdelerde<adv>
I think you want both of these as <adv>. Historically it's something like "şu emdi<n??>" and "şu emdilerde/emdi<n??><pl><loc>", but for our purposes this is irrelevant. —Firespeaker 19:40, 26 February 2012 (UTC)
More to the point, this isn't any sort of productive process we're seeing here; my point is that it's an isolated productive-looking form because of its unique history. —Firespeaker 19:41, 26 February 2012 (UTC)

Nouns

evdeki          'the one in the house'     ev<n><locattr>/ev<n><locsub>
evdekinde       'in the one in the house'  ev<n><locsub><loc>


$ echo Evlerimizdeymişler | hfst-proc tr-cv.automorf.hfst 

Compound tenses

Things to think about:

  • analysis length:
    • ^келген эмеспи/кел<v><iv><neg><past><p3><pl>+бы<qst>/кел<v><iv><neg><past><p3><sg>+бы<qst>/кел<vaux><neg><past><p3><pl>+бы<qst>/кел<vaux><neg><past><p3><sg>+бы<qst>$, vs.
    • ^келген/кел<v><iv><past>/кел<vaux><past>$ ^эмеспи/эмес<neg><p3><sg>+бы<qst>/эмес<neg><p3><pl>+бы<qst>$
  • tag/morpheme reordering should be done by transfer, such as Turkish->Chuvash negative imperative, Chuvash->Turkish possessives.
  • What about different spacing: do you ever get more than one space, an nbsp, or formatting between e.g. келген and эмеспи? Or anything that isn't a single ASCII space?
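
The two options above can be compared mechanically by counting readings per lexical unit. The following Python is a simplified sketch, not the real stream parser: it assumes units are delimited by `^` and `$`, that readings are separated by `/`, and that none of these characters appear escaped inside a unit. The function names are invented for illustration. `normalise_spaces` shows one possible answer to the spacing question, namely collapsing any whitespace run before lookup.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")

def lexical_units(stream):
    """Return (surface form, [readings]) for every ^...$ unit in `stream`.
    Assumes no escaped '/', '^' or '$' inside a unit."""
    units = []
    for body in UNIT.findall(stream):
        surface, *readings = body.split("/")
        units.append((surface, readings))
    return units

def normalise_spaces(text):
    """Collapse any run of whitespace (including no-break space)
    into a single ASCII space."""
    return re.sub(r"\s+", " ", text)
```

On the examples above, the single-unit analysis of келген эмеспи carries four readings, while the two-unit version carries two readings per unit.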


Following is a mail to the Corpora list. Might be a good idea to have a 'Resources' page/section for Turkic languages, as it is done on language pages.

Message: 6
Date: Wed, 25 Jun 2014 19:54:04 +0200
From: "Christian Chiarcos" <>
Subject: Re: [Corpora-List] Turkic dictionaries
To: "" <>

Dear all,

I would like to thank everyone who responded to my request and who helped
me in personal conversation, in particular, Emily Bender, Jost Gippert,
Max Ionov, Irina Nevskaya, Monika Rind-Pawlowski, Vit Suchomel, Francis
Tyers, and Mardan Wushouer. Please find a summary, with URLs, brief
description and licensing information below (no particular order):

(A) Dictionaries/Wordlists in machine-readable formats

(A.1) Gilles Sérasset's DBnary
machine-readable (RDF) dictionaries generated from Wiktionary, incl.

(A.2) Mardan Wushouer's wordlists
plain word lists for Chinese-Uyghur, Chinese-Kazakh, Uyghur-Kazakh

(A.3) Altaic etymological dictionary\data\alt\turcet
includes 26 Turkic languages, available online and as DBase dump
copyright restricted

(A.4) Freelang
English (and partially, French) word lists for 28 Turkic languages (mostly
small), proprietary list format
freeware (i.e., no modification)

(A.5) Apertium Turkic
word lists for Turkic-Azeri, Kazakh-Tatar, 12 more pairs of Turkic
languages under development
open source (hosted at Sourceforge)

(A.6) Lexicons for Chalkan and Tuva, provided by the RELISH project
available online, XML
licensing to be clarified

(A.7) PanLex
huge collection of word lists in a unified representation (SQL, RDF)
incl. Azeri, Gagauz, Kazakh, Kirgiz, Turkish, Turkmen, Uzbek, etc.
different (mostly open) licenses depending on the original source

(A.8) Intercontinental Dictionary Series,
word lists of minimal core vocabulary
Azeri, Kumyk, Nogai, Terekeme (Azerbaijan dialect)
plain text or RDF

(B) Human-readable dictionaries/wordlists that can be easily converted
into machine-readable formats

(B.1) Wiktionary, various languages (see A.1)
incl. Azeri, Kazakh, Kirgiz, Tatar, Turkish, Turkmen

(B.2) Chalkan dictionary
available for academic use, with attribution, non-commercial

(B.3) Shorica
Shor dictionary and corpus
copyright to be clarified, currently offline (last accessed mid-May 2014)

(B.4) Karachay-Balkar dictionary
Karachay-Balkar - Russian dictionary
copyright restricted

(B.5) Tatar dictionary
Tatar-Russian dictionary
copyright restricted

(B.6) Khakassian dictionary
Khakas - English and Khakas - Russian
copyright restricted

(C) other resources

(C.1) Altaica
link and resource collection, includes machine-readable and human-readable
dictionaries for 17 Turkic languages (not replicated above)

(C.2) Pre-Islamic Old Turkic Texts (VATEC)
glossed corpus (XML) from which a German-Old Turkic word list can be extracted
copyright restricted

(C.3) Glosbe
online access to word lists and translation memories
Azeri, Karachay-Balkar, Kazakh, Tatar, Turkish, Turkmen, Uzbek, etc.
free online API (with severe capacity limits)

Certainly, this list is not exhaustive, so if you feel something important
is missing or incorrect, please let me know ;)

All the best,

Turkic-Turkish texts.