Talk:Turkic languages
Classification
- attributive attr = things that act like adjectives
- predicative pred
- substantive subst = things that act like nouns
- adverbial advl = things that act like adverbs (??)
Hierarchy
- noun (default 'subst') + DECL-NOUN
- noun->adj = n.attr + DECL-ADJ !! No <comp>arison levels though
- adj (default 'attr') + DECL-ADJ
- adj->noun = adj.subst + DECL-NOUN
- num (default 'attr') + DECL-NUM
- num->noun = num.subst + DECL-NOUN
- prn (default 'subst') + DECL-NOUN
- det (default 'attr') + NO-DECL
- v
- v->noun = v.ger + DECL-NOUN
- v->adj = v.glp + DECL-ADJ
- v->adv = v.prc + DECL-ADV
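The nominal part of the hierarchy above can be restated as a lookup table. This is only a sketch: the names `DEFAULTS`, `DECL_BY_FUNCTION`, `VERB_DERIVATIONS` and `declension` are hypothetical and not part of any Apertium tool; the tag names and DECL-* classes are taken from the list.

```python
# Hypothetical restatement of the hierarchy: POS -> (default function, declension class).
DEFAULTS = {
    "n": ("subst", "DECL-NOUN"),
    "adj": ("attr", "DECL-ADJ"),
    "num": ("attr", "DECL-NUM"),
    "prn": ("subst", "DECL-NOUN"),
    "det": ("attr", "NO-DECL"),
}

# Declension class selected when a stem is used in a non-default function
# (n.attr, adj.subst, num.subst in the list above).
DECL_BY_FUNCTION = {"subst": "DECL-NOUN", "attr": "DECL-ADJ"}

# Verbal derivations exactly as listed above (v.ger, v.glp, v.prc).
VERB_DERIVATIONS = {"ger": "DECL-NOUN", "glp": "DECL-ADJ", "prc": "DECL-ADV"}


def declension(pos, function=None):
    """Return the declension class for a POS, optionally with an explicit function tag."""
    default_function, default_decl = DEFAULTS[pos]
    if function is None or function == default_function:
        return default_decl
    return DECL_BY_FUNCTION[function]


print(declension("adj"))           # default use of an adjective
print(declension("adj", "subst"))  # adjective used as a noun
```

Keeping the default function separate from the POS is what lets a stem such as an adjective keep its `<adj>` tag while taking noun declension.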
Types of non-finite verbal forms:
- Adverbial participle: <gnL> (e.g. <gnc> "Conditional adverbial participle")
- Verbal adjectives: <gpL> (e.g. <gpi> "Imperfect verbal adjective")
- Gerunds: <gerN> (e.g. <ger1> "Past/present gerund")
- Participles: <prcN> (e.g. <prc1> "Realis participle")
What about 'cop' and 'pred'?
- The copula is i- (p.79)
- -(y) (pres)
- -(y)DI (past)
- -(y)mIş (evid)
- -(y)sA (cond)
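As a rough illustration of how the overt copula templates above surface, the sketch below expands `(y)DI`, `(y)mIş` and `(y)sA` after a lowercase Turkish stem. It assumes regular vowel harmony and voicing assimilation only; loanwords with exceptional harmony are not handled, and the function names are hypothetical rather than taken from any existing transducer.

```python
BACK = set("aıou")
FRONT = set("eiöü")
VOWELS = BACK | FRONT
ROUNDED = set("ouöü")
VOICELESS = set("çfhkpsşt")


def last_vowel(stem):
    """Return the final vowel of the stem, which drives harmony."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError("stem has no vowel")


def realise(stem, template):
    """Expand a template such as '(y)DI' after `stem` (lowercase input assumed)."""
    v = last_vowel(stem)
    # I: four-way harmony (backness and rounding); A: two-way harmony (backness).
    high = {(True, False): "ı", (True, True): "u",
            (False, False): "i", (False, True): "ü"}[(v in BACK, v in ROUNDED)]
    low = "a" if v in BACK else "e"
    out = stem
    if template.startswith("(y)"):
        template = template[3:]
        if stem[-1] in VOWELS:  # buffer y only after a vowel-final stem
            out += "y"
    for ch in template:
        if ch == "D":
            out += "t" if out[-1] in VOICELESS else "d"
        elif ch == "I":
            out += high
        elif ch == "A":
            out += low
        else:
            out += ch
    return out


for stem, template in [("ev", "(y)DI"), ("hasta", "(y)DI"), ("güzel", "(y)mIş"), ("okul", "(y)sA")]:
    print(stem, template, realise(stem, template))
```

The present-tense form is left out of the loop because it has no overt suffix material to harmonise.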
Notes on resolving morphological ambiguity
Arguments against just having a different tag:
e.g. güzel<n>/güzel<adj>
- We lose the tag denoting the principal function of the stem
- We can't tell the CG to choose the principal function
- We can't tell the difference between 'real' N/A ambiguity and "derivation" ambiguity
Arguments against just piling one tag on top of another:
e.g. güzel<adj>/güzel<adj><n>
- Having two POS in a word makes things confusing
- Having two POS tags in a word makes it difficult to write CG rules
Arguments against having a "zero derivation":
e.g. güzel<adj>/güzel<adj><D_n><n>
- It's ugly and stupid
- Having two POS tags in a word makes it difficult to write CG rules
Adjectives
- güzel 'beautiful': güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
- güzelim 'my beauty': güzel<adj><subst><px1sg>
- güzel konuştu 'she spoke well': güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
- güzel bir köpek 'a beautiful dog': güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
- küçük 'small': küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
- küçük kızlar 'little girls': küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
- küçükler 'little one(s)': küçük<adj><subst><pl>/küçük<n>+i<cop><pres><p3><pl>
- kötü 'bad': kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
- kötü araba '(a) bad car': kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
- kötü yüzmek 'to swim badly': kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
Adverbs
- şimdi 'now': şimdi<adv>
- şimdilerde 'nowadays': şimdilerde<adv>
- I think you want both of these as <adv>. Historically it's something like "şu emdi<n??>" and "şu emdilerde/emdi<n??><pl><loc>", but for our purposes this is irrelevant. —Firespeaker 19:40, 26 February 2012 (UTC)
- More to the point, this isn't any sort of productive process we're seeing here; my point is that it's an isolated productive-looking form because of its unique history. —Firespeaker 19:41, 26 February 2012 (UTC)
Nouns
- evdeki 'the one in the house': ev<n><locattr>/ev<n><locsub>
- evdekinde 'in the one in the house': ev<n><locsub><loc>
Miscellaneous
$ echo Evlerimizdeymişler | hfst-proc tr-cv.automorf.hfst
^Evlerimizdeymişler/Ev<n><pl><px1pl><loc>+i<cop><evid><p3><pl>$
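To inspect output like the hfst-proc analysis above programmatically, a lexical unit can be split into its surface form, readings, and per-morpheme tags. This is a minimal sketch: `parse_unit` is a hypothetical helper, and it assumes the input contains no backslash-escaped characters.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def parse_unit(unit):
    """Split '^surface/reading1/reading2$' into (surface, readings).

    Each reading is a list of (lemma, [tags]) pairs, one pair per '+'-joined part.
    Assumption: no escaped '/', '+', '^' or '$' inside the unit.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit")
    surface, *readings = unit[1:-1].split("/")
    parsed = []
    for reading in readings:
        parts = []
        for part in reading.split("+"):
            lemma = part.split("<", 1)[0]
            parts.append((lemma, TAG.findall(part)))
        parsed.append(parts)
    return surface, parsed


surface, readings = parse_unit(
    "^Evlerimizdeymişler/Ev<n><pl><px1pl><loc>+i<cop><evid><p3><pl>$"
)
print(surface)
for reading in readings:
    print(reading)
```

Splitting on `+` first keeps the copula `i<cop>` as its own unit, so the tags of the noun and of the copula are not mixed together.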
Compound tenses
Things to think about:
- analysis length:
^келген эмеспи/кел<v><iv><neg><past><p3><pl>+бы<qst>/кел<v><iv><neg><past><p3><sg>+бы<qst>/кел<vaux><neg><past><p3><pl>+бы<qst>/кел<vaux><neg><past><p3><sg>+бы<qst>$
vs.
^келген/кел<v><iv><past>/кел<vaux><past>$ ^эмеспи/эмес<neg><p3><sg>+бы<qst>/эмес<neg><p3><pl>+бы<qst>$
- tag/morpheme reordering should be done by transfer, such as Turkish->Chuvash negative imperative, Chuvash->Turkish possessives.
- what about different spacing? Do you ever get more than one space, a non-breaking space, or formatting between e.g. келген and эмеспи, or anything else that isn't a single ASCII space?
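One possible way to approach the spacing question is to detect and normalise any gap that is not a single ASCII space before matching multiword forms. The sketch below covers plain-text whitespace only (including the no-break space); inline formatting between the two words is not handled, and both function names are hypothetical.

```python
import re

# Any run of whitespace. For str patterns in Python 3, \s also matches
# Unicode whitespace such as U+00A0 (no-break space).
GAP = re.compile(r"\s+")


def normalise_gaps(text):
    """Collapse every whitespace run to a single ASCII space."""
    return GAP.sub(" ", text).strip()


def has_irregular_gap(text):
    """True if any gap is something other than exactly one ASCII space."""
    return any(m.group() != " " for m in GAP.finditer(text))


samples = ["келген эмеспи", "келген\u00a0эмеспи", "келген  эмеспи"]
for s in samples:
    print(repr(s), has_irregular_gap(s), repr(normalise_gaps(s)))
```

Running the check over a corpus before deciding on the analysis design would show how often such gaps actually occur.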
Resources
The following is a mail sent to the Corpora list. It might be a good idea to have a 'Resources' page or section for Turkic languages, as is done on language pages.
Message: 6
Date: Wed, 25 Jun 2014 19:54:04 +0200
From: "Christian Chiarcos" <christian.chiarcos@web.de>
Subject: Re: [Corpora-List] Turkic dictionaries
To: "corpora@uib.no" <corpora@uib.no>

Dear all,

I would like to thank everyone who responded to my request and who helped me in personal conversation, in particular, Emily Bender, Jost Gippert, Max Ionov, Irina Nevskaya, Monika Rind-Pawlowski, Vit Suchomel, Francis Tyers, and Mardan Wushouer.

Please find a summary, with URLs, brief description and licensing information below (no particular order):

(A) Dictionaries/Wordlists in machine-readable formats

(A.1) Gilles Sérasset's DBnary
http://kaiko.getalp.org/about-dbnary/
machine-readable (RDF) dictionaries generated from Wiktionary, incl. Turkish
CC-BY-SA

(A.2) Mardan Wushouer's wordlists
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Chinese_Uyghur_Bilingual_Dictionary_v1.zip
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Chinese_Kazakh_Bilingual_Dictionary_v1.zip
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Uyghur_Kazakh_Bilingual_Dictionary_v1.zip
plain word lists for Chinese-Uyghur, Chinese-Kazakh, Uyghur-Kazakh
CC-BY-NC

(A.3) Altaic etymological dictionary
http://starling.rinet.ru/cgi-bin/bdescr.cgi?root=config&morpho=0&basename=\data\alt\turcet
includes 26 Turkic languages, available online and as DBase dump
copyright restricted

(A.4) Freelang
http://freelang.net
English (and partially, French) word lists for 28 Turkic languages (mostly small), proprietary list format
freeware (i.e., no modification)

(A.5) Apertium Turkic
http://wiki.apertium.org/wiki/Turkic_languages#Pairs
word lists for Turkic-Azeri, Kazakh-Tatar, 12 more pairs of Turkic languages under development
open source (hosted at Sourceforge)

(A.6) RELISH
http://tla.mpi.nl/relish/
lexicons for Chalkan and Tuva, provided by the RELISH project
available online, XML
licensing to be clarified

(A.7) PanLex
http://panlex.org
huge collection of word lists in a unified representation (SQL, RDF)
incl. Azeri, Gagauz, Kazakh, Kirgiz, Turkish, Turkmen, Uzbek, etc.
different (mostly open) licenses depending on the original source

(A.8) Intercontinental Dictionary Series
http://lingweb.eva.mpg.de/ids/, http://datahub.io/de/dataset/ids
word lists of minimal core vocabulary
Azeri, Kumyk, Nogai, Terekeme (Azerbaijan dialect)
plain text or RDF
CC-BY-NC-ND

(B) Human-readable dictionaries/wordlists that can be easily converted into machine-readable formats

(B.1) Wiktionary, various languages (see A.1)
http://wiktionary.org
incl. Azeri, Kazakh, Kirgiz, Tatar, Turkish, Turkmen
CC-BY-SA

(B.2) Chalkan dictionary
http://sprachen.sprachsignale.de/tschalkanisch/tschalkanisch.html
German
available for academic use, with attribution, non-commercial

(B.3) Shorica
http://shoriya.ngpi.rdtc.ru/
Shor dictionary and corpus
copyright to be clarified, currently offline (last accessed mid-May 2014)

(B.4) Karachay-Balkar dictionary
http://www.elbrusoid.org/dictionary/
Karachay-Balkar - Russian dictionary
copyright restricted

(B.5) Tatar dictionary
http://tatar.com.ru/dict/dict.php
Tatar-Russian dictionary
copyright restricted

(B.6) Khakassian dictionary
http://khakas.altaica.ru/dictionary/
Khakas - English and Khakas - Russian
copyright restricted

(C) other resources

(C.1) Altaica
http://altaica.narod.ru/e_v-turks.htm
link and resource collection, includes machine-readable and human-readable dictionaries for 17 Turkic languages (not replicated above)

(C.2) Pre-Islamic Old Turkic Texts (VATEC)
http://vatec2.fkidg1.uni-frankfurt.de/
glossed corpus (XML) from which a German-Old Turkic word list can be compiled
copyright restricted

(C.3) Glosbe
http://glosbe.com
online access to word lists and translation memories
Azeri, Karachay-Balkar, Kazakh, Tatar, Turkish, Turkmen, Uzbek, etc.
free online API (with severe capacity limits)

Certainly, this list is not exhaustive, so if you feel something important is missing or incorrect, please let me know ;)

All the best,
Christian