Difference between revisions of "Talk:Turkic languages"
Revision as of 10:32, 26 June 2014
Classification
- attributive attr = things that act like adjectives
- predicative pred
- substantive subst = things that act like nouns
- adverbial advl = things that act like adverbs (??)
Hierarchy
- noun (default 'subst') + DECL-NOUN
- noun->adj = n.attr + DECL-ADJ !! No <comp>arison levels though
- adj (default 'attr') + DECL-ADJ
- adj->noun = adj.subst + DECL-NOUN
- num (default 'attr') + DECL-NUM
- num->noun = num.subst + DECL-NOUN
- prn (default 'subst') + DECL-NOUN
- det (default 'attr') + NO-DECL
- v
- v->noun = v.ger + DECL-NOUN
- v->adj = v.glp + DECL-ADJ
- v->adv = v.prc + DECL-ADV
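One way to make the hierarchy above concrete is a lookup table keyed by part of speech and function. This is only a sketch: the names `HIERARCHY`, `DEFAULT_FUNCTION` and `declension_class` are made up for illustration and are not part of any Apertium tool, and the keys simply copy the labels used in the list.

```python
# Hypothetical encoding of the hierarchy listed above.
# (part of speech, function) -> declension class
HIERARCHY = {
    ("noun", "subst"): "DECL-NOUN",
    ("noun", "attr"): "DECL-ADJ",   # noun->adj, no comparison levels
    ("adj", "attr"): "DECL-ADJ",
    ("adj", "subst"): "DECL-NOUN",  # adj->noun
    ("num", "attr"): "DECL-NUM",
    ("num", "subst"): "DECL-NOUN",  # num->noun
    ("prn", "subst"): "DECL-NOUN",
    ("det", "attr"): "NO-DECL",
    ("v", "ger"): "DECL-NOUN",      # v->noun
    ("v", "glp"): "DECL-ADJ",       # v->adj
    ("v", "prc"): "DECL-ADV",       # v->adv
}

# Default function per part of speech; 'v' has none in the list above.
DEFAULT_FUNCTION = {
    "noun": "subst",
    "adj": "attr",
    "num": "attr",
    "prn": "subst",
    "det": "attr",
}


def declension_class(pos, function=None):
    """Return the declension class for a part of speech and function.

    If no function is given, the default function of the part of speech
    is used. Unknown combinations raise KeyError.
    """
    if function is None:
        function = DEFAULT_FUNCTION[pos]
    return HIERARCHY[(pos, function)]


print(declension_class("adj"))           # default function 'attr'
print(declension_class("adj", "subst"))  # adjective used as a noun
```

Keeping the default function separate from the table means a bare stem only needs an explicit function tag when it departs from the default.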
Types of non-finite verbal forms:
- Adverbial participle: <gnL> (e.g. <gnc> "Conditional adverbial participle")
- Verbal adjectives: <gpL> (e.g. <gpi> "Imperfect verbal adjective")
- Gerunds: <gerN> (e.g. <ger1> "Past/present gerund")
- Participles: <prcN> (e.g. <prc1> "Realis participle")
What about 'cop' and 'pred'?
- The copula is i- (p.79)
- -(y) (pres)
- -(y)DI (past)
- -(y)mIş (evid)
- -(y)sA (cond)
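The capital letters in the suffix templates above are archiphonemes. As a rough sketch of how they surface, assuming the usual reading (D = d/t by voicing, I = ı/i/u/ü and A = a/e by vowel harmony, (y) only after a vowel), the suffixed forms can be realised as below. The names `COPULA` and `realise` are hypothetical, input is assumed to be lowercase, and stems with exceptional harmony are not handled.

```python
VOWELS = set("aeıioöuü")
BACK = set("aıou")
ROUNDED = set("oöuü")
VOICELESS = set("çfhkpsşt")

# Suffixed copula forms from the list above (the present is left out).
COPULA = {"past": "(y)DI", "evid": "(y)mIş", "cond": "(y)sA"}


def high_vowel(last):
    """Four-way harmony for the archiphoneme I."""
    if last in BACK:
        return "u" if last in ROUNDED else "ı"
    return "ü" if last in ROUNDED else "i"


def realise(stem, template):
    """Attach a suffix template to a lowercase stem (sketch only)."""
    last = next(ch for ch in reversed(stem) if ch in VOWELS)
    suffix = []
    if template.startswith("(y)"):
        template = template[3:]
        if stem[-1] in VOWELS:
            suffix.append("y")  # buffer consonant after a vowel
    for ch in template:
        if ch == "D":
            prev = suffix[-1] if suffix else stem[-1]
            suffix.append("t" if prev in VOICELESS else "d")
        elif ch == "I":
            suffix.append(high_vowel(last))
        elif ch == "A":
            suffix.append("a" if last in BACK else "e")
        else:
            suffix.append(ch)
    return stem + "".join(suffix)


for stem in ("hasta", "kitap", "evlerimizde"):
    print(stem, [realise(stem, t) for t in COPULA.values()])
```

The last stem gives evlerimizdeymiş for the evidential, which is the string the analyser segments as `<loc>+i<cop><evid>` in the hfst-proc example further down.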
Notes on morphological disambiguation
Arguments against just having a different tag:
e.g. güzel<n>/güzel<adj>
- We lose the tag denoting the principal function of the stem
- We can't tell the CG to choose the principal function
- We can't tell the difference between `real' N/A ambiguity and "derivation" ambiguity
Arguments against just piling one tag on top of another:
e.g. güzel<adj>/güzel<adj><n>
- Having two POS in a word makes things confusing
- Having two POS tags in a word makes it difficult to write CG rules
Arguments against having a "zero derivation":
e.g. güzel<adj>/güzel<adj><D_n><n>
- It's ugly and stupid
- Having two POS tags in a word makes it difficult to write CG rules
Adjectives
güzel 'beautiful' güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
güzelim 'my beauty' güzel<adj><subst><px1sg>
güzel konuştu 'she spoke well' güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
güzel bir köpek 'a beautiful dog' güzel<adj>/güzel<adj><subst>/güzel<adj><advl>
küçük 'small' küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
küçük kızlar 'little girls' küçük<adj>/küçük<adj><subst>/küçük<adj><advl>
küçükler 'little one(s)' küçük<adj><subst><pl>/küçük<n>+i<cop><pres><p3><pl>
kötü 'bad' kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
kötü araba '(a) bad car' kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
kötü yüzmek 'to swim badly' kötü<adj>/kötü<adj><subst>/kötü<adj><advl>
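With the `<adj><subst>` style of analysis used in these examples, the principal part of speech and the function in context can both be read off a single reading. The following is a minimal sketch under that assumption: `pos_and_function`, `FUNCTION_TAGS` and `DEFAULT_FUNCTION` are hypothetical names, and treating the first tag as the principal POS is an assumption drawn from the examples rather than a documented rule.

```python
import re

# Function tags from the classification at the top of the page.
FUNCTION_TAGS = {"attr", "pred", "subst", "advl"}

# Assumed defaults, using the tag names that appear in the analyses.
DEFAULT_FUNCTION = {
    "n": "subst",
    "adj": "attr",
    "num": "attr",
    "prn": "subst",
    "det": "attr",
}


def pos_and_function(analysis):
    """Split one reading into (principal POS, function in context)."""
    tags = re.findall(r"<([^<>]+)>", analysis)
    pos = tags[0]
    explicit = [tag for tag in tags[1:] if tag in FUNCTION_TAGS]
    function = explicit[0] if explicit else DEFAULT_FUNCTION.get(pos)
    return pos, function


for reading in ("güzel<adj>", "güzel<adj><subst><px1sg>", "güzel<adj><advl>"):
    print(reading, pos_and_function(reading))
```

The principal function stays recoverable in every reading, which is the property the arguments against a plain `güzel<n>`/`güzel<adj>` split are asking for.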
Adverbs
şimdi 'now' şimdi<adv>
şimdilerde 'nowadays' şimdilerde<adv>
- I think you want both of these as <adv>. Historically it's something like "şu emdi<n??>" and "şu emdilerde/emdi<n??><pl><loc>", but for our purposes this is irrelevant. —Firespeaker 19:40, 26 February 2012 (UTC)
- More to the point, this isn't any sort of productive process we're seeing here; my point is that it's an isolated productive-looking form because of its unique history. —Firespeaker 19:41, 26 February 2012 (UTC)
Nouns
evdeki 'the one in the house' ev<n><locattr>/ev<n><locsub>
evdekinde 'in the one in the house' ev<n><locsub><loc>
Miscellaneous
$ echo Evlerimizdeymişler | hfst-proc tr-cv.automorf.hfst
^Evlerimizdeymişler/Ev<n><pl><px1pl><loc>+i<cop><evid><p3><pl>$
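To inspect output like the above programmatically, a lexical unit in the Apertium stream format can be split into its surface form, its readings, and the `+`-joined parts of each reading. This is a minimal sketch: `parse_lexical_unit` is a hypothetical helper, and it deliberately ignores backslash-escaped characters and superblanks, which a real stream can contain.

```python
import re

TAG = re.compile(r"<([^<>]+)>")


def parse_lexical_unit(unit):
    """Parse '^surface/reading1/reading2$' into (surface, readings).

    Each reading is a list of (lemma, [tags]) pairs, one pair per
    '+'-joined part. Escaped characters are not handled.
    """
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    surface, *readings = unit[1:-1].split("/")
    parsed = []
    for reading in readings:
        parts = []
        for part in reading.split("+"):
            lemma = part.split("<", 1)[0]
            parts.append((lemma, TAG.findall(part)))
        parsed.append(parts)
    return surface, parsed


unit = "^Evlerimizdeymişler/Ev<n><pl><px1pl><loc>+i<cop><evid><p3><pl>$"
surface, readings = parse_lexical_unit(unit)
print(surface)
for reading in readings:
    print(reading)
```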
Compound tenses
Things to think about:
- analysis length:
^келген эмеспи/кел<v><iv><neg><past><p3><pl>+бы<qst>/кел<v><iv><neg><past><p3><sg>+бы<qst>/кел<vaux><neg><past><p3><pl>+бы<qst>/кел<vaux><neg><past><p3><sg>+бы<qst>$
vs.
^келген/кел<v><iv><past>/кел<vaux><past>$ ^эмеспи/эмес<neg><p3><sg>+бы<qst>/эмес<neg><p3><pl>+бы<qst>$
- tag/morpheme reordering should be done by transfer, such as Turkish->Chuvash negative imperative, Chuvash->Turkish possessives.
- what about different spacing? Do you ever get more than one space, an nbsp, or formatting between e.g. келген and эмеспи, or anything else that isn't a single ASCII space?
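On the spacing question, one possible workaround (not something the analyser is known to do) is to normalise whitespace before a multiword form is looked up. The sketch below collapses any run of Unicode whitespace, including the no-break space U+00A0, into one ASCII space. `normalise_spacing` is a hypothetical helper; it also flattens newlines and does nothing about inline formatting.

```python
import re

# In Python 3 str patterns, \s matches Unicode whitespace, which
# includes the no-break space U+00A0.
WHITESPACE_RUN = re.compile(r"\s+")


def normalise_spacing(text):
    """Collapse each whitespace run to a single ASCII space."""
    return WHITESPACE_RUN.sub(" ", text).strip()


print(normalise_spacing("келген\u00a0 эмеспи"))
```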
Resources
The following is a mail to the Corpora list. It might be a good idea to have a 'Resources' page/section for Turkic languages, as is done on the language pages.
Message: 6
Date: Wed, 25 Jun 2014 19:54:04 +0200
From: "Christian Chiarcos" <christian.chiarcos@web.de>
Subject: Re: [Corpora-List] Turkic dictionaries
To: "corpora@uib.no" <corpora@uib.no>
Dear all,
I would like to thank everyone who responded to my request and who helped me in personal conversation, in particular, Emily Bender, Jost Gippert, Max Ionov, Irina Nevskaya, Monika Rind-Pawlowski, Vit Suchomel, Francis Tyers, and Mardan Wushouer. Please find a summary, with URLs, brief description and licensing information below (no particular order):
(A) Dictionaries/Wordlists in machine-readable formats
(A.1) Gilles Sérasset's DBnary
http://kaiko.getalp.org/about-dbnary/
machine-readable (RDF) dictionaries generated from Wiktionary, incl. Turkish
CC-BY-SA

(A.2) Mardan Wushouer's wordlists
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Chinese_Uyghur_Bilingual_Dictionary_v1.zip
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Chinese_Kazakh_Bilingual_Dictionary_v1.zip
http://www.ai.soc.i.kyoto-u.ac.jp/~mardan/resource/Uyghur_Kazakh_Bilingual_Dictionary_v1.zip
plain word lists for Chinese-Uyghur, Chinese-Kazakh, Uyghur-Kazakh
CC-BY-NC

(A.3) Altaic etymological dictionary
http://starling.rinet.ru/cgi-bin/bdescr.cgi?root=config&morpho=0&basename=\data\alt\turcet
includes 26 Turkic languages, available online and as DBase dump
copyright restricted

(A.4) Freelang
http://freelang.net
English (and partially, French) word lists for 28 Turkic languages (mostly small), proprietary list format
freeware (i.e., no modification)

(A.5) Apertium Turkic
http://wiki.apertium.org/wiki/Turkic_languages#Pairs
word lists for Turkic-Azeri, Kazakh-Tatar, 12 more pairs of Turkic languages under development
open source (hosted at Sourceforge)

(A.6) RELISH
http://tla.mpi.nl/relish/
lexicons for Chalkan and Tuva, provided by the RELISH project
available online, XML
licensing to be clarified

(A.7) PanLex
http://panlex.org
huge collection of word lists in a unified representation (SQL, RDF)
incl. Azeri, Gagauz, Kazakh, Kirgiz, Turkish, Turkmen, Uzbek, etc.
different (mostly open) licenses depending on the original source

(A.8) Intercontinental Dictionary Series
http://lingweb.eva.mpg.de/ids/, http://datahub.io/de/dataset/ids
word lists of minimal core vocabulary
Azeri, Kumyk, Nogai, Terekeme (Azerbaijan dialect)
plain text or RDF
CC-BY-NC-ND
(B) Human-readable dictionaries/wordlists that can be easily converted into machine-readable formats

(B.1) Wiktionary, various languages (see A.1)
http://wiktionary.org
incl. Azeri, Kazakh, Kirgiz, Tatar, Turkish, Turkmen
CC-BY-SA

(B.2) Chalkan dictionary
http://sprachen.sprachsignale.de/tschalkanisch/tschalkanisch.html
German
available for academic use, with attribution, non-commercial

(B.3) Shorica
http://shoriya.ngpi.rdtc.ru/
Shor dictionary and corpus
copyright to be clarified, currently offline (last accessed mid-May 2014)

(B.4) Karachay-Balkar dictionary
http://www.elbrusoid.org/dictionary/
Karachay-Balkar - Russian dictionary
copyright restricted

(B.5) Tatar dictionary
http://tatar.com.ru/dict/dict.php
Tatar-Russian dictionary
copyright restricted

(B.6) Khakassian dictionary
http://khakas.altaica.ru/dictionary/
Khakas - English and Khakas - Russian
copyright restricted
(C) other resources
(C.1) Altaica
http://altaica.narod.ru/e_v-turks.htm
link and resource collection, includes machine-readable and human-readable dictionaries for 17 Turkic languages (not replicated above)

(C.2) Pre-Islamic Old Turkic Texts (VATEC)
http://vatec2.fkidg1.uni-frankfurt.de/
glossed corpus (XML) from which a German-Old Turkic word list can be compiled
copyright restricted

(C.3) Glosbe
http://glosbe.com
online access to word lists and translation memories
Azeri, Karachay-Balkar, Kazakh, Tatar, Turkish, Turkmen, Uzbek, etc.
free online API (with severe capacity limits)
Certainly, this list is not exhaustive, so if you feel something important is missing or incorrect, please let me know ;)
All the best,
Christian