Turkic lexicon

Some notes on how to go about making a Turkic lexicon for use in Apertium.

Layout

General points:

  • The lexicon will be kept in a single file with the suffix .lexc
  • The file will be laid out in the following order:
    1. The multicharacter symbols
    2. The Root lexicon, pointing to the stem lexicons
    3. The morphotactics (continuation lexica)
    4. The stem lexicons

Multicharacter symbols

Morphological categories must be encased in < and > tags. They may contain the letters a-z and the digits 0-9; in extreme cases they may also include the letters A-Z. They must begin with a letter, not a digit.

Examples:

  • %<n%> Noun
  • %<p3%> Third person
  • %<evid%> Evidential
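
The naming rule above can be checked mechanically. The following is a minimal Python sketch and not part of any Apertium tool: the function name is_valid_tag and the allow_upper flag are made up for illustration. It takes a symbol in its escaped form, as in the examples above, and applies only the rule stated in this section, so a tag containing any other character is rejected.

```python
import re

# Body of a tag: starts with a letter, then letters or digits.
_STRICT = re.compile(r"[a-z][a-z0-9]*")
_WITH_UPPER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_tag(symbol: str, allow_upper: bool = False) -> bool:
    """Check an escaped tag such as '%<n%>' against the naming rule."""
    # The tag must be wrapped in the escaped angle brackets %< and %>.
    if not (symbol.startswith("%<") and symbol.endswith("%>")):
        return False
    body = symbol[2:-2]
    pattern = _WITH_UPPER if allow_upper else _STRICT
    return pattern.fullmatch(body) is not None


for symbol in ["%<n%>", "%<p3%>", "%<evid%>", "%<3p%>", "%<N%>"]:
    print(symbol, is_valid_tag(symbol))
```

The allow_upper flag corresponds to the "extreme cases" in which upper-case letters are tolerated; by default only lower-case tags pass.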

For information on archiphonemes, see the corresponding page.

The list of symbols should be laid out in the following order:

  • The major parts of speech
  • The morphological categories
  • Archiphonemes
  • Other symbols, e.g. Morpheme boundary, ' ', '-' etc.

Every symbol should have a comment. The comments should line up. The alignment should be done with spaces, not tabs.

Morphotactics

There are two types of lexicons to distinguish here:

  1. lexicons which are continuations for other lexicons, and
  2. lexicons which are continuations for stems.


The Root lexicon lists all the major part-of-speech categories of the language, but each of these categories usually has some "subclasses" (different subclasses of adjectives, different types of nouns, etc.). These subclasses are represented by lexicons of the second type.

It is therefore a good idea to keep a list of them, with a short description and examples of each, at the beginning of the stem lexicons section. That way, anyone who wants to contribute stems to the lexicon only needs to know these specifications.

Below we describe some of the classes which are common in Turkic languages.

Naming continuation lexica

  • Continuation lexica will be named in upper case, and may contain letters, numbers and the symbol -.
    • Examples: LEXICON N1, LEXICON DET-DEM, LEXICON ADV

What sorts of distinctions to make

TODO: TV vs. IV, Russian vs. non-Russian in Chuvash

Numerals

The system of treating numerals should be more or less the same for all of the Turkic languages. Because numerals make the lexicon cyclic, they need to be treated with care.

In order to allow morphophonological processes to be handled where there are no letters on the lexical side, we introduce special archiphonemes which are deleted on the surface but can be referred to in rules.

TODO: describe system we have in Kazakh

Stem lexicons

TODO: Why stems go in lexicon and not infinitives

Lines in the stem lexicons should have the following pattern:

  • Left side (lexical form)
  • Colon :
  • Right side (surface form)
  • Space
  • Continuation lexicon
  • Space
  • Semicolon ;
  • Space
  • Exclamation mark !
  • Space
  • Open quote "
  • Gloss (optional)
  • Close quote "

Example:

кӗнеке:кӗнек N2 ; ! "llibre, книга"
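
The pattern above is regular enough to check automatically. Below is a minimal Python sketch under some assumptions: the name parse_stem_line and the field names are hypothetical, the continuation lexicon is assumed to follow the naming rule given earlier (upper-case letters, digits and -), and escaped characters inside the lexical or surface form are not handled.

```python
import re
from typing import Optional

# lexical:surface CONT ; ! "gloss"
# The space between "!" and the opening quote follows the example line
# and is treated as optional here.
STEM_LINE = re.compile(
    r'^(?P<lexical>[^\s:]+):(?P<surface>[^\s:]+) '
    r'(?P<continuation>[A-Z0-9-]+) ; ! ?"(?P<gloss>[^"]*)"$'
)


def parse_stem_line(line: str) -> Optional[dict]:
    """Return the four fields of a stem entry, or None if it is malformed."""
    match = STEM_LINE.match(line)
    return match.groupdict() if match else None


entry = parse_stem_line('кӗнеке:кӗнек N2 ; ! "llibre, книга"')
print(entry)
```

A check like this could be run over a stem lexicon before committing, so that malformed lines are caught early; it says nothing about whether the stem or the continuation lexicon is linguistically right.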

Morphophonology

TODO: px3 is sIn (and why)

Categorisation

General comments.

  • <subst> "Substantive" (like a noun)
  • <attr> "Attributive" (like an adjective)
  • <advl>? "Adverbial" (like an adverb)

Nominals

Compound Nouns

TODO: N-N compounds with <px3>


Cases

Comitative / Instrumental / Sociative

This can be either a postposition, if the language writes it with a space, or it can be attached as a case. If it is attached to the previous word then it should receive the <ins> tag.

Language Suffix Function Example(s)
Chuvash -{п}{A} Instrumental
Sociative
Conjunctive
Ручкапа ҫыратӑп.
Юлташсемпе калатӑп.
Машапа Ваня пахчара.
Turkish -{y}l{A} Instrumental
Sociative
Conjunctive
Kalemle yazıyorum.
Arkadaşlarla konuşuyorum.
Maşayla Vanya bahçede.
Kazakh -{M}ен/-{M}енен
-{M}ен/-{M}енен
{M}ен/{M}енен
Instrumental
Sociative
Conjunctive
Қалеммен жазып жатыр.
Досымен сөйлесіп жатыр.
Маша мен Ваня бақшада.
Kyrgyz менен/мен Instrumental
Sociative
Conjunctive
Калам менен жазып жатат.
Досу менен сүйлөшүп жатат.
Маша менен Ваня бакчада.
Tatar белән Instrumental
Sociative
Conjunctive
Каләм белән яза.
Дусты белән сөйләшә.
Маша белән Ваня бакчада.
Tuvan биле/-биле Instrumental
Sociative
Conjunctive


Азамат биле Айгуль садта
Teleut -мынаң Instrumental
Sociative
Conjunctive
Ручкамынаң чиип јам/јадым.
Ол найымынаң эрмектеш јат.
Маша Панямынаң садта.
Altay -л{A} Instrumental
Sociative
Conjunctive
Ручкала бичип јадым.
Ол нӧкӧрлӧ куучындап јат.
Маша Паняла садта.
Shor -{B}{A} Instrumental
Sociative
Conjunctive
Пасчақпа пасчам.
Ол нанчыба чооқтапча.
Маша Паняба садта.

Adjectives

Turkic languages usually have several different classes of adjective. It is best to examine the morphological specifics of adjectives to determine what the different categories are in a given language. However, a commonly found set of categories, with arbitrarily numbered names, is as follows:

  • A1: adjectives that can be both substantivised and adverbialised
    • three readings: <adj>, <adj.subst> and <adj.advl>,
    • all readings have comparison levels;
  • A2: adjectives without adverbial reading
    • <adj> and <adj.subst> readings,
    • both readings have comparison levels;
  • A3: adjectives without adverbial reading
    • <adj> and <adj.subst> readings,
    • no comparison levels at all;
  • A4: "pure" adjectives -- no adverbial or substantive readings
    • <adj> reading,
    • no comparison levels.
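
The four classes above differ in only three properties: whether a substantive reading exists, whether an adverbial reading exists, and whether comparison levels exist. As a minimal sketch (the function name adjective_class is hypothetical, and the scheme itself is only the commonly found one described above, not a fixed standard), the mapping can be written as a lookup table in Python:

```python
from typing import Optional


def adjective_class(has_subst: bool, has_advl: bool, has_comp: bool) -> Optional[str]:
    """Map observed behaviour of an adjective to one of the classes A1-A4."""
    table = {
        # (substantive reading, adverbial reading, comparison levels)
        (True, True, True): "A1",
        (True, False, True): "A2",
        (True, False, False): "A3",
        (False, False, False): "A4",
    }
    # Any other combination is outside this four-way scheme.
    return table.get((has_subst, has_advl, has_comp))


print(adjective_class(True, True, True))
print(adjective_class(False, True, True))
```

A result of None signals a combination that the four classes do not cover, which is a hint that the language may need an additional class.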

Examples by language

Setting up charts that group a language's adjectives by their behaviour is one way both to determine the correct categorisation for the language and to present the categories. (Remember that * before a form marks it as unattested or impossible.)

Chuvash
Type Example Reading Phrase
A1 лайӑх "good" <adj> Ку лайӑх кĕнеке.
лайӑхтарах <adj><comp> Ку лайӑхтарахчĕ.
лайӑх <adj><advl> Вӑл лайӑх ишет.
лайӑххисене <adj><subst><pl><dat> Лайӑххисене куратӑп.
A2 кӑвак "blue" <adj> Эпĕ çак кӑвак кĕнекене куратӑп.
кӑвактарах <adj><comp> Ку кӗнеке кӑвактарахчĕ.
*кӑвак <adj><advl>
кӑваккисем <adj><subst><pl> Кӑваккисем куратӑп.
A3 вилĕ "dead" <adj> Эпĕ çак вилӗ сынна куратӑп.
*вилĕрех, *вилĕтерех <adj><comp>
*вилĕ <adj><advl>
виллисем <adj><subst><pl> Эпĕ виллисем куратӑп.
A4 тĕп "main" <adj> Ку тӗп кӗнеке.
*тĕпрех, *тĕптерех <adj><comp>
*тĕп <adj><advl>
*тĕп <adj><subst>
Subtypes
  • A1/A2
    Comparatives can be formed with -рах, -тарах, or both, depending on the stem-final sound
  • A1/A2/A3
    Substantivisation can be done by means of a suffix, e.g. лайӑхх·и (in fact the 3rd person possessive), or without one
Kazakh
Kyrgyz
Type Example <adj> <adj><comp> <adj><advl> <adj><subst>
A1 жакшы Бул китеп жакшы. Бул китеп жакшыраак. Ал сууда жакшы сүзөт. Жакшыларды тааныбайм.
A2 акылдуу Акылдуу технологиялар өнүгүп жатат Ал досуна караганда акылдуураак Акылдуулар менен кеңеш
A3 алгачкы Эң алгачкы мектепке келгеним эсимден кетпейт Биздин агайыбыз алгачкылардан болуп ошол методиканы колдоно баштаган
A4 парламенттик Парламенттик башкаруу системасы кандай болот
A5 кеңири Бактериялар — табиятта кеңири таралган эң жөнөкөй бир клеткалуу организмдер тобу продукция үчүн кеңирирээк рыноктор ачылат
Tatar
  • A1 - яхшы
  • A2 - иске, (күк)
  • A3 - үле
  • A4 - төп
Turkish
Type Example Reading Phrase
A1 iyi "good" <adj> Bu iyi kitap.
iyi <adj><advl> O iyi yüzüyor.
iyilerden <adj><subst><pl><abl> İyilerden geldim.
A3 ölü "dead" <adj> Bu ölü adamı görüyorum.
*ölü <adj><advl>
ölüleri <adj><subst><pl> Ölüleri görüyorum.
A4 ana "main" <adj> Bu programın ana fonksiyonunda sorun var.
*ana <adj><advl>
*ana <adj><subst>

Adverbs

Postpositions

TODO: "postpositions" which take poss./case are nouns

Pronouns

Reflexive pronoun

Basic verb morphology

This includes stuff that comes before any TAME morphology.

Mood-like stuff

Passive
Causative
Reflexive
Cooperative

The cooperative (<coop>) adds the meaning to an event that multiple people did it together. In most Turkic languages it has the form -Iş or similar. For example, in Kazakh:

  • Without cooperative: Біз ұнды көтердік. "We picked up the flour."
  • With cooperative: Біз ұнды көтерістік. "We [worked together to] pick up the flour."

In a number of languages it can also add the meaning of one person helping another do something, especially when used with a singular subject.

  • With cooperative: Мен сен көтерген ұнды көтерістім. "I [helped] pick up the flour that you were lifting."

Sometimes the cooperative can get lexicalised in a verb form. E.g., "арала-" means "wander around", but "аралас-" means "be mixed together". In this case there is a historical derivational relation between the two forms, but it's probably not safe to say that there is a productive cooperative relationship between them.

Also, in some languages, like Kyrgyz and some dialects of Uzbek, the cooperative has been reappropriated to add plural meaning as well. So көтөрүштү can be parsed with almost any combination of plurality and cooperativeness, i.e., "s/he helped lift", "they lifted", "they lifted together", "they helped someone lift".

Negative

Most Turkic languages have a negative form that comes after derivational suffixes and before TAME morphology. The form is usually something like -mA or -BA. But see also information on Compound Negative Forms. Either way, the tag for negative is <neg>.

Finite verb forms

Defective verbs

Many Turkic languages have a defective copula verb in a form like e-/i-. We call it defective because it does not have a full paradigm, and can be limited in the contexts in which it's used. It often alternates with null forms (e.g., in present tenses) or with another "be"/"become" verb (like ol-/bol-, e.g., in future tenses).

This table presents common forms of the copula.

language UR uses / analyses notes
Kazakh е<cop>
  • е<cop>: (+agreement)
  • е<cop><ifi>:еді (+agreement)
  • е<cop><neg>:емес (+agreement)
  • е<cop><ger>:екен (+noun stuff)
екен is also a modal particle
Tatar и<cop>
  • и<cop>: (+agreement)
  • и<cop><ifi>:иде (+agreement)
  • и<cop><neg>:түгел (+agreement)
  • и<cop><ger>:икән (+noun stuff)
икән is also a modal particle
Kyrgyz э<cop>
  • э<cop>: (+agreement)
  • э<cop><ifi>:эле (+agreement)
  • э<cop><neg>:эмес (+agreement)
  • э<cop><ger>:экен (+noun stuff)
экен is also a modal particle
Turkish i<cop>
  • i<cop>:+{y} (+agreement)
  • i<cop><ifi>:+{y}d{I} (+agreement)
  • i<cop><neg>:değil (+agreement)
Uzbek e<cop>
  • e<cop>: (+agreement)
  • e<cop><ifi>:edi (+agreement)
  • e<cop><neg>:emas (+agreement)
  • e<cop><ger>:ekan (+noun stuff) (?)
ekan is also a modal particle

Compound Negative Forms

A number of Turkic languages have compound negative verb forms in addition to regular ones. These forms normally consist of what appears to be either a verbal noun or verbal adjective, followed by a negative adjective or adverb.

For example, in Kazakh, the negative of the general/future is formed with a regular negative, e.g. "бармаймын" "I won't go", but the negative of one of the past-tense forms is a compound negative, e.g. "барған жоқпын" "I didn't go".

Non-finite verb forms

This section outlines what categories of non-finite verb forms exist in Turkic, and how to identify the type of category created by a given affix.

overview of non-finite verb forms by type and language
language verbal nouns verbal adjectives participles verbal adverbs infinitives
Kazakh -GAн
-EтIн

-(A)р
-GAн
-EтIн
-ушI
-(A)р
-GIс
-Iп
-E
-сA (+<p*>)
-GI (+<p*>)
-EтIн
-GAлI
-Iп
-E
-сA (+<p*>)
-GAншA
Turkish -DIk
-(y)AcAk
-mIş
-(V)r
-m(A)
-m(A)k
-(y)Iş
-DIk (+<p*>)
-(y)AcAk (+<p*>)
-mIş
-(V)r
-(y)An
-(y)IncA
-(y)Ip
-(V)rken
-(y)ArAk
Tatar -GAн
-W
-(V)р
-GAн
-(V)р
-Iп
-E
-GI (+<p*>)
-Iп
-E
-(V)ргA

In general, tags for subordinate-clause forms are built on ger, gpr or gna (ger_, gpr_, gna_), and tags for main-clause forms are built on prc (prc_).
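
The tag convention just stated can be expressed as a small lookup. This is a minimal Python sketch, not an Apertium API: the name describe_nonfinite is hypothetical, and the category names are the ones used in the section headings of this page (verbal nouns, verbal adjectives, verbal adverbs, participles).

```python
from typing import Optional, Tuple

# Tag stem -> (category, clause type), following the summary above.
NONFINITE = {
    "ger": ("verbal noun", "subordinate"),
    "gpr": ("verbal adjective", "subordinate"),
    "gna": ("verbal adverb", "subordinate"),
    "prc": ("participle", "main clause"),
}


def describe_nonfinite(tag: str) -> Optional[Tuple[str, str, str]]:
    """Split a tag like 'ger_impf' into category, clause type and TMAE part."""
    stem, _, spec = tag.strip("<>").partition("_")
    if stem not in NONFINITE:
        return None
    category, clause = NONFINITE[stem]
    return category, clause, spec


print(describe_nonfinite("<ger_impf>"))
print(describe_nonfinite("<prc_irre>"))
```

A bare tag such as <gna> yields an empty TMAE part, and any tag that is not built on one of the four stems yields None.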

Verbal nouns / gerunds

Verbal nouns are forms of verbs that allow one to use a verb phrase as a noun phrase. An example in English might be "running" in the sentence "I like running", or "eating beshbarmaq with my hands" in "I believe in eating beshbarmaq with my hands". The former sentence in Kazakh would be:

Мен
мен<prn><nom>
I
жүгіруді
жүгір<v><iv><ger><acc>
running
жақсы
жақсы<adv>
well
көремін
көр<v><tv><aor><p1><sg>
I see
"I like running."

You can also embed subjects, kind of like the English "I saw him/his running home."

Мен
мен<prn><nom>
I
оның
ол<prn><gen>
his
үйге
үй<n><dat>
to home
қарай
қарай<post>
towards
жүгіретінін
жүгір<v><iv><ger_impf><px3sp><acc>
his running
көрдім
көр<v><tv><ifi><p1><sg>
I saw
"I saw him running home."

This same sentence could also be translated as follows, depending on whether you're focusing on the fact that he was running (previous) or that you saw him run home (following):

Мен
мен<prn><nom>
I
оның
ол<prn><gen>
his
үйге
үй<n><dat>
to home
қарай
қарай<post>
towards
жүгіргенін
жүгір<v><iv><ger_prf><px3sp><acc>
his running
көрдім
көр<v><tv><ifi><p1><sg>
I saw
"I saw him run home."

As implied by this example, while the tense of gerunds is limited in English, gerunds in most Turkic languages can have a wide range of tense/mood/aspect/evidentiality (TMAE) combinations. Many of these are translated to languages like English as subordinate clauses, e.g. "I believe that he eats beshbarmaq with his hands.":

Беспармақ
беспармақ<n><acc>
beshbarmaq
қолымен
қол<n><px3sp><inst>
with his hands
жейтініне
же<v><tv><ger_impf><px3sp><dat>
to his eating
сенемін
сен<v><tv><aor><p1><sg>
I believe
"I believe that he eats beshbarmaq with his hands."

Notice that in these examples, the verb phrase is being used as a subject, object, adjunct, etc. That is, in Turkic languages, gerunds can take any grammatical role (and morphology) that a noun phrase can take. In Kazakh, for example, the verbal nouns can take any combination of possession and/or case suffixes. They may sometimes even take plural suffixes, though often forms that appear to be a gerund followed by a plural suffix are actually plural substantivised verbal adjectives (<gpr><subst><pl>).

For Turkic languages in Apertium, the most fundamental gerund of a language (often used as an infinitive form) takes the <ger> tag, and other gerunds take tags based on <ger> plus something about their TMAE specification, such as <ger_impf> (for "imperfective gerund") or <ger_fut> (for "future gerund").

Verbal adjectives

Verbal adjectives are forms of verbs that allow one to use a verb phrase as an adjectival phrase. An example in English might be "running" in the sentence "The running man startled me" (as opposed to "the sitting man"), or "running home" in "The man running home startled me" (as opposed to "the man eating beshbarmaq"). These sentences in Kazakh would be:

Жүгіретін
жүгір<v><iv><gpr_impf>
running
адам
адам<n><nom>
man
мені
мен<prn><acc>
me
шошытты.
шошы<v><iv><caus><ifi><p3><sg>
he startled.
"The running man startled me."
Үйге
үй<n><dat>
to home
жүгіретін
жүгір<v><iv><gpr_impf>
running
адам
адам<n><nom>
man
мені
мен<prn><acc>
me
шошытты.
шошы<v><iv><caus><ifi><p3><sg>
he startled.
"The man running home startled me."

Notice that while in English verbal adjective phrases that are longer than just a verb must be placed after the noun they modify, in Kazakh verbal adjective phrases of any length are only ever placed before the noun.

Phrases formed using verbal adjectives in Turkic languages are often translated using relative clauses in languages like English (e.g., "The man who was running [home] startled me."). Note that in Turkic languages, usually any part of the verb phrase can be relativised (i.e., "extracted" from the embedded verb phrase and made into a nominal argument which the verbal adjective phrase then "modifies"). For example, "The man who I gave a fork to yesterday was eating beshbarmaq" and "The man that gave me a fork yesterday was eating beshbarmaq" can both be translated into Kazakh using verbal adjectives.

Кеше
кеше<adv>
yesterday
мен
мен<prn><nom>
I
шанышқы
шанышқы<n><acc>
fork
берген
бер<v><tv><gpr_past>
having given
адам
адам<n><nom>
man
беспармақ
беспармақ<n><nom>
beshbarmaq
жеп
же<v><tv><prc>
eating
жатқан.
жат<vaux><past><p3><sg>
was
"The man who I gave the fork to yesterday was eating beshbarmaq."
Кеше
кеше<adv>
yesterday
маған
мен<prn><dat>
to me
шанышқы
шанышқы<n><acc>
fork
берген
бер<v><tv><gpr_past>
having given
адам
адам<n><nom>
man
беспармақ
беспармақ<n><nom>
beshbarmaq
жеп
же<v><tv><prc>
eating
жатқан.
жат<vaux><past><p3><sg>
was
"The man who gave me the fork yesterday was eating beshbarmaq."

In English, there is a difference between restrictive relative clauses, which limit the reference of the noun, and non-restrictive ones, which do not. E.g., "The man that I saw yesterday startled me" (specifically that man startled me) versus "The man, who I saw yesterday, startled me" (a man startled me; it happens that I saw him yesterday). In Turkic languages the default meaning of a verbal adjective is usually the restrictive one, e.g. in Kazakh:

Мен
мен<prn><nom>
I
кеше
кеше<adv>
yesterday
көрген
көр<v><tv><gpr_past>
seen
адам
адам<n><nom>
man
мені
мен<prn><acc>
me
шошытты.
шошы<v><iv><caus><ifi><p3><sg>
he startled.
"The man I saw yesterday startled me."

To get the non-restrictive meaning in a Turkic language, two finite verb forms would normally be used (e.g., "The man scared me; I saw him yesterday").

Verbal adjectives can also be substantivised to mean "the ones who ...". For example, in Kazakh, in a translation of the sentence "the ones/people I saw yesterday were eating beshbarmaq", "the ones/people I saw yesterday" would be formed like the first part of "the man I saw yesterday", but without the word "man", and a plural marker instead:

Мен
мен<prn><nom>
I
кеше
кеше<adv>
yesterday
көргендер
көр<v><tv><gpr_past><subst><pl>
ones seen
беспармақ
беспармақ<n><nom>
beshbarmaq
жеп
же<v><tv><prc>
eating
жатқан.
жат<vaux><past><p3><pl>
were
"The ones/people I saw yesterday were eating beshbarmaq."

For Turkic languages in Apertium, the tags for verbal adjectives are based on <gpr>, with a brief TMAE specification following, such as <gpr_past> (for "past-tense verbal adjective") or <gpr_impf> (for "imperfective verbal adjective"). The abbreviation "gpr" comes from the Russian phrase "глагольное прилагательное" [glaˈgolʲnəjɪ prʲilaˈgatʲɪlʲnəjɪ], which means "verbal adjective".

Participles

Participles are verb forms that allow a verb phrase to be combined with other verbs for the purpose of adding information about tense/mood/aspect/voice/evidentiality (TMAVE) to the utterance. That is, they are usually used in the creation of "compound verb tenses", so the following word is almost always a verbal auxiliary, and the participial verb phrase and auxiliary together form only one predicate.

Revised definition: a participle is a non-finite form that can be the root of a clause (i.e., it governs argument structure), that provides some part of the TAM specification, and that requires a finite auxiliary for agreement and for the other parts of TAM. This means that <prc> forms can be used with copulas too.

Examples of relevant compound verb phrases in Kazakh include the following:

phrase gloss participle
тамақ жеп біттің ‘you finished eating’ ^жеп/же<v><tv><prc_prf>$
тамақ жей бердің ‘you kept eating’ ^жей/же<v><tv><prc_impf>$
тамақ жейтін шығармын ‘I seem to be eating’ ^жейтін/же<v><tv><prc_irre>$
тамақ жесең болады ‘you may/can eat’ ^жесең/же<v><tv><prc_cnd><p2><sg>$
тамақ жегің келеді ‘you want to eat’ ^жегің/же<v><tv><prc_vol><p2><sg>$
тамақ жегелі жатырсың ‘you're about to eat’ ^жегелі/же<v><tv><prc_purp>$
etc.

Some Turkic participles must agree with their subjects (in person, number, and formality), as with Kazakh's <prc_cnd> and <prc_vol> above, and most (but not all!) of them also have negative forms. Examples of negative forms of participles include Kazakh ^жемей/же<v><tv><neg><prc_prf>$ and ^жемейтін/же<v><tv><neg><prc_irre>$.
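
Analyses like the ones above are written as ^surface/lemma<tag><tag>$. When checking such forms in bulk (for example, to find which participles are negative), it can help to split them into parts. The following is a minimal Python sketch with a hypothetical function name, parse_unit; it assumes exactly one analysis per unit and does not handle ambiguous units or escaped characters.

```python
import re
from typing import List, Tuple

# ^surface/lemma<tag1><tag2>...$ with a single analysis.
UNIT = re.compile(
    r"^\^(?P<surface>[^/]*)/(?P<lemma>[^<]*)(?P<tags>(?:<[^<>]+>)*)\$$"
)
TAG = re.compile(r"<([^<>]+)>")


def parse_unit(unit: str) -> Tuple[str, str, List[str]]:
    """Split '^surface/lemma<t1><t2>$' into surface form, lemma and tag list."""
    match = UNIT.match(unit)
    if match is None:
        raise ValueError(f"not a single-analysis unit: {unit!r}")
    return match["surface"], match["lemma"], TAG.findall(match["tags"])


surface, lemma, tags = parse_unit("^жемейтін/же<v><tv><neg><prc_irre>$")
print(surface, lemma, tags, "neg" in tags)
```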

For Turkic languages in Apertium, the tags for participles are based on <prc>, with a brief TMAE specification following, such as <prc_impf> (for "imperfective participle") or <prc_irre> (for "irrealis participle"). The abbreviation "prc" comes from the English term "participle" or the Russian equivalent "причастие" [prʲiˈtɕastʲijɪ].

Most participles, along with most verbal adverbs, are often referred to as "converbs" in the Turkology literature.

Verbal adverbs

Verbal adverbs are forms of verbs that allow one to use a verb phrase as an adjunct to another verb phrase. Each verb phrase (the main one, and the one subordinated by the verbal adverb) is an independent predicate, resulting in two separate predicates. Use of these forms often relates two events to one another in some way temporally. An example of a verbal adverb in English is "running" in "Running home, I saw that man". In Kazakh, a similar sentence could be rendered as:

Үйге
үй<n><dat>
to home
жүгіріп
жүгір<v><iv><gna>
having run
мен
мен<prn><nom>
I
сол
сол<det>
that
адамды
адам<n><acc>
man
көрдім.
көр<v><tv><ifi><p1><sg>
I saw.
"Having run home, I saw that man."

Another example can be found in the following Kazakh sentence:

Мен
мен<prn><nom>
I
сол
сол<det>
that
адамды
адам<n><acc>
man
көрсем
көр<v><tv><gna_cond><p1><sg>
if I see
үйге
үй<n><dat>
to home
жүгіремін
жүгір<v><iv><aor><p1><sg>
I will run
"If I see that man, I will run home."

Some Turkic verbal adverbs must agree with their subjects (in person, number, and formality), as with Kazakh's <gna_cond> above, and almost all of them also have negative forms. Examples of negative forms of verbal adverbs include Kazakh ^жемей/же<v><tv><neg><gna>$ and ^жемесең/же<v><tv><neg><gna_cond><p2><sg>$. Verbal adverbs can have the same subject as the main verb phrase or a different one.

Some forms that are not morphologically verbal adverbs are used in very much the same way as verbal adverbs in Turkic languages. For example, in Kazakh, a form like ^жүгіргенде/жүгір<v><iv><ger_past><loc>$ allows a verb phrase with the verb "жүгір" to be used as an adjunct to another verb phrase, for example in the following sentences in Kazakh:

Үйге
үй<n><dat>
to home
жүгіргенде
жүгір<v><iv><ger_past><loc>
running
мен
мен<prn><nom>
I
сол
сол<det>
that
адамды
адам<n><acc>
man
көрдім.
көр<v><tv><ifi><p1><sg>
I saw.
"Running home, I saw that man."
Үйге
үй<n><dat>
to home
жүгіргеннен
жүгір<v><iv><ger_past><abl>
running
бері
бері<post>
since
мен
мен<prn><nom>
I
сол
сол<det>
that
адамды
адам<n><acc>
man
көрген жоқпын.
көр<v><tv><neg><past><p1><sg>
I have not seen.
"Since running home, I have not seen that man."

This is much like English expressions like "While running home, I saw that man" or "Since running home I have not seen that man", where a subordinating "conjunction" or a preposition is used.

For Turkic languages in Apertium, the most fundamental verbal adverb of a language takes the <gna> tag and other verbal adverbs take tags based on <gna>, with a brief TMAE specification following, such as <gna_cond> (for "conditional verbal adverb"). The abbreviation "gna" comes from the Russian phrase "глагольное наречие" [glaˈgolʲnəjɪ naˈrʲetɕijɪ], meaning literally "verbal adverb" (even though these might more commonly be called "деепричастие").

Most verbal adverbs, along with most participles, are often referred to as "converbs" in the Turkology literature.

Question word

Copula

Special cases

Nationalities and ethnicities

Should they be nouns or adjectives?

Colour terms

Cardinal points

The words бар and жок

In most (if not all) Turkic languages, there will be a word for "existing" and a word for "not existing". These are adjectives.

Prototypical parts of speech:

  • adjective:
  • noun:
  • verb:
  • determiner:
  • pronoun:
  • adverb:

Positive tests:

  • Predicative use:
    • бу китап бар.
  • Attributive use
    • бу бар китап.
  • Place in the TAMVE system
    • бу китап бар болду
    • бу китап бар болып жатыр
  • Place of predication
  • Can be substantivised?
  • What is its "subject"/"argument"/whatever? (nominative case nouns and gerunds)
  • What relation does it have with the subject of the sentence?
  • What relation does it have with the predicate of the sentence?
  • Can bar/yok be used attributively without an argument? E.g., in the case of "men yok kezde" the argument is "men", but most adjectives don't have arguments.
  • Are there other adjectives which take arguments like that (e.g., could you use kerek like that too)? How about other adjectives taking nominal arguments, something like "sky high prices" or "compound rich language"?

Negative tests:

  • Doesn't have verb morphology?
  • Doesn't have reference (e.g. not a pro- form)
  • But also not negatable with emes/emas/tügel/deyil (are there other non-negatable adjectives?)
Language Word pred attr subst
Kazakh бар
жоқ
Kyrgyz бар
жок
Tatar бар
юк
Bashkir бар
юҡ
Kumyk бар
ёкъ
Nogay бар
йок
Turkish var
yok
Chuvash пур
ҫук
Tuvan бар
чок
Yakut баар
суох
Dolgan баар
һуох
Uzbek bor
yo'q
Examples
  • Бірақ әкесінің бар байлығы сақталып тұрған банк банкротқа ұшырап, аяқасты кедейленіп қалған Теккерейге ақша табу үшін журналист, суретші-карикатурист болуға тура келді.
  • "Ұлы жоқ үйде қыз туғанда ат қою" - giving a name when a girl is born in a house with no sons
  • "ешқандай иесі жоқ жер" land without any kind of owner
  • (?)"biz mutlaka yok şeyler yapmayacağım"
  • "Аллаһның болу ғалам үшін өте керек нәрсе , оның жоқтығы мүмкін емес"
  • өндіріске керек су және ауыз су жетістіру
  • атын дискі , дыбысалғыш , оған граммофон табағының керек тұсына қоюға және

« Derivation » and friends

The principle behind dealing with derivation is that, as far as possible, we don't want to deal with it. Derivation causes problems for translation because the translation of derived forms is even more unpredictable than that of inflected forms. So, in general, just say no to derivation. However, to get any kind of reasonable coverage, each language will need to handle about 5 to 10 suffixes which are very productive and quite predictable in meaning.

"Abessive" case: -sIz

The abessive or privative translates as "without" or "-less" in English. It can be attributive, adverbial, or substantival. The base reading is <advl>, as with other postpositions. If we want <attr> and <subst> readings, then the word should be lexicalised as an adjective.

Chuvash -с{Ӑ}р
Tatar, Kazakh, Kyrgyz -с{I}з
Turkish, Turkmen -s{I}z
Uzbek -siz

The suffix -LI

Tatar, Kazakh -л{I}
Kyrgyz -луу
Turkish, Turkmen -l{I}

The suffix -LIK

The suffix -KI

See Turkic languages/Ki.

The suffix -DAKI

See Turkic languages/Ki.

So far we are calling this a form of the locative <loc> which can take <attr> and <subst> tags. With the <subst> tag it allows nominal inflection, but without the possibility of recursively having -DAKI.

The suffix -NIKI

See Turkic languages/Ki.

So far we are calling this a form of the genitive <gen> followed by the <subst> tag. It allows nominal inflection, but without the possibility of recursively having -NIKI.

Language specific issues

Turkmen: stem-final voiced and voiceless stops

In Turkmen, there are three types of stem-final stops:

  • voiced stops
  • voiceless stops
  • stops that are voiceless syllable-finally and voiced intervocalically

TODO: finish description of this and explain how it can be / is dealt with

Chuvash: Russian loans ending in -a with non-final stress

Orthographic issues

Spelling errors

Sometimes people writing Turkic languages make typos and spelling errors. Most of the languages don't have spell checkers, so it's easier to make mistakes.

We would like to be able to analyse forms that contain systematic errors, e.g. harmony errors, but we also want such forms to be tagged as errors.

Ideas:

  • Compile the lexc lexicon with a plain twol file (containing just the alphabet and the removal of the morpheme boundary) to give full.hfst
  • Compile the same lexicon with the full twol file (all the phonology) to give the good analyses, lang.hfst, and subtract lang.hfst from full.hfst, leaving full-error.hfst
  • Intersect full-error.hfst with the forms that we actually find in the corpus, giving full-error-corpus.hfst (the reason for this is that we want to avoid very spurious analyses)
  • Concatenate a tag such as <err_orth> to the end of full-error-corpus.hfst and give it a high weight, giving full-error-corpus-orth.hfst
  • Union full-error-corpus-orth.hfst with lang.hfst to give a transducer with the correct forms plus high-weighted, tagged incorrect forms
  • Note: what we currently do with spellrelax could also be done this way.
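
The idea above can be illustrated with plain Python dictionaries standing in for transducers. This is only a toy model under stated assumptions: a "transducer" here is a finite dict from (surface form, analysis) pairs to weights, the function name build_error_tolerant and the weight value are made up, and the real operations would be carried out on HFST transducers rather than on dicts. The Turkish plural forms serve only as a small example of a harmony error.

```python
# Toy model: a "transducer" is a dict mapping (surface, analysis) -> weight.
ERR_TAG = "<err_orth>"
ERR_WEIGHT = 10.0  # arbitrary "high" weight, chosen for illustration


def build_error_tolerant(full, lang, corpus_forms):
    """Combine correct analyses with tagged, weighted erroneous ones."""
    # full-error: pairs allowed without phonology but not by the real analyser.
    full_error = {pair: w for pair, w in full.items() if pair not in lang}
    # full-error-corpus: keep only surface forms attested in the corpus.
    full_error_corpus = {
        pair: w for pair, w in full_error.items() if pair[0] in corpus_forms
    }
    # Append the error tag and add a high weight.
    tagged = {
        (surface, analysis + ERR_TAG): w + ERR_WEIGHT
        for (surface, analysis), w in full_error_corpus.items()
    }
    # Union with the correct analyses.
    return {**lang, **tagged}


lang = {("kitaplar", "kitap<n><pl>"): 0.0, ("evler", "ev<n><pl>"): 0.0}
full = {
    **lang,
    ("kitapler", "kitap<n><pl>"): 0.0,  # harmony error, attested below
    ("evlar", "ev<n><pl>"): 0.0,        # harmony error, not attested
}
corpus_forms = {"kitaplar", "kitapler", "evler"}

result = build_error_tolerant(full, lang, corpus_forms)
for pair, weight in result.items():
    print(pair, weight)
```

The unattested form evlar is dropped by the corpus step, which is the point of intersecting with the corpus: only errors that people actually make receive an analysis.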

Multiple writing systems