Difference between revisions of "Turkic lexicon"

Revision as of 13:01, 14 July 2012

Some notes on how to go about making a Turkic lexicon for use in Apertium.

Layout

General points:

The lexicon will be kept in a single file with the suffix .lexc
The file will be laid out in the following order:
1. The multicharacter symbols
2. The Root lexicon, pointing to the stem lexicons
3. The morphotactics (continuation lexica)
4. The stem lexicons
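
The ordering above might look like the following skeleton. It is only a sketch: the stem line is the example used later on this page, while the lexicon name Nouns and the placeholder body of N2 are assumptions made for illustration.

```lexc
! 1. Multicharacter symbols
Multichar_Symbols

%<n%>    ! Noun

! 2. Root lexicon, pointing to the stem lexicons
LEXICON Root

Nouns ;

! 3. Morphotactics (continuation lexica)
LEXICON N2

%<n%>:0 # ;    ! placeholder: adds the noun tag and stops

! 4. Stem lexicons
LEXICON Nouns    ! hypothetical lexicon name

кӗнеке:кӗнек N2 ; ! "llibre, книга"
```

lexc does not require lexica to be defined before they are referenced, so this order is a readability convention, not a requirement of the compiler.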

Multicharacter symbols

Morphological categories must be enclosed in < and >. Tag names may contain the letters a-z and the digits 0-9; in extreme cases they may also include the letters A-Z. They must begin with a letter, never with a digit. In the .lexc file the angle brackets are escaped with %, as in the examples below.

Examples:

%<n%> Noun
%<p3%> Third person
%<evid%> Evidential

For information on archiphonemes, see the corresponding page.

The list of symbols should be laid out in the following order:

The major parts of speech
The morphological categories
Archiphonemes
Other symbols, e.g. the morpheme boundary, ' ', '-', etc.

Every symbol should have a comment. The comments should line up.
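
A symbol list following this ordering, with aligned comments, could look roughly like this. The three tags are the examples given above; the archiphoneme %{A%} and the morpheme boundary %> are assumed notations used here only to fill the last two groups.

```lexc
Multichar_Symbols

! Major parts of speech
%<n%>       ! Noun

! Morphological categories
%<p3%>      ! Third person
%<evid%>    ! Evidential

! Archiphonemes
%{A%}       ! Hypothetical archiphoneme (e.g. a harmonising vowel)

! Other symbols
%>          ! Morpheme boundary (assumed notation)
```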

Morphotactics

Naming continuation lexica

Continuation lexica are named in upper case and may contain letters, digits and the hyphen (-).
- Examples: LEXICON N1, LEXICON DET-DEM, LEXICON ADV

What sorts of distinctions to make

TODO: TV vs. IV, Russian vs. non-Russian in Chuvash

Stem lexicons

TODO: Why stems go in lexicon and not infinitives

Lines in the stem lexicons should follow this pattern:

Left side (lexical form)
Colon :
Right side (surface form)
Space
Continuation lexicon
Space
Semicolon ;
Space
Exclamation mark !
Space
Open quote "
Gloss (optional)
Close quote "

Example:

кӗнеке:кӗнек N2 ; ! "llibre, книга"

Morphophonology

TODO: px3 is sIn (and why)

Categorisation

Nominals

Compound Nouns

TODO: N-N compounds with <px3>

Adjectives

A1: adjectives that can be both substantivised and adverbialised
- all three readings (<adj>, <adj.subst> and <adj.advl>)
- have comparison levels
A2: derived/not fully lexicalised adjectives without an adverbial reading
- <adj> and <adj.subst> readings
- have comparison levels
A3: derived/not fully lexicalised adjectives without an adverbial reading
- so-called "predicatives" (бар, жоқ)
- no comparison levels at all
A4: "pure" adjectives
- no adverbial or substantive readings
- no comparison levels

Examples by language

Chuvash

Type	Example	Reading	Phrase
A1	лайӑх "good"	`<adj>`	Ку лайӑх кĕнеке.
	лайӑхтарах	`<adj><comp>`	Ку лайӑхтарахчĕ.
	лайӑх	`<adj><advl>`	Вӑл лайӑх ишет.
	лайӑххисем	`<adj><subst><pl>`
A2	кӑвак "blue"	`<adj>`
	кӑвакрах	`<adj><comp>`
	*кӑвак	`<adj><advl>`
	кӑвак	`<adj><subst><pl>`
A3	вилĕ "dead"	`<adj>`
	*вилĕрех, *вилĕтерех	`<adj><comp>`
	*вилĕ	`<adj><advl>`
	вилĕ	`<adj><subst><pl>`
A4	тĕп "main"	`<adj>`
	*тĕпрех, *тĕптерех	`<adj><comp>`	—
	*тĕп	`<adj><advl>`	—
	*тĕп	`<adj><subst>`	—
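
As a rough sketch, the four classes could map onto continuation lexica as below, using the Chuvash stems from the table above. Only the tags are modelled: the surface side of every suffix is left as 0, so this shows which readings each class licenses, not how the forms are spelled. The lexicon names Adjectives and ADJ-LEVELS, the placement of <comp> after <subst> and <advl>, and the readings given to A3 (taken from the Chuvash rows rather than from the class description) are assumptions made for illustration.

```lexc
Multichar_Symbols

%<adj%>      ! Adjective
%<subst%>    ! Substantivised
%<advl%>     ! Adverbialised
%<comp%>     ! Comparative

LEXICON Root

Adjectives ;

LEXICON ADJ-LEVELS    ! hypothetical helper: positive or comparative

# ;                   ! positive: no further tag
%<comp%>:0 # ;        ! comparative: surface suffix omitted in this sketch

LEXICON A1            ! all three readings, each with comparison

%<adj%>:0 ADJ-LEVELS ;
%<adj%>%<subst%>:0 ADJ-LEVELS ;
%<adj%>%<advl%>:0 ADJ-LEVELS ;

LEXICON A2            ! no adverbial reading; comparison allowed

%<adj%>:0 ADJ-LEVELS ;
%<adj%>%<subst%>:0 ADJ-LEVELS ;

LEXICON A3            ! no adverbial reading; no comparison

%<adj%>:0 # ;
%<adj%>%<subst%>:0 # ;

LEXICON A4            ! attributive reading only

%<adj%>:0 # ;

LEXICON Adjectives    ! hypothetical stem lexicon

лайӑх:лайӑх A1 ; ! "good"
кӑвак:кӑвак A2 ; ! "blue"
вилĕ:вилĕ A3 ; ! "dead"
тĕп:тĕп A4 ; ! "main"
```

Routing comparison through one shared lexicon keeps the difference between classes visible in a single place: A1 and A2 continue to ADJ-LEVELS, while A3 and A4 end immediately.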

Kazakh

Tatar

Turkish

Adverbs

Postpositions

TODO: "postpositions" which take poss./case are nouns

Finite verbs

Non-finite verbs

This section outlines what categories of non-finite verb forms exist in Turkic, and how to identify the type of category created by a given affix.

Language specific issues

Turkmen: stem-final voiced and voiceless stops

In Turkmen, there are three types of stem-final stops:

voiced stops
voiceless stops
stops that are voiceless syllable-finally and voiced intervocalically

TODO: finish description of this and explain how it can be / is dealt with

Chuvash: Russian loans ending in -a with non-final stress
