Difference between revisions of "Indic"

Revision as of 00:11, 22 November 2013

Status

The ultimate goal is to have multi-purpose transducers for a variety of Indic languages. Two such transducers can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and transfer rules plus a bilingual dictionary for the pair X→Y. Development progress for each language's transducer and for each dictionary pair is listed below.

Transducers

See also: Indic lexicon

Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".
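As a rough illustration of these thresholds, the sketch below computes naive coverage from analyser output in the Apertium stream format, where an unknown token appears as `^word/*word$`. The function names, the sample string, and the "in development" label for anything under 80% are assumptions made for this example, not part of any Apertium tool.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(stream):
    """Return the fraction of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    # An analysis that starts with '*' marks a token the transducer does not know.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def state(cov):
    """Map a coverage fraction to the states used on this page."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    return "in development"  # hypothetical label; the page names no state below ~80%


# Hypothetical analyser output: one known token, one unknown token.
sample = "^ghar/ghar<n><m><sg><nom>$ ^xyzzy/*xyzzy$"
print(coverage(sample), state(coverage(sample)))
```

Real coverage figures would come from running the compiled analyser over whole corpora; the two-token sample only shows the counting logic.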

name	Language	ISO 639-1	ISO 639-3	formalism	state	stems	coverage	location	primary authors
`apertium-hin`	Hindi	`hi`	`hin`	HFST (lexc+twol)	production	37,833	-	apertium-hin (languages)	Nikant, Abu Zaher Md. Faridee, Fran
`apertium-urd`	Urdu	`ur`	`urd`	HFST (lexc+twol)	production	14,943	-	apertium-urd (languages)	-
`apertium-ben`	Bengali	`bn`	`ben`	HFST (lexc+twol)	production	{{#1st:apertium-ben/stats|stems}}	-	apertium-ben (languages)	Abu Zaher Md. Faridee
`apertium-san`	Sanskrit	`sa`	`san`	HFST (lexc+twol)	production	123,373	-	apertium-san (languages)	Amba Kulkarni

Indic Language Classification

Dardic: Pashayi, Khowar, Kohistani, Shina, Kashmiri
Northern Zone:
- Central Pahari
  - Garhwali, Kumauni
- Eastern Pahari
  - Nepali
North-Western Zone:
- Dogri-Kangri
  - Dogri, Kangri, Mandeali, etc.
- Punjabi
- Lahnda
- Sindhi
Western Zone:
- Rajasthani
  - Marwari, Rajasthani
- Gujarati
- Bhil
- Khandeshi
- Domari-Romani
Hindi
Southern Zone:
- Marathi
- Konkani
- Insular Indic
  - Sinhalese, Maldivian
Eastern Zone:
- Bihari
  - Bhojpuri, Maithili, etc.
- Bengali
- Oriya
- Tharu
Sanskrit

Pairs

Some Turkic languages that are particularly similar to one another (and hence have high levels of mutual intelligibility) include those in the following list:

Turkish, Azerbaycani, (Turkmen)
Qazaq, Qaraqalpaq, Noğay, (Qırğız)
Kumyk, Karachay-Balkar
Tatar, Bashqort
Uzbek, Uyğur
Shor, Khakas(, Altay)
Sakha, Dolgan
Tuvan, Tofa

Chuvash is very distant from other Turkic languages and is not even partially mutually intelligible with any of them.

Table of dix progress

As counted with https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/get_stems.py

	tur	aze	tuk	uzb	kir	kaz	tat	chv	bak	uig	khk	eng	rus
tur	—
aze		—
tuk			—
uzb				—
kir					—
kaz						—
tat							—
chv								—
bak									—
uig										—
khk											—
eng												—
rus													—
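The linked script is not reproduced here. As a hedged approximation of what "counting stems" can mean for a bilingual dictionary, the sketch below counts the distinct combinations of left-side lemma and first tag among the `<e>` entries of a `.dix` file. The function `count_stems` and the sample dictionary are illustrative assumptions and may not match what `get_stems.py` actually counts.

```python
import xml.etree.ElementTree as ET


def count_stems(dix_xml):
    """Count distinct (lemma, first tag) pairs on the left side of <e> entries."""
    root = ET.fromstring(dix_xml)
    stems = set()
    for entry in root.iter("e"):
        left = entry.find("./p/l")
        if left is None:
            continue  # skip entries that have no <p><l> pair
        parts = [left.text or ""]
        for child in left:
            if child.tag == "b":  # <b/> stands for a space in multiword lemmas
                parts.append(" ")
            parts.append(child.tail or "")
        lemma = "".join(parts)
        first_tag = left.find("s")
        pos = first_tag.get("n") if first_tag is not None else ""
        stems.add((lemma, pos))
    return len(stems)


# Hypothetical three-entry dictionary; two entries share the same left side.
SAMPLE = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>ghar<s n="n"/></l><r>ghar<s n="n"/></r></p></e>
    <e><p><l>kitab<s n="n"/></l><r>kitab<s n="n"/></r></p></e>
    <e><p><l>kitab<s n="n"/></l><r>pustak<s n="n"/></r></p></e>
  </section>
</dictionary>"""

print(count_stems(SAMPLE))
```

Counting left sides as a set means that one source word with several translations is counted once, which is one reasonable reading of "stems" for a progress table.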

Turkic-Turkic pairs

See also: Turkic-Turkic translator

Text in italic denotes language pairs under development / in the incubator. Regular text denotes a functioning language pair in staging, while text in bold denotes a stable well-working language pair in trunk.

	tur	aze	tuk	uzb	kir	kaz	tat	chv	bak	uig
tur	—	tur-aze	tur-tuk	tur-uzb	tr-ky		tr-tt	tr-cv
aze	aze-tur	—
tuk	tuk-tur		—
uzb	tur-uzb			—
kir	ky-tr				—	ky-kk
kaz					kaz-kir	—	kaz-tat
tat	tt-tr					tat-kaz	—		tt-ba
chv	cv-tr							—
bak							ba-tt		—
uig										—

Pairs with non-Turkic languages

	tur	kir	kaz	chv
eng	tr-en	ky-en	kaz-eng
fr
es
it
ru				cv-ru
mng/khk			mn-kk

Roadmap

Stable release of apertium-kaz
Stable release of apertium-tat
Stable release of apertium-kaz-tat
Rework apertium-kir to match new standards
Bring apertium-bak up to date (based on apertium-tat)
Expand apertium-tat-bak
Beta release of apertium-kaz-kir
Expand apertium-tuk
Expand apertium-chv
Basic transducers for:
- Khakas
- Tuvan
- Sakha
- Shor
- Qaraqalpaq (based on ~~Tatar~~ Kazakh, probably, no?)
- ~~Uzbek~~
- Uyghur
- ~~Nogay~~
- ~~Kumyk~~

Getting involved

We have a work plan for developing Turkic-Turkic translators and are working on a how-to for building a Turkic lexicon. Please come talk to us on IRC or contact us on the apertium-turkic mailing list.

Tagset

Rough guide to tagsets in various Turkic language transducers, with an eye to keeping stuff that is basically the same tagged the same. In the following table, ^A stands for Apertium and ^T stands for TRmorph (See also the general tagset list).

Phenomenon	Morphology	Description	Tag(s)	Language(s)	Notes
Part of speech
Noun			`<n>`
Proper noun			`<np>`
Determiner			`<det>`
Numeral			`<num>`
Adjective			`<adj>`		incl. var/yok
Adverb			`<adv>`
Pronoun			`<prn>`
Verb			`<v>`
Auxiliary verb			`<vaux>`
Copula			`<cop>`
Postadverb			`<postadv>`
Postposition			`<post>`
Particle^[1]			`<part>`
Coordinating conjunction			`<cnjcoo>`
Subordinating conjunction			`<cnjsub>`
Adverbial conjunction			`<cnjadv>`
Abbreviation			`<abbr>`
Personal Title			`<title>`
Interjection			`<ij>`		Әлбетте^(kaz), жок^(kir) (cf. Adj)
Proper noun types
Toponym			`<top>`
Anthroponym			`<ant>`
Patronym			`<pat>`
Cognomen (Surname)			`<cog>`
Acronym			`<acr>`
Other			`<al>`
Pronoun types
Personal			`<pers>`
Ordinal			`<ord>`
Demonstrative			`<dem>`
Indefinite			`<ind>`
Interrogative			`<itg>`
Reflexive			`<ref>`
Quantifier			`<qnt>`
Positive			`<pst>`
Negative			`<neg>`
Comparative			`<comp>`
Reciprocal			`<recip>`
First person			`<p1>`
Second person			`<p2>`
Third person			`<p3>`
Numeral types
Substantive		Substantive form of numerals (when they are used as the head of the noun phrase)	`<subst>`
Ordinal			`<ord>`		Chuvash: -мĂш
Distributive			`<dist>`		Chuvash: -шĂр
Collective			`<coll>`		Chuvash: -Ăн
Case
Nominative case (unmarked)			`<nom>`
Genitive case			`<gen>`
Dative case			`<dat>`
Locative case			`<loc>`
Ablative case	-DAn	Case indicating movement away	`<abl>`	Pan-turkic
Comitative case			`<com>`
Terminative case			`<ter>`		Chuvash: -ччен
Benefactive (Purposive) case			`<ben>`		Chuvash: -шĂн
Allative (Directive) case		Case indicating motion towards something	`<all>`		Chuvash: -АллА
Possession
1st pers sg			`<px1sg>`
1st pers pl			`<px1pl>`
2nd pers sg			`<px2sg>`
2nd pers pl			`<px2pl>`
3rd pers sg			`<px3sg>`
3rd pers pl			`<px3pl>`
3rd pers sg or pl			`<px3sp>`
Gender
Masculine			`<m>`
Feminine			`<f>`
Masculine / feminine			`<mf>`
Number
Singular			`<sg>`
Plural			`<pl>`
Tense, aspect, mood
Present tense			`<pres>`
Present continuous tense			`<cont>`		Turkish: -{bI}yor
Evidential tenseless/past tense			`<evid>`		Turkish: -m{I}ş (<past><evid>)
Past tense			`<past>`		Kyrgyz: -{G}{A}н
Definite past tense			`<ifi>`		Turkish: -{D}{I}
Imperfect			`<pii>`		Turkish: Aorist + -m{A}kt{A}
Past habitual tense			`<pih>`		Turkish: Aorist + -{D}{I}
Future tense			`<fut>`		Turkish: -{bY}{A}c{A}{k}
Imperative	-ø	Mood for giving orders	`<imp>`^A, `<t_imp>`^T	Pan-turkic	Turkish: -ø
Conditional			`<cond>`		Turkish: -s{A}
Aorist			`<aor>`		Turkish: -{A}r or -{bI}r
Optative			`<opt>`		Turkish: -{bY}{A}, Kyrgyz: -мAк>чI
Obligative			`<oblig>`		Turkish: -m{A}l{I}
Potential			`<pot>`		Kyrgyz: -чUдAй
Not-yet tense			`<notyet>`		Kyrgyz: -E элек
Non-finite verb forms
Gerund		makes verbs usable as nouns	`<ger>`, `<vn>`?
Verbal adjective		makes verbs usable as adjectives	`<vadj>`
Participle		makes verbs usable as matrix verbs with auxiliaries and modals	`<part>`, `<vadv>`?
???		makes verb usable as first of a dual-predicate construction	??
Infinitive		citation form of verb and use in certain constructions	`<inf>`?
Gerund #1			`<ger1>`		Turkish: -m{A}
Gerund #2			`<ger2>`		Turkish: -m{A}{K}
Gerund #3			`<ger3>`		Turkish: -{D}{I}{k}
Gerund #4			`<ger4>`		Turkish: -{bY}{I}ş
Gerund #5			`<ger5>`		Turkish: -{bY}{A}n
Gerund #6			`<ger6>`		Turkish: -{bY}{A}r{A}k
Gerund #7			`<ger10>`		Turkish: -{bY}{I}p
Future gerund #1			`<fger>`		Turkish: -{bY}{A}c{A}{k}
Imperfect participle #1			`<fger>`		Turkish: -{bY}{A}r{A}{k}

Productive verbal derivation
Passive			`<pass>`
Causative			`<caus>`
Cooperative			`<coop>`		-{I}ш^(kir), -{I}с^(kaz)
Transitivity
Transitive, переходный			`<tv>`
Intransitive, непереходный			`<iv>`
Modal/question/etc. "particles"
Question		used with yes/no, focus, etc. question morphemes	`<qst>`	most-all	{М}{А}^(kaz), {B}{I}^(kir), m{I}^(tur); +ше^(kaz), ч{I}^(kir)
Emphatic		used with imperative/optative and other coercive verb forms	`<emph>`	most	+ш{I}^(kaz), +ч{I}^(kir),^(tat), s{A}n{A}^(tur)
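To show how the tags in this table appear in practice, here is a small sketch that splits one analysis string into its lemma and tags, then glosses the tags using a handful of rows from the table. The example analysis `ev<n><pl><px1sg><loc>`, the wording of the possessive gloss, and the helper names are assumptions chosen for illustration; real transducer output may order or name tags differently.

```python
import re

# A few glosses taken from the tagset table (not the full tagset).
TAG_GLOSSES = {
    "n": "Noun",
    "v": "Verb",
    "sg": "Singular",
    "pl": "Plural",
    "nom": "Nominative case",
    "loc": "Locative case",
    "px1sg": "Possessive, 1st pers sg",
    "past": "Past tense",
}

TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(analysis):
    """Split 'lemma<tag1><tag2>' into ('lemma', ['tag1', 'tag2'])."""
    lemma = analysis.split("<", 1)[0]
    tags = TAG.findall(analysis)
    return lemma, tags


def gloss(analysis):
    """Return a readable description of an analysis string."""
    lemma, tags = parse_analysis(analysis)
    described = [TAG_GLOSSES.get(tag, "unlisted tag <%s>" % tag) for tag in tags]
    return "%s: %s" % (lemma, ", ".join(described))


# Hypothetical analysis of a Turkish-style noun form.
print(gloss("ev<n><pl><px1sg><loc>"))
```

Keeping the glosses in one dictionary mirrors the purpose of the table: the same phenomenon gets the same tag across transducers, so one lookup can serve several languages.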

Official poem

Kovayla bira içerim, ama sen bilmezsin. Yarın gelir misin?
Vedrəyle pivə içirəm, ama sen bilməzsən. Yarın gələrmisən?
Чиләк белән сыра эчәм, әмма син белмисең. Иртәгә киләсеңме?
Шелекпен сыра ішемін, бірақ сен білмейсің. Ертең келесің бе?
Чака менен сыра ичем, бирок сен билбейсиң. Эртең келесиңби?
Челек булан пиво ичемен амма сен билмейсен. Эртен гелемисен?

Footnotes

[1] Warning: The use of the particle tag is highly discouraged.

@@ Line 57: / Line 57: @@
 || HFST (lexc+twol)
 || production
-|align="right"|{{:bnmorph/stems}}
+|align="right"|{{#1st:apertium-ben/stats|stems}}
 |align="center"| -
 || [[apertium-ben]]&nbsp;([[languages]])

THIS PAGE IS UNFINISHED