Difference between revisions of "Indic languages"

Revision as of 18:15, 22 November 2013

Status

The ultimate goal is to have multi-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation by adding a CG (constraint grammar) for language X plus transfer rules and a dictionary for the pair X→Y. Development progress for each language's transducers and dictionary pairs is listed below.

Transducers

Once a transducer reaches roughly 80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".
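The thresholds above can be sketched as a small check. This is only an illustration, not the project's actual coverage script: it assumes analyser output in Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*` (for example `^foo/*foo$`). The function names, the exact boundary handling (at least 80%, strictly above 90%) and the label for anything below 80% are assumptions made for this example, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ or \$ are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received an analysis.

    Assumes unknown words carry a leading '*' in the analysis part,
    e.g. ^foo/*foo$.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analysis in units if not analysis.startswith("*"))
    return known / len(units)


def classify(cov: float) -> str:
    """Map a coverage fraction to the status labels used on this page."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    # Hypothetical label: the page does not name a state below 80%.
    return "below working threshold"


if __name__ == "__main__":
    sample = "^a/a<n>$ ^b/*b$ ^c/c<vblex>$ ^d/d<n>$ ^e/e<adj>$"
    cov = coverage(sample)
    print(f"{cov:.1%} -> {classify(cov)}")
```

In practice the figure should be computed over several medium-to-large corpora, as the text says, rather than a single sample.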

name	Language	ISO 639-1	ISO 639-3	formalism	state	stems	coverage	location	primary authors
`apertium-hin`	Hindi	`hi`	`hin`	HFST (lexc+twol)	production	37,833	-	apertium-hin (languages)	Nikant, Abu Zaher Md. Faridee, Fran
`apertium-urd`	Urdu	`ur`	`urd`	HFST (lexc+twol)	production	14,943	-	apertium-urd (languages)	-
`apertium-ben`	Bengali	`bn`	`ben`	HFST (lexc+twol)	production	8,230	-	apertium-ben (languages)	Abu Zaher Md. Faridee
`apertium-san`	Sanskrit	`sa`	`san`	HFST (lexc+twol)	production	123,373	-	apertium-san (languages)	Amba Kulkarni

Indic Language Classification

Dardic: Pashayi, Khowar, Kohistani, Shina, Kashmiri
Northern Zone:
- Central Pahari
  - Garhwali, Kumauni
- Eastern Pahari
  - Nepali
North-Western Zone: Punjabi, Lahnda, Sindhi
- Dogri-Kangri
  - Dogri, Kangri, Mandeali, etc.
Western Zone: Gujarati, Bhil, Khandeshi, Domari-Romani
- Rajasthani
  - Marwari, Rajasthani
Hindi
Southern Zone: Marathi, Konkani
- Insular Indic: Sinhalese, Maldivian
Eastern Zone: Bengali, Oriya, Tharu
- Bihari
  - Bhojpuri, Maithili, etc.
Sanskrit

Indic-Indic pairs

	hin	ben	urd	san
hin	-
ben	bn-hi	-
urd	ur-hi		-
san				-

Pairs with non-Indic languages

	eng	as	mr	pa	fa
hin	eng-hin	as-hi	mr-hi	pa-hi
ben	bn-en
urd				ur-pa	ur-fa
san

Tagset

Rough guide to the tagsets used in the various Indic language transducers, with an eye to keeping phenomena that are essentially the same tagged the same way. In the following table, ^A stands for Apertium and ^T stands for TRmorph (see also the general tagset list).

Phenomenon	Morphology	Description	Tag(s)	Language(s)	Notes
Part of speech
Noun			`<n>`

@@ Line 75: / Line 75: @@
 * Dardic: [[Pahayi]], [[Khowar]], [[Kohistani]], [[Shina language]], [[Kashiri]]
 * Northern Zone:
-**Cantral Pahari
+**Central Pahari
 ***[[Garhwali]], [[Kumauni]]
 **Eastern Pahari
@@ Line 82: / Line 82: @@
 **Dogri-Kangri
 ***[[Dogri]], [[Kangri]], [[Mandeali]], etc.
-* Western Zone:
+* Western Zone: [[Gujarati]], [[Bhil]], [[Khandeshi]], [[Domari-Romani]]
 ** Rajasthani
 *** [[Marwari]], [[Rajasthani]]
-**[[Gujarati]]
-**[[Bhil]]
-**[[Khandeshi]]
-**[[Domari-Romani]]
 * [[Hindi]]
 * Southern Zone: [[Marathi]], [[Konkani]]
 ** Insular Indic: [[Sinhalese]], [[Maldivian]]
-* Eastern Zone:
+* Eastern Zone: [[Bengali]], [[Oriya]], [[Tharu]]
 ** Bihari
 *** [[Bhojpuri]], [[Maithili]], etc.
-** [[Bengali]]
-** [[Oriya]]
-** [[Tharu]]
 * [[Sanskrit]]
