Difference between revisions of "Indic languages"

Revision as of 18:32, 22 November 2013

Status

The ultimate goal is to have multi-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and transfer rules plus a bilingual dictionary for the pair X→Y. The sections below list development progress for each language's transducer and for the dictionary pairs.

Transducers

Once a transducer reaches roughly 80% coverage on a range of medium-to-large corpora, we can call it "working". Above 90%, it can be considered "production".
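The thresholds above can be checked with a small script. The following is a minimal sketch, not an official Apertium tool: it assumes the analyser output is in the Apertium stream format (`^surface/analysis$`, with unknown words marked by a leading `*` in the analysis), and it ignores escaped characters. The function names `coverage` and `classify`, the label "below working" and the sample tokens are hypothetical, chosen only for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (Assumption: no escaped '^', '$' or '/' inside the units; a real tool must handle these.)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        # The first field is the surface form; the rest are analyses.
        analyses = unit.split("/")[1:]
        # An unknown word carries a single analysis that starts with '*'.
        if analyses and not analyses[0].startswith("*"):
            known += 1
    return known / len(units)


def classify(cov: float) -> str:
    """Map a coverage fraction to the informal states described above."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    return "below working"  # hypothetical label, not used on this page


# Hypothetical sample: five tokens, one of them unknown.
sample = "^foo/foo<n>$ ^bar/*bar$ ^baz/baz<n>$ ^qux/qux<n>$ ^quux/quux<n>$"
print(coverage(sample), classify(coverage(sample)))
```

This counts tokens rather than types, so frequent unknown words weigh more heavily; whether that matches how coverage is measured for a given corpus is an assumption worth checking before comparing figures.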

name	Language	ISO 639-1	ISO 639-3	formalism	state	stems	coverage	location	primary authors
`apertium-hin`	Hindi	`hi`	`hin`	lttoolbox	production	37,833	-	apertium-hin (languages)	Nikant, Abu Zaher Md. Faridee, Fran
`apertium-urd`	Urdu	`ur`	`urd`	lttoolbox	nursery	14,943	-	apertium-urd (languages)	-
`apertium-ben`	Bengali	`bn`	`ben`	lttoolbox	nursery	8,230	-	apertium-ben (languages)	Abu Zaher Md. Faridee
`apertium-san`	Sanskrit	`sa`	`san`	lttoolbox	production	123,373	-	apertium-san (languages)	Amba Kulkarni

Indic Language Classification

Dardic: Pashayi, Khowar, Kohistani, Shina, Kashmiri
Northern Zone:
- Central Pahari
  - Garhwali, Kumauni
- Eastern Pahari
  - Nepali
North-Western Zone: Punjabi, Lahnda, Sindhi
- Dogri-Kangri
  - Dogri, Kangri, Mandeali, etc.
Western Zone: Gujarati, Bhil, Khandeshi, Domari-Romani
- Rajasthani
  - Marwari, Rajasthani
Hindi
Southern Zone: Marathi, Konkani
- Insular Indic: Sinhalese, Maldivian
Eastern Zone: Bengali, Oriya, Tharu
- Bihari
  - Bhojpuri, Maithili, etc.
Sanskrit

Indic-Indic pairs

	hin	ben	urd	san
hin	-
ben	bn-hi	-
urd	ur-hi		-
san				-

Pairs with non-Indic languages

	eng	as	mr	pa	fa
hin	eng-hin	as-hi	mr-hi	pa-hi
ben	bn-en
urd				ur-pa	ur-fa
san

Tagset

A rough guide to the tagsets used in the various Indic language transducers, with an eye to keeping equivalent phenomena tagged identically. In the following table, ^A stands for Apertium and ^T stands for TRmorph (see also the general tagset list).

Phenomenon	Morphology	Description	Tag(s)	Language(s)	Notes
Part of speech
Noun			`<n>`

@@ Line 41: / Line 41: @@
 || <code>urd</code>
 || [[lttoolbox]]
-|| production
+|| [[nursery]]
 |align="right"| {{#lst:apertium-urd/stats|stems}}
 |align="center"| -
