Difference between revisions of "Indic languages"

Revision as of 02:46, 22 November 2013

Status

The ultimate goal is to have multi-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation by adding a CG (constraint grammar) for language X and transfer rules and a dictionary for the pair X→Y. Development progress for each language's transducer and for each dictionary pair is listed below.

Transducers

Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can call it "working". Above 90%, it can be considered "production".
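The thresholds above can be sketched as a small script. This is a minimal illustration, not an official Apertium tool: it assumes the corpus has already been run through the analyser and is in Apertium stream format, where an unknown word appears as `^word/*word$`. The function names and the "not yet working" label are hypothetical, and "naive coverage" here simply means the share of tokens that received at least one analysis.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Assumption: escaped characters (e.g. \/ or \$) inside units are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")

WORKING_THRESHOLD = 0.80      # "~80%" in the text above
PRODUCTION_THRESHOLD = 0.90   # "over 90%" in the text above


def naive_coverage(analysed_text):
    """Return the fraction of lexical units with at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in the analysis field.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return known / len(units)


def transducer_state(coverage):
    """Map a coverage fraction to the state names used on this page."""
    if coverage > PRODUCTION_THRESHOLD:
        return "production"
    if coverage >= WORKING_THRESHOLD:
        return "working"
    return "not yet working"  # hypothetical label, not from the page


if __name__ == "__main__":
    sample = "^ghar/ghar<n><m><sg>$ ^xyzzy/*xyzzy$"
    cov = naive_coverage(sample)
    print(f"coverage={cov:.2%} state={transducer_state(cov)}")
```

Since the page asks for coverage "on a range of corpora", one would run this per corpus and look at the lowest figure rather than a single number.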

name	Language	ISO 639-1	ISO 639-3	formalism	state	stems	coverage	location	primary authors
`apertium-hin`	Hindi	`hi`	`hin`	HFST (lexc+twol)	production	37,833	-	apertium-hin (languages)	Nikant, Abu Zaher Md. Faridee, Fran
`apertium-urd`	Urdu	`ur`	`urd`	HFST (lexc+twol)	production	14,943	-	apertium-urd (languages)	-
`apertium-ben`	Bengali	`bn`	`ben`	HFST (lexc+twol)	production	8,230	-	apertium-ben (languages)	Abu Zaher Md. Faridee
`apertium-san`	Sanskrit	`sa`	`san`	HFST (lexc+twol)	production	123,373	-	apertium-san (languages)	Amba Kulkarni

Indic Language Classification

Dardic: Pashayi, Khowar, Kohistani, Shina, Kashmiri
Northern Zone:
- Central Pahari
  - Garhwali, Kumauni
- Eastern Pahari
  - Nepali
North-Western Zone:
- Dogri-Kangri
  - Dogri, Kangri, Mandeali, etc.
- Punjabi
- Lahnda
- Sindhi
Western Zone:
- Rajasthani
  - Marwari, Rajasthani
- Gujarati
- Bhil
- Khandeshi
- Domari-Romani
Hindi
Southern Zone:
- Marathi
- Konkani
- Insular Indic
  - Sinhalese, Maldivian
Eastern Zone:
- Bihari
  - Bhojpuri, Maithili, etc.
- Bengali
- Oriya
- Tharu
Sanskrit

Indic-Indic pairs

	hin	ben	urd	san
hin	-
ben	bn-hi	-
urd	ur-hi		-
san				-

Pairs with non-Indic languages

	eng	as	mr	pa	fa
hin	eng-hin	as-hi	mr-hi	pa-hi
ben	bn-en
urd				ur-pa	ur-fa
san

Tagset

Rough guide to the tagsets used in the various Indic language transducers, with an eye to keeping phenomena that are essentially the same tagged the same way. In the following table, ^A stands for Apertium and ^T stands for TRmorph (see also the general tagset list).

Phenomenon	Morphology	Description	Tag(s)	Language(s)	Notes
Part of speech
Noun			`<n>`

@@ Line 114: / Line 114: @@
 !           !!    hin    !! ben !! urd !! san
 |-
-| '''hin''' ||     -     ||     ||     ||     |
+| '''hin''' ||     -     ||     ||     ||
 |-
-| '''ben''' || [[bn-hi]] ||  -  ||     ||     |
+| '''ben''' || [[bn-hi]] ||  -  ||     ||
 |-
-| '''urd''' || [[ur-hi]] ||     ||  -  ||     |
+| '''urd''' || [[ur-hi]] ||     ||  -  ||
 |-
-| '''san''' ||           ||     ||     ||  -  |
+| '''san''' ||           ||     ||     ||  -
 |-
