Difference between revisions of "Indic"
{{TOCD}}

=== THIS PAGE IS UNFINISHED ===

The '''Indic languages''' include [[Hindi]], [[Urdu]], [[Bengali]], [[Sanskrit]], and several other languages. Together they form the dominant language family of the Indian subcontinent, with upwards of 900,000,000 speakers.

The plan is to build an independent finite-state transducer for each language, and then to write a bilingual dictionary and transfer rules for every language pair. The current status of these goals is listed below.

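To give a feel for how much work the plan implies, the following minimal sketch enumerates the components needed for the four languages that currently have transducers. The function name <code>required_components</code> is hypothetical, and the sketch assumes one shared dictionary per unordered pair and one transfer rule set per translation direction; the actual number of files in a given pair may differ.

```python
from itertools import combinations, permutations

# Language codes taken from the transducer table on this page.
LANGUAGES = ["hin", "urd", "ben", "san"]


def required_components(languages):
    """Enumerate what the plan needs for a set of languages.

    Assumptions (not stated on this page): a bilingual dictionary is
    shared by both directions of a pair, while transfer rules are
    written separately for each translation direction.
    """
    return {
        "transducers": list(languages),
        "dictionaries": [f"{a}-{b}" for a, b in combinations(languages, 2)],
        "transfer_rules": [f"{a}->{b}" for a, b in permutations(languages, 2)],
    }


components = required_components(LANGUAGES)
for kind, items in components.items():
    print(f"{kind}: {len(items)}")
```

The number of transducers grows linearly with the number of languages, while the pair-specific resources grow quadratically, which is why reusable per-language transducers are built first.
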
==Status==

The ultimate goal is to have reusable, general-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation by adding a [[Constraint Grammar|CG]] for language X and transfer rules and a dictionary for the pair X→Y. The tables below list development progress for each language's transducer and for each dictionary pair.

=== Transducers ===

A transducer counts as "working" once it reaches roughly 80% coverage on a range of medium-to-large corpora, and as "production" once it exceeds 90%.

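The thresholds above can be sketched as follows. This is a rough illustration, not the project's own coverage script: it assumes the analyser output is in Apertium stream format, where an unknown word appears as <code>^word/*word$</code>, it ignores escaped characters, and the function names and the label "in development" are made up for the example.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside units are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received an analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are marked with a leading '*' in the analysis field.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def state(cov: float) -> str:
    """Map a coverage value to the states described above."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:  # "~80%" is approximate; the exact cut-off is an assumption
        return "working"
    return "in development"  # hypothetical label for anything below "working"


# Made-up sample: two analysed tokens and two unknown tokens.
sample = "^ghar/ghar<n><m><sg>$ ^foo/*foo$ ^hai/ho<vbser><pres>$ ^bar/*bar$"
print(coverage(sample), state(coverage(sample)))
```

In practice the figure should be measured over several corpora, as the paragraph above says, rather than over a single text.
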
{| class="wikitable sortable"
|-
!rowspan=2| name
!rowspan=2| Language
!colspan=2 class="unsortable"| ISO 639
!rowspan=2| formalism
!rowspan=2| state
!rowspan=2| stems
!rowspan=2| coverage
!rowspan=2| location
!rowspan=2 class="unsortable"| primary authors
|-class="sortbottom"
! -2
! -3
|-
|| <code>[[apertium-hin]]</code>
|| [[Hindi]]
|| <code>hi</code>
|| <code>hin</code>
|| HFST (lexc+twol)
|| production
|align="right"| {{#lst:apertium-hin/stats|stems}}
|align="center"| -
|| [[apertium-hin]] ([[languages]])
|| [[User:Nikant|Nikant]], [[User:darthxaher|Abu Zaher Md. Faridee]], [[User:Francis Tyers|Fran]]
|-
|| <code>[[apertium-urd]]</code>
|| [[Urdu]]
|| <code>ur</code>
|| <code>urd</code>
|| HFST (lexc+twol)
|| production
|align="right"| {{#lst:apertium-urd/stats|stems}}
|align="center"| -
|| [[apertium-urd]] ([[languages]])
|| -
|-
|| <code>[[apertium-ben]]</code>
|| [[Bengali]]
|| <code>bn</code>
|| <code>ben</code>
|| HFST (lexc+twol)
|| production
|align="right"| {{#lst:apertium-ben/stats|stems}}
|align="center"| -
|| [[apertium-ben]] ([[languages]])
|| [[User:darthxaher|Abu Zaher Md. Faridee]]
|-
|| <code>[[apertium-san]]</code>
|| [[Sanskrit]]
|| <code>sa</code>
|| <code>san</code>
|| HFST (lexc+twol)
|| production
|align="right"| {{#lst:Apertium-san/stats|stems}}
|align="center"| -
|| [[apertium-san]] ([[languages]])
|| Amba Kulkarni
|-
|}

=== Indic Language Classification ===

* Dardic: [[Pashayi]], [[Khowar]], [[Kohistani]], [[Shina language]], [[Kashmiri]]
* Northern Zone:
** Central Pahari
*** [[Garhwali]], [[Kumauni]]
** Eastern Pahari
*** [[Nepali]]
* North-Western Zone:
** Dogri-Kangri
*** [[Dogri]], [[Kangri]], [[Mandeali]], etc.
** [[Punjabi]]
** [[Lahnda]]
** [[Sindhi]]
* Western Zone:
** Rajasthani
*** [[Marwari]], [[Rajasthani]]
** [[Gujarati]]
** [[Bhil]]
** [[Khandeshi]]
** [[Domari-Romani]]
* [[Hindi]]
* Southern Zone:
** [[Marathi]]
** [[Konkani]]
** Insular Indic
*** [[Sinhalese]], [[Maldivian]]
* Eastern Zone:
** Bihari
*** [[Bhojpuri]], [[Maithili]], etc.
** [[Bengali]]
** [[Oriya]]
** [[Tharu]]
* [[Sanskrit]]

==== Indic-Indic pairs ====

{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! hin !! ben !! urd !! san
|-
| '''hin''' || - || || ||
|-
| '''ben''' || [[bn-hi]] || - || ||
|-
| '''urd''' || [[ur-hi]] || || - ||
|-
| '''san''' || || || || -
|-
|}

==== Pairs with non-Indic languages ====

{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! eng !! as !! mr !! pa !! fa
|-
| '''hin''' || [[eng-hin]] || [[as-hi]] || [[mr-hi]] || [[pa-hi]] ||
|-
| '''ben''' || [[bn-en]] || || || ||
|-
| '''urd''' || || || || [[ur-pa]] || [[ur-fa]]
|-
| '''san''' || || || || ||
|}

==Tagset==

This is a rough guide to the tagsets used in the various Indic language transducers, with the aim of giving the same tag to phenomena that are essentially the same across languages. In the following table, <sup>A</sup> stands for Apertium and <sup>T</sup> stands for [[TRmorph]] (see also [[List_of_symbols|the general tagset list]]).

{|class="wikitable"
! Phenomenon !! Morphology !! Description !! Tag(s) !! Language(s) !! Notes
|-
|colspan=6 align="center"|'''Part of speech'''
|-
| Noun || || || {{tag|n}} || ||