Difference between revisions of "Indic"
(Blanked the page) |
|||
Line 1: | Line 1: | ||
− | {{TOCD}} |
||
− | === THIS PAGE IS UNFINISHED === |
||
− | |||
− | |||
− | The '''Indic languages''' include [[Hindi]], [[Urdu]], [[Bengali]], [[Sanskrit]], and several other languages. They form the dominant language family of the Indian subcontinent, with more than 900 million speakers. |
||
− | |||
− | The master plan involves generating independent finite-state transducers for each language, and then making individual dictionaries and transfer rules for every pair. The current status of these goals is listed below. |
||
− | |||
− | ==Status== |
||
− | The ultimate goal is to have reusable, general-purpose transducers for a variety of Indic languages. These can then be paired for X→Y translation by adding a [[Constraint Grammar|CG]] for language X and transfer rules and a dictionary for the pair X→Y. Development progress for each language's transducer and for each dictionary pair is listed below. |
||
− | |||
− | === Transducers === |
||
− | Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production". |
||
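The coverage thresholds above can be sketched in a few lines of Python. This is only an illustration, not part of any Apertium tool: the function names are hypothetical, coverage is assumed to be the naive fraction of corpus tokens that receive at least one analysis, and the "development" label for anything below ~80% is an assumption, since the page does not name that state.

```python
def coverage(known_tokens: int, total_tokens: int) -> float:
    """Naive coverage: share of corpus tokens with at least one analysis (assumed definition)."""
    if total_tokens <= 0:
        raise ValueError("corpus must contain at least one token")
    return known_tokens / total_tokens


def classify(cov: float) -> str:
    """Map a coverage fraction (0.0 to 1.0) to the states described above."""
    if cov > 0.90:       # "Above 90%" counts as production
        return "production"
    if cov >= 0.80:      # roughly 80% and up counts as working
        return "working"
    return "development"  # hypothetical label; the page does not name this state


# Example: 8,600 of 10,000 tokens analysed gives 0.86 coverage
print(classify(coverage(8_600, 10_000)))  # prints "working"
```

In practice the figure should be averaged over several medium-to-large corpora rather than taken from a single one, as the text above suggests.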
− | |||
− | {| class="wikitable sortable" |
||
− | |- |
||
− | !rowspan=2| name |
||
− | !rowspan=2| Language |
||
− | !colspan=2 class="unsortable"| ISO 639 |
||
− | !rowspan=2| formalism |
||
− | !rowspan=2| state |
||
− | !rowspan=2| stems |
||
− | !rowspan=2| coverage |
||
− | !rowspan=2| location |
||
− | !rowspan=2 class="unsortable"| primary authors |
||
− | |-class="sortbottom" |
||
− | ! -2 |
||
− | ! -3 |
||
− | |- |
||
− | || <code>[[apertium-hin]]</code> |
||
− | || [[Hindi]] |
||
− | || <code>hi</code> |
||
− | || <code>hin</code> |
||
− | || HFST (lexc+twol) |
||
− | || production |
||
− | |align="right"| {{#lst:apertium-hin/stats|stems}} |
||
− | |align="center"| - |
||
− | || [[apertium-hin]] ([[languages]]) |
||
− | || [[User:Nikant|Nikant]], [[User:darthxaher|Abu Zaher Md. Faridee]], [[User:Francis Tyers|Fran]] |
||
− | |- |
||
− | || <code>[[apertium-urd]]</code> |
||
− | || [[Urdu]] |
||
− | || <code>ur</code> |
||
− | || <code>urd</code> |
||
− | || HFST (lexc+twol) |
||
− | || production |
||
− | |align="right"| {{#lst:apertium-urd/stats|stems}} |
||
− | |align="center"| - |
||
− | || [[apertium-urd]] ([[languages]]) |
||
− | || - |
||
− | |- |
||
− | || <code>[[apertium-ben]]</code> |
||
− | || [[Bengali]] |
||
− | || <code>bn</code> |
||
− | || <code>ben</code> |
||
− | || HFST (lexc+twol) |
||
− | || production |
||
− | |align="right"| {{#lst:apertium-ben/stats|stems}} |
||
− | |align="center"| - |
||
− | || [[apertium-ben]] ([[languages]]) |
||
− | || [[User:darthxaher|Abu Zaher Md. Faridee]] |
||
− | |- |
||
− | || <code>[[apertium-san]]</code> |
||
− | || [[Sanskrit]] |
||
− | || <code>sa</code> |
||
− | || <code>san</code> |
||
− | || HFST (lexc+twol) |
||
− | || production |
||
− | |align="right"| {{#lst:apertium-san/stats|stems}} |
||
− | |align="center"| - |
||
− | || [[apertium-san]] ([[languages]]) |
||
− | || Amba Kulkarni |
||
− | |- |
||
− | |} |
||
− | |||
− | |||
− | === Indic Language Classification === |
||
− | * Dardic: [[Pashayi]], [[Khowar]], [[Kohistani]], [[Shina language]], [[Kashmiri]] |
||
− | * Northern Zone: |
||
− | **Central Pahari |
||
− | ***[[Garhwali]], [[Kumauni]] |
||
− | **Eastern Pahari |
||
− | ***[[Nepali]] |
||
− | * North-Western Zone: |
||
− | **Dogri-Kangri |
||
− | ***[[Dogri]], [[Kangri]], [[Mandeali]], etc. |
||
− | ** [[Punjabi]] |
||
− | ** [[Lahnda]] |
||
− | ** [[Sindhi]] |
||
− | * Western Zone: |
||
− | ** Rajasthani |
||
− | *** [[Marwari]], [[Rajasthani]] |
||
− | **[[Gujarati]] |
||
− | **[[Bhil]] |
||
− | **[[Khandeshi]] |
||
− | **[[Domari-Romani]] |
||
− | * [[Hindi]] |
||
− | * Southern Zone: |
||
− | ** [[Marathi]] |
||
− | ** [[Konkani]] |
||
− | ** Insular Indic |
||
− | *** [[Sinhalese]], [[Maldivian]] |
||
− | * Eastern Zone: |
||
− | ** Bihari |
||
− | *** [[Bhojpuri]], [[Maithili]], etc. |
||
− | ** [[Bengali]] |
||
− | ** [[Oriya]] |
||
− | ** [[Tharu]] |
||
− | * [[Sanskrit]] |
||
− | |||
− | |||
− | ==== Indic-Indic pairs ==== |
||
− | |||
− | |||
− | {| style="text-align: center;" class="wikitable" |
||
− | |- style="background: #ececec" |
||
− | ! !! hin !! ben !! urd !! san |
||
− | |- |
||
− | | '''hin''' || - || || || | |
||
− | |- |
||
− | | '''ben''' || [[bn-hi]] || - || || | |
||
− | |- |
||
− | | '''urd''' || [[ur-hi]] || || - || | |
||
− | |- |
||
− | | '''san''' || || || || - | |
||
− | |- |
||
− | |||
− | |} |
||
− | |||
− | ==== Pairs with other languages ==== |
||
− | {| style="text-align: center;" class="wikitable" |
||
− | |- style="background: #ececec" |
||
− | ! !! eng !! as !! mr !! pa !! fa |
||
− | |- |
||
− | | '''hin''' || [[eng-hin]] || [[as-hi]] || [[mr-hi]] || [[pa-hi]] || |
||
− | |- |
||
− | | '''ben''' || [[bn-en]] || || || || |
||
− | |- |
||
− | | '''urd''' || || || || [[ur-pa]] || [[ur-fa]] |
||
− | |- |
||
− | | '''san''' || || || || || |
||
− | |||
− | |} |
||
− | |||
− | ==Tagset== |
||
− | |||
− | A rough guide to the tagsets used in the various Indic language transducers, with the aim of tagging equivalent phenomena the same way across languages. In the following table, <sup>A</sup> stands for Apertium and <sup>T</sup> stands for [[TRmorph]] (see also [[List_of_symbols|the general tagset list]]). |
||
− | |||
− | {|class="wikitable" |
||
− | ! Phenomenon !! Morphology !! Description !! Tag(s) !! Language(s) !! Notes |
||
− | |- |
||
− | |colspan=6 align="center"|'''Part of speech''' |
||
− | |- |
||
− | | Noun || || || {{tag|n}} || || |