Difference between revisions of "User:Sushain/SemeticLanguages"

From Apertium

Revision as of 07:47, 3 January 2014

The Semitic languages (sem) constitute a group of related languages and a branch of the Afro-Asiatic language family. They are spoken by more than 470 million people across North Africa, Southwest Asia, and the Horn of Africa; the most widely spoken Semitic languages are Arabic, Maltese, Hebrew, Amharic, and Tigrigna.

The master plan is to build an independent finite-state transducer for each language, and then to write a bilingual dictionary and transfer rules for every language pair. The current status of these goals is listed below.

Status

The ultimate goal is to have multi-purpose transducers for a variety of Semitic languages. These can then be paired for X→Y translation by adding a Constraint Grammar (CG) for language X and transfer rules and a bilingual dictionary for the pair X→Y. Development progress for each language's transducer and for each dictionary pair is listed below.
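The pairing described above can be read as a checklist of components per translation direction. The following is a minimal sketch of that checklist, not part of Apertium itself: the names `Resources` and `missing_for` are hypothetical, and it assumes that one bilingual dictionary serves both directions of a pair while the CG and transfer rules are needed per direction.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    """Hypothetical inventory of components that already exist."""
    transducers: set = field(default_factory=set)     # languages with a transducer
    cgs: set = field(default_factory=set)             # languages with a Constraint Grammar
    bidixes: set = field(default_factory=set)         # frozenset({X, Y}) with a bilingual dictionary
    transfer_rules: set = field(default_factory=set)  # (X, Y) directions with transfer rules


def missing_for(src: str, tgt: str, res: Resources) -> list:
    """List the components still needed before src→tgt translation can work."""
    missing = []
    # Both languages need a transducer (analysis for src, generation for tgt).
    for lang in (src, tgt):
        if lang not in res.transducers:
            missing.append(f"transducer:{lang}")
    # The CG disambiguates the source-language analyses.
    if src not in res.cgs:
        missing.append(f"cg:{src}")
    # Assumption: the dictionary is shared by X→Y and Y→X.
    if frozenset((src, tgt)) not in res.bidixes:
        missing.append(f"bidix:{src}-{tgt}")
    # Transfer rules are specific to one direction.
    if (src, tgt) not in res.transfer_rules:
        missing.append(f"transfer:{src}-{tgt}")
    return missing
```

Called with an inventory that has everything for one direction only, `missing_for` returns an empty list for that direction and the CG and transfer rules for the reverse one.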

Transducers

Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".

| name | language | native name | ISO 639-1 | ISO 639-3 | formalism | state | stems | paradigms | coverage | location | primary authors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| apertium-heb | Hebrew | עִבְרִית | he | heb | lttoolbox | development | | | | apertium-ara-heb (incubator) | missmaryx |
| apertium-mlt | Maltese | Malti | mt | mlt | lttoolbox | development | 7,371 | 758 | | apertium-mlt (languages) | Fran, Unhammer, Fronczak |
| apertium-ara | Arabic | العربية | ar | ara | lttoolbox | development | | | | apertium-ara-heb (incubator) | missmaryx |
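The coverage thresholds above can be sketched as a small calculation. This is a naive illustration, assuming the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown word is marked with a leading `*` in the analysis. The function names `coverage` and `state` are hypothetical, the label "development" for anything below 80% is an assumption taken from the state column, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed: str) -> float:
    """Fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words come back as ^word/*word$.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def state(cov: float) -> str:
    """Map a coverage fraction onto the labels used on this page."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    return "development"  # assumed label for anything below 80%
```

In practice the figure should be averaged over several corpora, since the page asks for coverage "on a range" of them rather than on a single text.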

Existing language pairs

Text in italics denotes a language pair in the incubator. Regular text denotes a developing language pair in staging, text in bold denotes a stable, well-working language pair in trunk, and text in bold italics denotes a pair in staging. The numbers shown are bidix stem counts as reported by dixcounter.

| | heb | mlt | ara |
|---|---|---|---|
| heb | - | mt-he (3,634) | ara-heb (131) |
| mlt | mt-he (3,634) | - | mt-ar (7,570) |
| ara | ara-heb (131) | mt-ar (7,570) | - |
| eng | | en-mt (814) | |
| epo | eo-he (1,505) | | |
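To give a feel for where stem counts like those above come from, here is a rough sketch that counts the entries in a bilingual dictionary. It assumes the usual `.dix` layout, in which each `<section>` holds one `<e>` element per entry. The function name `count_bidix_entries` is hypothetical, and dixcounter itself may count differently, so treat the result as an approximation only.

```python
import xml.etree.ElementTree as ET


def count_bidix_entries(dix_xml: str) -> int:
    """Count <e> entries that are direct children of any <section>."""
    root = ET.fromstring(dix_xml)
    return sum(len(section.findall("e")) for section in root.iter("section"))


# Invented two-entry dictionary, for illustration only.
SAMPLE = """
<dictionary>
  <alphabet/>
  <sdefs><sdef n="n"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>a<s n="n"/></l><r>b<s n="n"/></r></p></e>
    <e><p><l>c<s n="n"/></l><r>d<s n="n"/></r></p></e>
  </section>
</dictionary>
"""

print(count_bidix_entries(SAMPLE))
```

Counting only direct children of `<section>` keeps paradigm definitions out of the total, which is the closest simple analogue to a stem count.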

Semitic languages by subgroup

There are six fairly uncontroversial nodes within the Semitic languages, which are generally grouped into the following branches:

  • East Semitic languages: Akkadian, Eblaite (extinct)
  • Central Semitic languages
  • South Semitic languages
    • Western: Ethiopic languages (Amharic, Tigrinya, etc.) and Old South Arabian languages (Sabaean, Minaean, Qatabānian, Ḥaḑramitic, etc.)
    • Eastern: Modern South Arabian languages (Bathari, Harsusi, Hobyót, Mehri, Shehri, Soqotri)

Samples

Article 1 of the Universal Declaration of Human Rights:

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.

| Language | Text |
|---|---|
| Arabic | يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء. |
| Maltese | Il-bnedmin kollha jitwieldu ħielsa u ugwali fid-dinjità u d-drittijiet. Huma mogħnija bir-raġuni u bil-kuxjenza u għandhom iġibu ruħhom ma’ xulxin bi spirtu ta’ aħwa. |
| Hebrew | כל בני אדם נולדו בני חורין ושווים בערכם ובזכויותיהם. כולם חוננו בתבונה ובמצפון, לפיכך חובה עליהם לנהוג איש ברעהו ברוח של אחוה. |
| Amharic | የሰው፡ልጅ፡ሁሉ፡ሲወለድ፡ነጻና፡በክብርና፡በመብትም፡እኩልነት፡ያለው፡ነው።፡የተፈጥሮ፡ማስተዋልና፡ሕሊና፡ስላለው፡አንዱ፡ሌላውን፡በወንድማማችነት፡መንፈስ፡መመልከት፡ይገባዋል። |
| Tigrigna | ብመንፅር ክብርን መሰልን ኩሎም ሰባት እንትውለዱ ነፃን ማዕሪን እዮም፡፡ ምስትውዓልን ሕልናን ዝተዓደሎም ብምዃኖም ንሕድሕዶም ብሕውነታዊ መንፈስ ክተሓላለዩ ኦለዎም፡፡ |

Vulnerability

This table summarizes the vulnerability of various Semitic languages. Data is derived from the ‘Atlas of the World’s Languages in Danger, © UNESCO, http://www.unesco.org/culture/languages-atlas’ and Ethnologue.

| Language | ISO 639-3 | Location | Speakers | Status (Ethnologue) | Status (UNESCO) |
|---|---|---|---|---|---|
| Jewish Babylonian Aramaic | tmr | Iraq | 0 | 10 (Extinct) | - |
| Mlahsö | lhs | Syrian Arab Republic | 0 | 10 (Extinct) | 5 (Extinct) |
| Mandaic, Classical | myz | Iran | 0 | 10 (Extinct) | - |
| Mesmes | mys | Ethiopia | 0 | 10 (Extinct) | - |
| Syriac | syc | Turkey | 0 | 9 (Dormant) | - |
| Hebrew, Ancient | hbo | Israel | 0 | 9 (Dormant) | - |
| Geez | gez | Ethiopia | 0 | 9 (Second language only) | 5 (Extinct) |
| Samaritan Aramaic | sam | Palestine | 620 | 9 (Dormant) | - |
| Samaritan | smp | Palestine | 620 | 9 (Dormant) | - |
| Barzani Jewish Neo-Aramaic | bjf | Israel & Iraq | 20 | 8b (Nearly extinct) | 5 (Extinct) |
| Bathari | bhm | Oman | 200 | 8b (Nearly extinct) | 4 (Critically endangered) |
| Senaya | syn | Iran | 460 | 8b (Nearly extinct) | 4 (Critically endangered) |
| Hobyót | hoh | Oman, Yemen | 100 | 8a (Moribund) | 3 (Severely endangered) |
| Arabic, Uzbeki Spoken | auz | Uzbekistan | 700 | 8a (Moribund) | - |
| Hulaulá | huy | Israel & Iran | 10,350 | 8a (Moribund) | 5 (Extinct) |
| Soqotri | sqt | Yemen | 64,000 | 8a (Moribund) | 3 (Severely endangered) |
| Harsusi | hss | Oman | 600 | 7 (Shifting) | 2 (Definitely endangered) |
| Bohtan Neo-Aramaic | bhn | Georgia, Russian Federation | 1,000 | 7 (Shifting) | 3 (Severely endangered) |
| Arabic, Cypriot Spoken | acy | Cyprus | 1,300 | 7 (Shifting) | 3 (Severely endangered) |
| Lishanid Noshan | aij | Israel & Iraq | 2,200 | 7 (Shifting) | 5 (Extinct) |
| Lishán Didán | trg | Israel & Iran | 4,450 | 7 (Shifting) | 5 (Extinct) |
| Mandaic | mid | Iran, Iraq | 5,500 | 7 (Shifting) | 4 (Critically endangered) |
| Lishana Deni | lsd | Israel & Iraq | 7,500 | 7 (Shifting) | 5 (Extinct) |
| Western Neo-Aramaic | amw | Syrian Arab Republic | 15,000 | 7 (Shifting) | 2 (Definitely endangered) |
| Arabic, Judeo-Tripolitanian | yud | Israel | 35,000 | 7 (Shifting) | - |
| Arabic, Judeo-Tunisian | ajt | Israel | 45,500 | 7 (Shifting) | 3 (Severely endangered) |
| Mehri | gdq | Oman, Yemen | 115,200 | 7 (Shifting) | 2 (Definitely endangered) |
| Arabic, Judeo-Iraqi | yhd | Israel | 151,820 | 7 (Shifting) | - |
| Chaldean Neo-Aramaic | cld | Iraq | 206,000 | 7 (Shifting) | - |
| Arabic, Judeo-Moroccan | aju | Israel | 258,930 | 7 (Shifting) | 2 (Definitely endangered) |
| Zay | zwa | Ethiopia | 4,880 | 6b (Threatened) | 3 (Severely endangered) |
| Arabic, Tajiki Spoken | abh | Tajikistan | 6,000 | 6b (Threatened) | - |
| Shehri | shv | Oman | 25,000 | 6b (Threatened) | 3 (Severely endangered) |
| Argobba | agj | Ethiopia | 43,700 | 6b (Threatened) | 4 (Critically endangered) |
| Turoyo | tru | Syrian Arab Republic, Turkey | 62,000 | 6b (Threatened) | 3 (Severely endangered) |
| Assyrian Neo-Aramaic | aii | Iraq | 232,300 | 6b (Threatened) | - |
| Koy Sanjaq Surat | kqd | Iraq | 800 | 6a (Vigorous) | - |
| Hértevin | hrt | Turkey | 1,000 | 6a (Vigorous) | 4 (Critically endangered) |
| Dahalik | dlk | Eritrea | 2,500 | 6a (Vigorous) | - |
| Harari | har | Ethiopia | 25,800 | 6a (Vigorous) | - |
| Arabic, Shihhi Spoken | ssh | United Arab Emirates | 27,000 | 6a (Vigorous) | - |
| Arabic, Judeo-Yemeni | jye | Israel | 51,000 | 6a (Vigorous) | - |
| Arabic, Dhofari Spoken | adf | Oman | 70,000 | 6a (Vigorous) | - |
| Arabic, Algerian Saharan Spoken | aao | Algeria | 130,500 | 6a (Vigorous) | - |
| Mesqan | mvz | Ethiopia | 195,000 | 6a (Vigorous) | - |
| Kistane | gru | Ethiopia | 255,000 | 6a (Vigorous) | - |
| Inor | ior | Ethiopia | 280,000 | 6a (Vigorous) | - |
| Arabic, Hadrami Spoken | ayh | Yemen | 410,000 | 6a (Vigorous) | - |
| Arabic, Eastern Egyptian Bedawi Spoken | avl | Egypt | 1,690,000 | 6a (Vigorous) | - |
| Arabic, Gulf Spoken | afb | Iraq | 3,601,000 | 6a (Vigorous) | - |
| Arabic, Hijazi Spoken | acw | Saudi Arabia | 6,023,900 | 6a (Vigorous) | - |
| Arabic, North Mesopotamian Spoken | ayp | Iraq | 6,300,000 | 6a (Vigorous) | - |
| Arabic, Ta’izzi-Adeni Spoken | acq | Yemen | 7,078,500 | 6a (Vigorous) | - |
| Arabic, Sanaani Spoken | ayn | Yemen | 7,600,000 | 6a (Vigorous) | - |
| Arabic, Sa’idi Spoken | aec | Egypt | 19,000,000 | 6a (Vigorous) | - |
| Wolane | wle | Ethiopia | - | 6a (Vigorous) | - |
| Sebat Bet Gurage | sgw | Ethiopia | 440,000 | 5 (Developing) | - |
| Arabic, Omani Spoken | acx | Oman | 853,900 | 5 (Developing) | - |
| Silt’e | stv | Ethiopia | 935,000 | 4 (Educational) | - |
| Tigré | tig | Eritrea | 1,050,000 | 4 (Educational) | - |
| Arabic, Baharna Spoken | abv | Bahrain | 310,000 | 3 (Wider communication) | - |
| Arabic, Chadian Spoken | shu | Chad | 1,139,100 | 3 (Wider communication) | - |
| Arabic, Sudanese Spoken | apd | Sudan | 1,833,000 | 3 (Wider communication) | - |
| Hassaniyya | mey | Mauritania | 3,278,190 | 3 (Wider communication) | - |
| Arabic, Libyan Spoken | ayl | Libya | 4,320,500 | 3 (Wider communication) | - |
| Arabic, South Levantine Spoken | ajp | Jordan | 6,200,000 | 3 (Wider communication) | - |
| Arabic, Tunisian Spoken | aeb | Tunisia | 9,406,900 | 3 (Wider communication) | - |
| Arabic, Najdi Spoken | ars | Saudi Arabia | 9,670,000 | 3 (Wider communication) | - |
| Arabic, North Levantine Spoken | apc | Syria | 14,426,540 | 3 (Wider communication) | - |
| Arabic, Mesopotamian Spoken | acm | Iraq | 15,100,000 | 3 (Wider communication) | - |
| Arabic, Moroccan Spoken | ary | Morocco | 21,048,600 | 3 (Wider communication) | - |
| Arabic, Algerian Spoken | arq | Algeria | 27,997,000 | 3 (Wider communication) | - |
| Arabic, Egyptian Spoken | arz | Egypt | 53,990,000 | 3 (Wider communication) | - |
| Tigrigna | tir | Ethiopia | 6,915,000 | 2 (Provincial) | - |
| Maltese | mlt | Malta | 429,000 | 1 (National) | - |
| Hebrew | heb | Israel | 5,302,770 | 1 (National) | - |
| Amharic | amh | Ethiopia | 21,811,560 | 1 (National) | - |
| Arabic, Standard | arb | Saudi Arabia | 206,000,000 | 1 (National) | - |

This article uses material from the Wikipedia article "Semitic languages", which is released under the Creative Commons Attribution-Share-Alike License 3.0.