Difference between revisions of "Languages"

Latest revision as of 22:33, 1 February 2019

Monolingual language data lives in apertium-languages, and existing data is gradually being moved to this repository scheme. (Originally, all monolingual language data was kept inside language pairs, which meant a lot of duplication.) If you feel something is missing, please feel free to contact us.

New monolingual packages should be developed as incubator languages until they're minimally useful, at which point they can go in apertium-languages. There is no fixed criterion for what constitutes a minimally useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2,500 stems to be considered minimally useful.
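The rule of thumb above can be sketched as a small check. This is only an illustration: the function name `is_minimally_useful` is hypothetical, and requiring both thresholds at once is an assumption, since the text stresses that there is no fixed criterion.

```python
def is_minimally_useful(stems: int, coverage_percent: float) -> bool:
    """Rough check against the guideline figures quoted above.

    Assumption: both conditions must hold (over 60% coverage AND at
    least 2,500 stems). The guideline itself is deliberately loose.
    """
    return coverage_percent > 60.0 and stems >= 2500


# Figures taken from the coverage table on this page.
print(is_minimally_useful(4904, 86.5))  # Avar
print(is_minimally_useful(378, 62.5))   # Yiddish
```

A real decision would also weigh how varied the test corpora are, which a two-number check cannot capture.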

Apertium-languages can be found on GitHub at apertium-languages. You may also want to browse all the languages and pairs using the Apertium source browser.


Languages by coverage

Module	Language	Entries	Coverage
apertium-afr	Afrikaans	7,577	-
apertium-ara	Arabic	6,127	-
apertium-arg	Aragonese	26,068	-
apertium-ast	Asturian	498	-
apertium-ava	Avar	4,904	~86.5%
apertium-bak	Bashkir	46,501	~66%
apertium-ben	Bengali	8,230	~74%
apertium-bre	Breton	18,249	-
apertium-bul	Bulgarian	8,578	-
apertium-cat	Catalan	95,604	-
apertium-ces	Czech	41,199	~90.5%
apertium-chv	Chuvash	10,267	~85%
apertium-crh	Crimean Tatar	11,757	~85.4%
apertium-cym	Welsh	11,015	-
apertium-dan	Danish	52,133	-
apertium-deu	German	74,339	-
apertium-ell	Greek	2,460	-
apertium-eng	English	62,609	-
apertium-eus	Basque	11,471	-
apertium-fao	Faroese	2,318	-
apertium-fin	Finnish	408,216	-
apertium-fra	French		-
apertium-gla	Scottish Gaelic	117	-
apertium-glg	Galician	31,916	-
apertium-glv	Manx	11,353	-
apertium-hbs	Serbo-Croatian	58,004	-
apertium-heb	Hebrew	20,932	-
apertium-hin	Hindi	37,833	~83.1%
apertium-hye	Armenian	8,247	-
apertium-ind	Indonesian	12,264	-
apertium-isl	Icelandic	8,770	-
apertium-ita	Italian	25,609	-
apertium-kaa	Karakalpak	25,545	~86.1%
apertium-kaz	Kazakh	36,595	~94.5%
apertium-kir	Kyrgyz	14,424	~90.4%
apertium-kmr	Kurmanji	17,771	-
apertium-kum	Kumyk	4,918	~90.2%
apertium-ltz	Luxembourgish	11,882	-
apertium-lvs	Latvian	6,756	-
apertium-mar	Marathi	14,886	-
apertium-mkd	Macedonian	30,686	~90.5%
apertium-mlt	Maltese	7,371	-
apertium-nld	Dutch	25,079	-
apertium-nno	Norwegian Nynorsk	182,497	-
apertium-nob	Norwegian Bokmål	246,281	-
apertium-nog	Nogay	1,385	~81.4%
apertium-pol	Polish	13,972	-
apertium-por	Portuguese	14,796	-
apertium-ron	Romanian	18,878	-
apertium-rus	Russian	126,833	~89.6%
apertium-sah	Sakha	11,531	~89.6%
apertium-san	Sanskrit	123,373	-
apertium-slv	Slovenian	20,596	-
apertium-spa	Spanish	46,003	-
apertium-sqi	Albanian	3,312	~80.2%
apertium-srd	Sardinian	46,642	-
apertium-swe	Swedish	138,490	-
apertium-tat	Tatar	55,702	~91%
apertium-tuk	Turkmen	2,988	~70.7%
apertium-tur	Turkish	17,221	~87.3%
apertium-tyv	Tuvan	11,695	~92.7%
apertium-ukr	Ukrainian	10,709	-
apertium-urd	Urdu	14,943	~64.6%
apertium-uzb	Uzbek	34,470	~82.9%
apertium-yid	Yiddish	378	~62.5%
apertium-zho	Chinese	8,521	-
apertium-zlm	Malay	11,894	-
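Rows of the table above can be read mechanically, for example to sort modules by coverage. The following is a minimal sketch, assuming each row is tab-separated in the order shown (module, language, entries, coverage), that a missing entry count is an empty field, and that unknown coverage is written as `-`. The function name `parse_row` is hypothetical.

```python
def parse_row(line: str):
    """Split one tab-separated table row into typed fields.

    Returns (module, language, entries, coverage), where entries is an
    int or None and coverage is a percentage as float or None.
    """
    module, language, entries, coverage = line.rstrip("\n").split("\t")
    # Entry counts may contain thousands separators, or be missing.
    entries_n = int(entries.replace(",", "")) if entries else None
    # Coverage is written like "~94.5%", or "-" when not measured.
    cov = None if coverage == "-" else float(coverage.lstrip("~").rstrip("%"))
    return module, language, entries_n, cov


rows = [
    parse_row("apertium-kaz\tKazakh\t36,595\t~94.5%"),
    parse_row("apertium-fra\tFrench\t\t-"),
    parse_row("apertium-tyv\tTuvan\t11,695\t~92.7%"),
]
# Sort modules with a measured coverage, highest first.
measured = sorted((r for r in rows if r[3] is not None), key=lambda r: -r[3])
print([r[0] for r in measured])
```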

Languages by family

Turkic:
- Oghuz: Turkmen, Turkish
- Kypchak: Kazakh, Kyrgyz, Tatar, Bashqort, Kumyk, Nogay, Karakalpak
- Karluk: Uzbek, Uyghur
- Other: Chuvash, Sakha, Tuvan
Indo-European:
- Slavic: Russian, Serbo-Croatian, Macedonian, Czech, Bulgarian, Ukrainian, Polish, Slovenian
- Celtic: Scottish Gaelic, Breton, Welsh, Manx
- Germanic
  - West Germanic: Dutch, Afrikaans, English, German, Luxembourgish, Yiddish
  - North Germanic: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese
- Romance: Aragonese, Asturian, Catalan, Spanish, French, Galician, Italian, Portuguese, Sardinian, Romanian, Corsican
- Indic: Urdu, Bengali, Hindi, Sanskrit, Marathi
- Baltic: Latvian
- Other: Albanian, Armenian, Greek
Semitic: Maltese, Arabic, Hebrew
Uralic: Finnish
Daghestani languages: Avar
Vasconic languages: Basque
Sinitic languages: Chinese
Austronesian languages: Malay, Indonesian

Languages by region

Volga-Kama: Tatar, Bashqort, Chuvash
Balkans: Albanian, Bulgarian, Greek, Macedonian, Serbo-Croatian, Romanian
Caucasus: Kumyk, Nogay, Armenian, Avar
Central Asia: Kazakh, Kyrgyz, Turkmen, Uzbek, Karakalpak
former Soviet Union: Kyrgyz, Kazakh, Azeri, Turkmen, Tatar, Bashqort, Chuvash, Armenian, Tajik, Avar, Uyghur, Karakalpak, Uzbek, Kumyk, Sakha, Tuvan, Latvian, Gagauz
Languages of Spain: Spanish, Basque, Catalan, Asturian, Galician, Aragonese
Languages of the Baltics: Latvian, Estonian

Language family pages

Language family pages exist to show the overall progress of monolingual language modules for regions and language families of interest. Currently the following pages are (or should soon be) available on this wiki (family names in bold use the format described below):

The language family pages should represent the following data in a standardised format:

The languages of that group with apertium data (whether in languages, incubator, part of a pair in trunk, etc.)
2- and 3-letter ISO codes for each language
The formalism the module is written in
Links to the pages for each language on the apertium wiki
The location in the apertium repository (whether in languages, incubator, part of a pair in trunk, etc.)
Development status, which should be one of the following:
- production - for language modules used in a released pair, usually over 90% coverage and/or over 10,000 stems
- working - for language modules with near-production-quality performance, usually over 80% coverage and/or over 8,000 stems
- development - for language modules under development, usually over 60% coverage and/or over 1,000 stems
- prototype - for language modules that have not received heavy development, usually less than 60% coverage or under 1,000 stems

Here are status guidelines summarised in a table:

status	description	stems	coverage
prototype	language module that has not received heavy development	<1,000	<60%
development	language module under development	≥1,000	≥60%
working	language module with near-production-quality performance	≥8,000	≥80%
production	language module used in a released pair	≥10,000	≥90%
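The thresholds in the table can be turned into a rough classifier. This is a sketch under stated assumptions, not an official rule: the guidelines say "and/or", and the code below takes the stricter reading in which a module must meet both the stem and the coverage threshold of a tier. The names `TIERS` and `status_from_stats` are hypothetical. Note also that "production" is defined by use in a released pair, which the numbers alone cannot confirm.

```python
# (status, minimum stems, minimum coverage in percent), strictest tier first.
TIERS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]


def status_from_stats(stems: int, coverage_percent: float) -> str:
    """Return the highest tier whose two thresholds are both met.

    Assumption: "and/or" is read as "and". Anything below the
    development thresholds counts as a prototype.
    """
    for name, min_stems, min_coverage in TIERS:
        if stems >= min_stems and coverage_percent >= min_coverage:
            return name
    return "prototype"


# Figures taken from the coverage table on this page.
print(status_from_stats(36595, 94.5))  # Kazakh
print(status_from_stats(4918, 90.2))   # Kumyk
print(status_from_stats(378, 62.5))    # Yiddish
```

Under the looser "or" reading, a module such as Kumyk (high coverage, fewer stems) would land in a higher tier, so the result should be treated as a starting point for a human judgement.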

Additionally, the following data is put on apertium-xxx/stats pages, and is included on the language family page and other places as relevant:

The number of stems (and paradigms if relevant) in that language module
The coverage of the transducer on a variety of corpora
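Coverage here is usually understood as the share of corpus tokens for which the transducer returns at least one analysis. The sketch below is a naive illustration of that idea. It assumes analyser output in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token has an analysis starting with `*`. It ignores escaped characters, and the function name `naive_coverage` is hypothetical.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
TOKEN = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Percentage of tokens whose analysis is not marked unknown ('*')."""
    tokens = TOKEN.findall(analysed)
    if not tokens:
        return 0.0
    known = sum(1 for _surface, analyses in tokens if not analyses.startswith("*"))
    return 100.0 * known / len(tokens)


# Hand-written sample: three analysed tokens and one unknown token.
sample = "^the/the<det><def><sp>$ ^blorf/*blorf$ ^cat/cat<n><sg>$ ^sat/sit<vblex><past>$"
print(naive_coverage(sample))  # 75.0
```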

There should also be a table of language pairs available with these languages, with number of stems from apertium-xxx-yyy/stats pages on the wiki. Guidelines for font semantics for the pairs follow:

production / trunk = bold
working / staging = bold+italics
development / nursery = normal
prototype / incubator = italics
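The font guidelines above map directly onto MediaWiki quote markup, where `''` gives italics, `'''` gives bold and `'''''` gives bold italics. A minimal sketch of that mapping follows. The names `MARKUP` and `format_pair` are hypothetical, and `apertium-xxx-yyy` is the placeholder pair name used on this page.

```python
# Wiki quote markup for each development status of a pair.
MARKUP = {
    "production": "'''",     # bold
    "working": "'''''",      # bold + italics
    "development": "",       # normal
    "prototype": "''",       # italics
}


def format_pair(name: str, status: str) -> str:
    """Wrap a pair name in the wiki markup matching its status."""
    quotes = MARKUP[status]
    return f"{quotes}{name}{quotes}"


print(format_pair("apertium-xxx-yyy", "working"))
```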

Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.

@@ Line 166: / Line 166: @@
 *** [[North Germanic languages|North Germanic]]: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese
 ** [[Romance languages|Romance]]: [[Aragonese]], [[Asturian]], [[Catalan]], [[Spanish]], [[French]], [[Galician]], [[Italian]], [[Portuguese]], [[Sardinian]], [[Romanian]], [[Corsican]]
-** [[Indic languages|Indic]]: [[Urdu]], [[Bengali]], [[Hindi]], [[Sanskrit]]
+** [[Indic languages|Indic]]: [[Urdu]], [[Bengali]], [[Hindi]], [[Sanskrit]], [[Marathi]]
 ** [[Baltic languages|Baltic]]: Latvian
 ** Other: [[Albanian]], Armenian, Greek
