Difference between revisions of "Languages"
Firespeaker (talk | contribs) (→Contents: different "by"s) |
Firespeaker (talk | contribs) |
||
(122 intermediate revisions by 9 users not shown) | |||
Line 1: | Line 1: | ||
{{TOCD}} |
|||
:''If you are looking for the category, click [[:Category:Languages|here]]'' |
:''If you are looking for the category, click [[:Category:Languages|here]]'' |
||
Monolingual language data lives in [https://github.com/apertium/apertium-languages apertium-languages]. Monolingual language data in Apertium is slowly being moved to this new repository scheme. (Originally, all monolingual language data was found in language pairs, meaning that there was a lot of duplication.) If you feel something is missing, please feel free to [[contact|contact us]]. |
|||
New monolingual packages should be developed as [https://github.com/apertium/apertium-incubator incubator] languages until they're minimally useful, at which point they can go in apertium-languages. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful. |
|||
It can be found [https://svn.code.sf.net/p/apertium/svn/languages here] |
|||
Apertium-languages can be found on GitHub at [https://github.com/apertium/apertium-languages apertium-languages]. You may also want to browse all the languages and pairs using the [https://apertium.github.io/apertium-on-github/source-browser.html Apertium source browser]. |
|||
==Contents== |
==Contents== |
||
=== Languages by coverage === |
=== Languages by coverage === |
||
{{div col begin|colwidth=30em}} |
|||
{|class="wikitable sortable" |
{|class="wikitable sortable" |
||
! Module !! Language !! Entries !! Coverage |
! Module !! Language !! Entries !! Coverage |
||
|- |
|- |
||
| apertium- |
| apertium-afr|| [[Afrikaans]] ||align="right"| {{#lst:apertium-afr/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-ara|| [[Arabic]] ||align="right"| {{#lst:apertium-ara/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-arg|| [[Aragonese]] ||align="right"| {{#lst:apertium-arg/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-ast|| [[Asturian]] ||align="right"| {{#lst:apertium-ast/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium-ava|| [[Avar]] ||align="right"| {{#lst:apertium-ava/stats|stems}} ||align="center"| [[Apertium-ava#Current_State|~{{:Apertium-ava/stats/average}}%]] |
|||
| apertium-hbs|| Serbo-Croatian || - || - |
|||
|- |
|- |
||
| apertium-bak|| [[Bashkir]] ||align="right"| {{#lst:apertium-bak/stats|stems}} ||align="center"| [[Apertium-bak#Current_State|~{{:Apertium-bak/stats/average}}%]] |
|||
| apertium-hye|| Armenian || - || - |
|||
|- |
|- |
||
| apertium- |
| apertium-ben|| [[Bengali]] ||align="right"| {{#lst:apertium-ben/stats|stems}} ||align="center"| [[Apertium-ben#Current_State|~{{:Apertium-ben/stats/average}}%]] |
||
|- |
|- |
||
| apertium- |
| apertium-bre|| [[Breton]] ||align="right"| {{#lst:apertium-bre/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-bul|| Bulgarian ||align="right"| {{#lst:apertium-bul/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-cat|| [[Catalan]] ||align="right"| {{#lst:apertium-cat/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-ces|| [[Czech]] ||align="right"| {{#lst:apertium-ces/stats|stems}} ||align="center"| [[Apertium-ces#Current_State|~{{:Apertium-ces/stats/average}}%]] |
||
|- |
|- |
||
| apertium-chv|| [[Chuvash]] ||align="right"| {{#lst:apertium-chv/stats|stems}} ||align="center"| [[Apertium-chv#Current_State|~{{:Apertium-chv/stats/average}}%]] |
|||
| apertium-rus|| Russian || - || - |
|||
|- |
|- |
||
| apertium-crh|| [[Crimean Tatar]] ||align="right"| {{#lst:apertium-crh/stats|stems}} ||align="center"| [[Apertium-crh#Current_State|~{{:Apertium-crh/stats/average}}%]] |
|||
| apertium-sqi|| Albanian || - || - |
|||
|- |
|- |
||
| apertium- |
| apertium-cym|| [[Welsh]] ||align="right"| {{#lst:apertium-cym/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-dan|| Danish ||align="right"| {{#lst:apertium-dan/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium- |
| apertium-deu|| German ||align="right"| {{#lst:apertium-deu/stats|stems}} ||align="center"| - |
||
|- |
|- |
||
| apertium-ell|| Greek ||align="right"| {{#lst:apertium-ell/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-eng|| English ||align="right"| {{#lst:apertium-eng/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-eus|| Basque ||align="right"| {{#lst:apertium-eus/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-fao|| Faroese ||align="right"| {{#lst:apertium-fao/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-fin|| Finnish ||align="right"| {{#lst:apertium-fin/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-fra|| French ||align="right"| {{#lst:apertium-fra/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-gla|| Scottish Gaelic ||align="right"| {{#lst:apertium-gla/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-glg|| Galician ||align="right"| {{#lst:apertium-glg/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-glv|| Manx ||align="right"| {{#lst:apertium-glv/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-hbs|| Serbo-Croatian ||align="right"| {{#lst:apertium-hbs/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-heb|| Hebrew ||align="right"| {{#lst:apertium-heb/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-hin|| [[Hindi]] ||align="right"| {{#lst:apertium-hin/stats|stems}} ||align="center"| [[Apertium-hin#Current_State|~{{:Apertium-hin/stats/average}}%]] |
|||
|- |
|||
| apertium-hye|| Armenian ||align="right"| {{#lst:apertium-hye/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-ind|| Indonesian ||align="right"| {{#lst:apertium-ind/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-isl|| Icelandic ||align="right"| {{#lst:apertium-isl/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-ita|| Italian ||align="right"| {{#lst:apertium-ita/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-kaa|| [[Karakalpak]] ||align="right"| {{#lst:apertium-kaa/stats|stems}} ||align="center"| [[Apertium-kaa#Current_State|~{{:Apertium-kaa/stats/average}}%]] |
|||
|- |
|||
| apertium-kaz|| [[Kazakh]] ||align="right"| {{#lst:apertium-kaz/stats|stems}} ||align="center"| [[Apertium-kaz#Current_State|~{{:Apertium-kaz/stats/average}}%]] |
|||
|- |
|||
| apertium-kir|| [[Kyrgyz]] ||align="right"| {{#lst:apertium-kir/stats|stems}} ||align="center"| [[Apertium-kir#Current_State|~{{:Apertium-kir/stats/average}}%]] |
|||
|- |
|||
| apertium-kmr|| [[Kurmanji]] ||align="right"| {{#lst:apertium-kmr/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-kum|| [[Kumyk]] ||align="right"| {{#lst:apertium-kum/stats|stems}} ||align="center"| [[Apertium-kum#Current_State|~{{:Apertium-kum/stats/average}}%]] |
|||
|- |
|||
| apertium-ltz|| [[Luxembourgish]] ||align="right"| {{#lst:apertium-ltz/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-lvs|| [[Latvian]] ||align="right"| {{#lst:apertium-lvs/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-mar|| [[Marathi]] ||align="right"| {{#lst:apertium-mar/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-mkd|| [[Macedonian]] ||align="right"| {{#lst:apertium-mkd/stats|stems}} ||align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]] |
|||
|- |
|||
| apertium-mlt|| [[Maltese]] ||align="right"| {{#lst:apertium-mlt/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-nld|| [[Dutch]] ||align="right"| {{#lst:apertium-nld/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-nno|| [[Norwegian Nynorsk]] ||align="right" | {{#lst:apertium-nno/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-nob|| [[Norwegian Bokmål]] ||align="right" | {{#lst:apertium-nob/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-nog|| [[Nogay]] ||align="right"| {{#lst:apertium-nog/stats|stems}} ||align="center"| [[Apertium-nog#Current_State|~{{:Apertium-nog/stats/average}}%]] |
|||
|- |
|||
| apertium-pol|| [[Polish]] ||align="right"| {{#lst:apertium-pol/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-por|| [[Portuguese]] ||align="right"| {{#lst:apertium-por/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-ron|| [[Romanian]] ||align="right"| {{#lst:apertium-ron/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-rus|| [[Russian]] ||align="right"| {{#lst:apertium-rus/stats|stems}} ||align="center"| [[Apertium-rus#Current_State|~{{:Apertium-rus/stats/average}}%]] |
|||
|- |
|||
| apertium-sah|| [[Sakha]] ||align="right"| {{#lst:apertium-sah/stats|stems}} ||align="center"| [[Apertium-sah#Current_State|~{{:Apertium-sah/stats/average}}%]] |
|||
|- |
|||
| apertium-san|| [[Sanskrit]] ||align="right"| {{#lst:apertium-san/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-slv|| Slovenian ||align="right"| {{#lst:apertium-slv/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-spa|| Spanish ||align="right"| {{#lst:apertium-spa/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-sqi|| [[Albanian]] ||align="right"| {{#lst:apertium-sqi/stats|stems}} ||align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]] |
|||
|- |
|||
| apertium-srd|| Sardinian ||align="right"| {{#lst:apertium-srd/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-swe|| Swedish ||align="right"| {{#lst:apertium-swe/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-tat|| [[Tatar]] ||align="right"| {{#lst:apertium-tat/stats|stems}} ||align="center"| [[Apertium-tat#Current_State|~{{:Apertium-tat/stats/average}}%]] |
|||
|- |
|||
| apertium-tuk|| [[Turkmen]] ||align="right"| {{#lst:apertium-tuk/stats|stems}} ||align="center"| [[Apertium-tuk#Current_State|~{{:Apertium-tuk/stats/average}}%]] |
|||
|- |
|||
| apertium-tur|| [[Turkish]] ||align="right"| {{#lst:apertium-tur/stats|stems}} ||align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]] |
|||
|- |
|||
| apertium-tyv|| [[Tuvan]] ||align="right"| {{#lst:apertium-tyv/stats|stems}} ||align="center"| [[Apertium-tyv#Current_State|~{{:Apertium-tyv/stats/average}}%]] |
|||
|- |
|||
| apertium-ukr|| [[Ukrainian]] ||align="right"| {{#lst:apertium-ukr/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-urd|| [[Urdu]] ||align="right"| {{#lst:apertium-urd/stats|stems}} ||align="center"| [[Apertium-urd#Current_State|~{{:Apertium-urd/stats/average}}%]] |
|||
|- |
|||
| apertium-uzb|| [[Uzbek]] ||align="right"| {{#lst:apertium-uzb/stats|stems}} ||align="center"| [[Apertium-uzb#Current_State|~{{:Apertium-uzb/stats/average}}%]] |
|||
|- |
|||
| apertium-yid|| [[Yiddish]] ||align="right"| {{#lst:apertium-yid/stats|stems}} ||align="center"| [[Apertium-yid#Current_State|~{{:Apertium-yid/stats/average}}%]] |
|||
|- |
|||
| apertium-zho|| [[Chinese]] ||align="right"| {{#lst:apertium-zho/stats|stems}} ||align="center"| - |
|||
|- |
|||
| apertium-zlm|| [[Malay]] ||align="right"| {{#lst:apertium-zlm/stats|stems}} ||align="center"| - |
|||
|- |
|- |
||
|} |
|} |
||
{{div col end}} |
|||
=== Languages by family === |
=== Languages by family === |
||
Languages still in incubator are ''in italics''. |
|||
* [[Turkic languages|Turkic]]: |
* [[Turkic languages|Turkic]]: |
||
** |
** Oghuz: [[Turkmen]], [[Turkish]] |
||
** Kypchak: Kazakh, Kyrgyz, Tatar, Bashqort, Kumyk, Nogay, |
** Kypchak: [[Kazakh]], [[Kyrgyz]], [[Tatar]], [[Bashqort]], [[Kumyk]], [[Nogay]], [[Karakalpak]] |
||
** |
** Karluk: [[Uzbek]], [[Uyghur]] |
||
** Other: [[Chuvash]], [[Sakha]], [[Tuvan]] |
|||
* [[Indo-European languages|Indo-European]] |
* [[Indo-European languages|Indo-European]] |
||
** [[ |
** [[Slavic languages|Slavic]]: Russian, Serbo-Croatian, Macedonian, Czech, Bulgarian, Ukrainian, Polish, [[Slovenian]] |
||
** [[Celtic languages|Celtic]]: Scottish Gaelic |
** [[Celtic languages|Celtic]]: Scottish Gaelic, Breton, Welsh, Manx |
||
** [[Germanic languages|Germanic]] |
** [[Germanic languages|Germanic]] |
||
** [[ |
*** [[West Germanic languages|West Germanic]]: Dutch, Afrikaans, English, German, Luxembourgish, [[Yiddish]] |
||
*** [[North Germanic languages|North Germanic]]: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese |
|||
** [[Iranian languages|Iranian]]: |
|||
** [[Romance languages|Romance]]: [[Aragonese]], [[Asturian]], [[Catalan]], [[Spanish]], [[French]], [[Galician]], [[Italian]], [[Portuguese]], [[Sardinian]], [[Romanian]], [[Corsican]] |
|||
** [[Indic languages|Indic]]: [[Urdu]], [[Bengali]], [[Hindi]], [[Sanskrit]], [[Marathi]] |
|||
** [[Baltic languages|Baltic]]: Latvian |
** [[Baltic languages|Baltic]]: Latvian |
||
** Other |
** Other: [[Albanian]], Armenian, Greek |
||
* [[Semitic languages|Semitic]]: Maltese |
* [[Semitic languages|Semitic]]: Maltese, Arabic, Hebrew |
||
* [[Uralic languages|Uralic]]: |
* [[Uralic languages|Uralic]]: Finnish |
||
* Daghestani languages: [[Avar]] |
|||
* Vasconic languages: [[Basque]] |
|||
* Sinitic languages: Chinese |
|||
* Austronesian languages: [[Malay]], [[Indonesian]] |
|||
=== Languages by region === |
|||
==Requiring a monolingual package as a dependency of a pair== |
|||
* '''[[Languages of the Volga-Kama region|Volga-Kama]]''': [[Tatar]], [[Bashqort]], [[Chuvash]] |
|||
Say you want apertium-[http://www.ethnologue.com/language/fie fie]-[http://www.ethnologue.com/language/bar bar] to depend on some monolingual data from the apertium-bar package, e.g. <code>apertium-bar/bar.lrx</code> and maybe other such files. |
|||
* '''[[Balkan languages|Balkans]]''': [[Albanian]], [[Bulgarian]], [[Greek]], [[Macedonian]], [[Serbo-Croatian]], [[Romanian]] |
|||
* '''[[Languages of the Caucasus|Caucasus]]''': [[Kumyk]], [[Nogay]], [[Armenian]], [[Avar]] |
|||
* '''[[Languages of Central Asia|Central Asia]]''': [[Kazakh]], [[Kyrgyz]], [[Turkmen]], [[Uzbek]], [[Karakalpak]] |
|||
* [[Languages of the former Soviet Union|former Soviet Union]]: [[Kyrgyz]], [[Kazakh]], [[Azeri]], [[Turkmen]], [[Tatar]], [[Bashqort]], [[Chuvash]], [[Armenian]], [[Tajik]], [[Avar]], [[Uyghur]], [[Karakalpak]], [[Uzbek]], [[Kumyk]], [[Sakha]], [[Tuvan]], [[Latvian]], [[Gagauz]] |
|||
* Languages of Spain: [[Spanish]], [[Basque]], [[Catalan]], [[Asturian]], [[Galician]], [[Aragonese]] |
|||
* Languages of the Baltics: [[Latvian]], [[Estonian]] |
|||
== Language family pages == |
|||
Assuming apertium-bar is set up correctly, you can put the following lines into the <code>configure.ac</code> of apertium-fie-bar: |
|||
'''Language family pages''' exist to show the overall progress of monolingual language modules for regions and language families of interest. Currently the following pages are (or should soon be) available on this wiki (family names in '''bold''' use the format described below): |
|||
<pre> |
|||
* '''[[Turkic languages]]''' |
|||
AC_ARG_VAR(BARSRC, "Source directory for apertium-bar") |
|||
* '''[[Uralic languages]]''' |
|||
AS_IF([test -z "$BARSRC"], |
|||
* '''[[Indic languages]]''' |
|||
[ |
|||
* '''[[Dravidian languages]]''' |
|||
PKG_CHECK_MODULES([APERTIUM_BAR], [apertium-bar]) |
|||
* '''[[Balkan languages]]''' |
|||
BARSRC=`pkg-config --variable=srcdir apertium-bar` |
|||
* '''[[Celtic languages]]''' |
|||
], |
|||
* '''[[Languages of the Volga-Kama region]]''' |
|||
[echo "Using apertium-bar from $BARSRC"]) |
|||
* '''[[Iranian languages]]''' |
|||
</pre> |
|||
* '''[[Slavic languages]]''' |
|||
* '''[[Mongolic languages]]''' |
|||
* '''[[Semitic languages]]''' |
|||
* '''[[Germanic languages]]''' |
|||
* '''[[Romance languages]]''' |
|||
* '''[[Languages of the Caucasus]]''' |
|||
* '''[[Languages of Central Asia]]''' |
|||
* [[Languages of the former Soviet Union]] |
|||
The language family pages should represent the following data in a standardised format: |
|||
* The languages of that group with apertium data (whether in languages, incubator, part of a pair in trunk, etc.) |
|||
* 2- and 3-letter ISO codes for each language |
|||
* The formalism the module is written in |
|||
* Links to the pages for each language on the apertium wiki |
|||
* The location in the apertium repository (whether in languages, incubator, part of a pair in trunk, etc.) |
|||
* Development status, which should be one of the following: |
|||
** '''production''' - for language modules used in a released pair, usually over 90% coverage and/or over 10,000 stems |
|||
** '''working''' - for language modules with near-production-quality performance, usually over 80% coverage and/or over 8,000 stems |
|||
** '''development''' - for language modules under development, usually over 60% coverage and/or over 1,000 stems |
|||
** '''prototype''' - for language modules that have not received heavy development, usually less than 60% coverage or under 1,000 stems |
|||
Here are status guidelines summarised in a table: |
|||
{|class="wikitable" |
|||
|- |
|||
! status !! description !! stems !! coverage !! bidix table |
|||
|- |
|||
!style="background-color: red;"| '''prototype''' |
|||
| language module that has not received heavy development |
|||
| <1,000 |
|||
| <60% |
|||
|- |
|||
!style="background-color: orange;"| '''development''' |
|||
| language module under development |
|||
| ≥1,000 |
|||
| ≥60% |
|||
|- |
|||
!style="background-color: yellow;"| '''working''' |
|||
| language module with near-production-quality performance |
|||
| ≥8,000 |
|||
| ≥80% |
|||
|- |
|||
!style="background-color: green;"| '''production''' |
|||
| language module used in a released pair |
|||
| ≥10,000 |
|||
| ≥90% |
|||
|} |
|||
and in the <code>Makefile.am</code>, you can write rules like this: |
|||
<pre> |
|||
bar-fie.lrx.bin: $(BARSRC)/bar.lrx |
|||
	lrx-comp $< $@ |
|||
</pre> |
|||
Additionally, the following data is put on <code>apertium-xxx/stats</code> pages, and is included on the language family page and other places as relevant: |
|||
Now, if you have run "make install" in apertium-bar before running autogen.sh in apertium-fie-bar, apertium-fie-bar will compile against the bar.lrx installed by apertium-bar. |
|||
* The number of stems (and paradigms if relevant) in that language module |
|||
* The coverage of the transducer on a variety of corpora |
|||
There should also be a table of language pairs available with these languages, with number of stems from <code>apertium-xxx-yyy/stats</code> pages on the wiki. Guidelines for font semantics for the pairs follow: |
|||
* production / trunk = '''bold''' |
|||
* working / staging = '''''bold+italics''''' |
|||
* development / nursery = normal |
|||
* prototype / incubator = ''italics'' |
|||
Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data. |
|||
If you make a lot of changes to apertium-bar and want to avoid having to "make install" for each and every change, you can do this in apertium-fie-bar: |
|||
<pre> |
|||
./autogen.sh BARSRC=/path/to/apertium-bar |
|||
</pre> |
|||
(<code>./configure BARSRC=/path/to/apertium-bar</code> should also work). Now each time you make, the "BARSRC" variable will point to /path/to/apertium-bar instead of the "make install"-ed file. You can set it back to default by just running plain <code>./configure</code> again. |
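The lookup order used by the <code>configure.ac</code> fragment above can be sketched in Python. This is only an illustration of the logic (an explicit <code>BARSRC</code> wins, otherwise the <code>srcdir</code> variable reported by pkg-config is used); the names <code>resolve_barsrc</code> and <code>pkg_config_srcdir</code> are hypothetical and not part of Apertium's build system.

```python
import subprocess


def pkg_config_srcdir(package):
    """Ask pkg-config for the 'srcdir' variable of an installed package."""
    result = subprocess.run(
        ["pkg-config", "--variable=srcdir", package],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def resolve_barsrc(environ, query=pkg_config_srcdir):
    """Mirror the AS_IF test: use BARSRC if it is set, otherwise query pkg-config.

    `environ` is a mapping such as os.environ. `query` is injectable so the
    lookup can be exercised without pkg-config installed (illustrative only).
    """
    explicit = environ.get("BARSRC", "")
    if explicit:
        # Corresponds to running ./configure BARSRC=/path/to/apertium-bar
        return explicit
    # Corresponds to BARSRC=`pkg-config --variable=srcdir apertium-bar`
    return query("apertium-bar")
```

Calling <code>resolve_barsrc({"BARSRC": "/path/to/apertium-bar"})</code> returns the given path without touching pkg-config, which matches the behaviour described for the development workflow.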
|||
====Making a monolingual package dependable for pairs==== |
|||
In apertium-bar, there should be a file <code>apertium-bar.pc.in</code>. This has to have the following lines: |
|||
<pre> |
|||
dir=@libdir@/apertium/apertium-bar |
|||
srcdir=@datarootdir@/apertium/apertium-bar |
|||
</pre> |
|||
These should correspond to where the binaries and source files respectively are installed by <code>Makefile.am</code> (typically named <code>apertium_bardir</code> and <code>apertium_bar_srcdir</code>). The <code>configure.ac</code> should have a line saying something like <code>AC_OUTPUT([Makefile apertium-bar.pc])</code>. See https://svn.code.sf.net/p/apertium/svn/languages/apertium-nob for a working example. |
|||
==See also== |
==See also== |
||
Line 112: | Line 264: | ||
[[Category:Terminology]] |
[[Category:Terminology]] |
||
[[Category:Documentation in English]] |
[[Category:Documentation in English]] |
||
[[Category:Languages|*]] |
Latest revision as of 22:33, 1 February 2019
- If you are looking for the category, click here
Monolingual language data lives in apertium-languages. Monolingual language data in Apertium is slowly being moved to this new repository scheme. (Originally, all monolingual language data was found in language pairs, meaning that there was a lot of duplication.) If you feel something is missing, please feel free to contact us.
New monolingual packages should be developed as incubator languages until they're minimally useful, at which point they can go in apertium-languages. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful.
Apertium-languages can be found on GitHub at apertium-languages. You may also want to browse all the languages and pairs using the Apertium source browser.
Contents
Languages by coverage
Module | Language | Entries | Coverage |
---|---|---|---|
apertium-afr | Afrikaans | 7,577 | - |
apertium-ara | Arabic | 6,127 | - |
apertium-arg | Aragonese | 26,068 | - |
apertium-ast | Asturian | 498 | - |
apertium-ava | Avar | 4,904 | ~86.5% |
apertium-bak | Bashkir | 46,501 | ~66% |
apertium-ben | Bengali | 8,230 | ~74% |
apertium-bre | Breton | 18,249 | - |
apertium-bul | Bulgarian | 8,578 | - |
apertium-cat | Catalan | 95,604 | - |
apertium-ces | Czech | 41,199 | ~90.5% |
apertium-chv | Chuvash | 10,267 | ~85% |
apertium-crh | Crimean Tatar | 11,757 | ~85.4% |
apertium-cym | Welsh | 11,015 | - |
apertium-dan | Danish | 52,133 | - |
apertium-deu | German | 74,339 | - |
apertium-ell | Greek | 2,460 | - |
apertium-eng | English | 62,609 | - |
apertium-eus | Basque | 11,471 | - |
apertium-fao | Faroese | 2,318 | - |
apertium-fin | Finnish | 408,216 | - |
apertium-fra | French | - | |
apertium-gla | Scottish Gaelic | 117 | - |
apertium-glg | Galician | 31,916 | - |
apertium-glv | Manx | 11,353 | - |
apertium-hbs | Serbo-Croatian | 58,004 | - |
apertium-heb | Hebrew | 20,932 | - |
apertium-hin | Hindi | 37,833 | ~83.1% |
apertium-hye | Armenian | 8,247 | - |
apertium-ind | Indonesian | 12,264 | - |
apertium-isl | Icelandic | 8,770 | - |
apertium-ita | Italian | 25,609 | - |
apertium-kaa | Karakalpak | 25,545 | ~86.1% |
apertium-kaz | Kazakh | 36,595 | ~94.5% |
apertium-kir | Kyrgyz | 14,424 | ~90.4% |
apertium-kmr | Kurmanji | 17,771 | - |
apertium-kum | Kumyk | 4,918 | ~90.2% |
apertium-ltz | Luxembourgish | 11,882 | - |
apertium-lvs | Latvian | 6,756 | - |
apertium-mar | Marathi | 14,886 | - |
apertium-mkd | Macedonian | 30,686 | ~90.5% |
apertium-mlt | Maltese | 7,371 | - |
apertium-nld | Dutch | 25,079 | - |
apertium-nno | Norwegian Nynorsk | 182,497 | - |
apertium-nob | Norwegian Bokmål | 246,281 | - |
apertium-nog | Nogay | 1,385 | ~81.4% |
apertium-pol | Polish | 13,972 | - |
apertium-por | Portuguese | 14,796 | - |
apertium-ron | Romanian | 18,878 | - |
apertium-rus | Russian | 126,833 | ~89.6% |
apertium-sah | Sakha | 11,531 | ~89.6% |
apertium-san | Sanskrit | 123,373 | - |
apertium-slv | Slovenian | 20,596 | - |
apertium-spa | Spanish | 46,003 | - |
apertium-sqi | Albanian | 3,312 | ~80.2% |
apertium-srd | Sardinian | 46,642 | - |
apertium-swe | Swedish | 138,490 | - |
apertium-tat | Tatar | 55,702 | ~91% |
apertium-tuk | Turkmen | 2,988 | ~70.7% |
apertium-tur | Turkish | 17,221 | ~87.3% |
apertium-tyv | Tuvan | 11,695 | ~92.7% |
apertium-ukr | Ukrainian | 10,709 | - |
apertium-urd | Urdu | 14,943 | ~64.6% |
apertium-uzb | Uzbek | 34,470 | ~82.9% |
apertium-yid | Yiddish | 378 | ~62.5% |
apertium-zho | Chinese | 8,521 | - |
apertium-zlm | Malay | 11,894 | - |
Languages by family
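- Turkic:
- Oghuz: Turkmen, Turkish
- Kypchak: Kazakh, Kyrgyz, Tatar, Bashqort, Kumyk, Nogay, Karakalpak
- Karluk: Uzbek, Uyghur
- Other: Chuvash, Sakha, Tuvan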
- Indo-European
- Slavic: Russian, Serbo-Croatian, Macedonian, Czech, Bulgarian, Ukrainian, Polish, Slovenian
- Celtic: Scottish Gaelic, Breton, Welsh, Manx
- Germanic
- West Germanic: Dutch, Afrikaans, English, German, Luxembourgish, Yiddish
- North Germanic: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese
- Romance: Aragonese, Asturian, Catalan, Spanish, French, Galician, Italian, Portuguese, Sardinian, Romanian, Corsican
- Indic: Urdu, Bengali, Hindi, Sanskrit, Marathi
- Baltic: Latvian
- Other: Albanian, Armenian, Greek
- Semitic: Maltese, Arabic, Hebrew
- Uralic: Finnish
- Daghestani languages: Avar
- Vasconic languages: Basque
- Sinitic languages: Chinese
- Austronesian languages: Malay, Indonesian
Languages by region
- Volga-Kama: Tatar, Bashqort, Chuvash
- Balkans: Albanian, Bulgarian, Greek, Macedonian, Serbo-Croatian, Romanian
- Caucasus: Kumyk, Nogay, Armenian, Avar
- Central Asia: Kazakh, Kyrgyz, Turkmen, Uzbek, Karakalpak
- former Soviet Union: Kyrgyz, Kazakh, Azeri, Turkmen, Tatar, Bashqort, Chuvash, Armenian, Tajik, Avar, Uyghur, Karakalpak, Uzbek, Kumyk, Sakha, Tuvan, Latvian, Gagauz
- Languages of Spain: Spanish, Basque, Catalan, Asturian, Galician, Aragonese
- Languages of the Baltics: Latvian, Estonian
Language family pages
Language family pages exist to show the overall progress of monolingual language modules for regions and language families of interest. Currently the following pages are (or should soon be) available on this wiki (family names in bold use the format described below):
- Turkic languages
- Uralic languages
- Indic languages
- Dravidian languages
- Balkan languages
- Celtic languages
- Languages of the Volga-Kama region
- Iranian languages
- Slavic languages
- Mongolic languages
- Semitic languages
- Germanic languages
- Romance languages
- Languages of the Caucasus
- Languages of Central Asia
- Languages of the former Soviet Union
The language family pages should represent the following data in a standardised format:
- The languages of that group with apertium data (whether in languages, incubator, part of a pair in trunk, etc.)
- 2- and 3-letter ISO codes for each language
- The formalism the module is written in
- Links to the pages for each language on the apertium wiki
- The location in the apertium repository (whether in languages, incubator, part of a pair in trunk, etc.)
- Development status, which should be one of the following:
- production - for language modules used in a released pair, usually over 90% coverage and/or over 10,000 stems
- working - for language modules with near-production-quality performance, usually over 80% coverage and/or over 8,000 stems
- development - for language modules under development, usually over 60% coverage and/or over 1,000 stems
- prototype - for language modules that have not received heavy development, usually less than 60% coverage or under 1,000 stems
Here are status guidelines summarised in a table:
status | description | stems | coverage | bidix table |
---|---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% | |
development | language module under development | ≥1,000 | ≥60% | |
working | language module with near-production-quality performance | ≥8,000 | ≥80% | |
production | language module used in a released pair | ≥10,000 | ≥90% |
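As a rough illustration, the thresholds in the table can be turned into a small lookup. This sketch makes two assumptions that the page does not settle: the "and/or" in the guidelines is read strictly as "and" (both the stem and the coverage threshold must be met), and whether a module is actually used in a released pair is not checked. The function name suggest_status is made up for this example.

```python
# Thresholds copied from the status table: (status, minimum stems, minimum coverage in %).
# Ordered from the most to the least demanding tier.
TIERS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]


def suggest_status(stems, coverage):
    """Return the highest status whose stem and coverage thresholds are both met.

    Assumption: "and/or" is read as "and". Use in a released pair,
    which the guidelines tie to "production", is not checked here.
    """
    for status, min_stems, min_coverage in TIERS:
        if stems >= min_stems and coverage >= min_coverage:
            return status
    # Anything below the "development" thresholds counts as a prototype.
    return "prototype"
```

With figures from the coverage table, suggest_status(36595, 94.5) (Kazakh) gives "production", while suggest_status(1385, 81.4) (Nogay) gives "development" because its stem count is below 8,000 under this strict reading.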
Additionally, the following data is put on apertium-xxx/stats pages, and is included on the language family page and other places as relevant:
- The number of stems (and paradigms if relevant) in that language module
- The coverage of the transducer on a variety of corpora
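In the page source, the coverage table pulls these figures in by transclusion from the stats pages. The sketch below shows how one row of that table is assembled; the wikitext it produces mirrors the existing rows, but the helper name coverage_row and its has_coverage flag are invented for illustration.

```python
def coverage_row(code, language, has_coverage=False):
    """Build one wikitable row for the "Languages by coverage" table.

    `code` is the three-letter language code used in the module name, e.g. "kaz".
    """
    module = f"apertium-{code}"
    # The stem count is transcluded from the 'stems' section of the stats page.
    stems = f"{{{{#lst:{module}/stats|stems}}}}"
    if has_coverage:
        # The average coverage lives on a subpage of the capitalised module page.
        page = f"Apertium-{code}"
        coverage = f"[[{page}#Current_State|~{{{{:{page}/stats/average}}}}%]]"
    else:
        # Modules without measured coverage show a dash.
        coverage = "-"
    return (
        f'| {module}|| [[{language}]] ||align="right"| {stems} '
        f'||align="center"| {coverage}'
    )
```

Deriving both the link target and the transcluded page from the same code keeps the two in step, which avoids a row linking to one module's page while showing another module's average.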
There should also be a table of language pairs available with these languages, with number of stems from apertium-xxx-yyy/stats pages on the wiki. Guidelines for font semantics for the pairs follow:
- production / trunk = bold
- working / staging = bold+italics
- development / nursery = normal
- prototype / incubator = italics
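The font guidelines above map directly onto wikitext quote markup. A minimal sketch follows, assuming the pair name is plain text; the names MARKUP and format_pair are hypothetical.

```python
# Wikitext markup for each development status, following the list above.
MARKUP = {
    "production": "'''{}'''",     # bold
    "working": "'''''{}'''''",    # bold + italics
    "development": "{}",          # normal
    "prototype": "''{}''",        # italics
}


def format_pair(name, status):
    """Wrap a pair name in the wikitext markup that signals its status."""
    return MARKUP[status].format(name)
```

For example, format_pair("apertium-fie-bar", "prototype") yields ''apertium-fie-bar'', which renders in italics; an unknown status raises a KeyError rather than silently falling back to normal text.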
Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.