Difference between revisions of "Languages"

From Apertium
 
(14 intermediate revisions by 4 users not shown)
Line 2: Line 2:
:''If you are looking for the category, click [[:Category:Languages|here]]''
:''If you are looking for the category, click [[:Category:Languages|here]]''


'''/languages/''' is a module of the [[SVN]] where monolingual language data lives. Monolingual language data in Apertium is slowly being moved to this new repository scheme. (Originally, all monolingual language data was found in language pairs, meaning that there was a lot of duplication.) If you feel something is missing, please feel free to [[contact|contact us]].
Monolingual language data lives in [https://github.com/apertium/apertium-languages apertium-languages]. Monolingual language data in Apertium is slowly being moved to this new repository scheme. (Originally, all monolingual language data was found in language pairs, meaning that there was a lot of duplication.) If you feel something is missing, please feel free to [[contact|contact us]].


New monolingual packages should be developed in incubator until they're minimally useful, at which point they can go in /languages/. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful.
New monolingual packages should be developed as [https://github.com/apertium/apertium-incubator incubator] languages until they're minimally useful, at which point they can go in apertium-languages. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful.


The /languages/ module can be found in svn at [https://svn.code.sf.net/p/apertium/svn/languages https://svn.code.sf.net/p/apertium/svn/languages] .
Apertium-languages can be found in GitHub at [https://github.com/apertium/apertium-languages apertium-languages]. You may also want to browse all the languages and pairs using the [https://apertium.github.io/apertium-on-github/source-browser.html Apertium source browser].


==Contents==
==Contents==


=== Languages by coverage ===
=== Languages by coverage ===
{{div col begin|colwidth=30em}}

{|class="wikitable sortable"
{|class="wikitable sortable"
! Module !! Language !! Entries !! Coverage
! Module !! Language !! Entries !! Coverage
Line 38: Line 38:
|-
|-
| apertium-chv|| [[Chuvash]] ||align="right"| {{#lst:apertium-chv/stats|stems}} ||align="center"| [[Apertium-chv#Current_State|~{{:Apertium-chv/stats/average}}%]]
| apertium-chv|| [[Chuvash]] ||align="right"| {{#lst:apertium-chv/stats|stems}} ||align="center"| [[Apertium-chv#Current_State|~{{:Apertium-chv/stats/average}}%]]
|-
| apertium-crh|| [[Crimean Tatar]] ||align="right"| {{#lst:apertium-crh/stats|stems}} ||align="center"| [[Apertium-crh#Current_State|~{{:Apertium-crh/stats/average}}%]]
|-
|-
| apertium-cym|| [[Welsh]] ||align="right"| {{#lst:apertium-cym/stats|stems}} ||align="center"| -
| apertium-cym|| [[Welsh]] ||align="right"| {{#lst:apertium-cym/stats|stems}} ||align="center"| -
Line 140: Line 142:
|-
|-
| apertium-uzb|| [[Uzbek]] ||align="right"| {{#lst:apertium-uzb/stats|stems}} ||align="center"| [[Apertium-uzb#Current_State|~{{:Apertium-uzb/stats/average}}%]]
| apertium-uzb|| [[Uzbek]] ||align="right"| {{#lst:apertium-uzb/stats|stems}} ||align="center"| [[Apertium-uzb#Current_State|~{{:Apertium-uzb/stats/average}}%]]
|-
| apertium-yid|| [[Yiddish]] ||align="right"| {{#lst:apertium-yid/stats|stems}} ||align="center"| [[Apertium-yid#Current_State|~{{:Apertium-yid/stats/average}}%]]
|-
|-
| apertium-zho|| [[Chinese]] ||align="right"| {{#lst:apertium-zho/stats|stems}} ||align="center"| -
| apertium-zho|| [[Chinese]] ||align="right"| {{#lst:apertium-zho/stats|stems}} ||align="center"| -
Line 146: Line 150:
|-
|-
|}
|}
{{div col end}}


=== Languages by family ===
=== Languages by family ===
Line 152: Line 157:
** Oghuz: [[Turkmen]], [[Turkish]]
** Oghuz: [[Turkmen]], [[Turkish]]
** Kypchak: [[Kazakh]], [[Kyrgyz]], [[Tatar]], [[Bashqort]], [[Kumyk]], [[Nogay]], [[Karakalpak]]
** Kypchak: [[Kazakh]], [[Kyrgyz]], [[Tatar]], [[Bashqort]], [[Kumyk]], [[Nogay]], [[Karakalpak]]
** Other: [[Uzbek]], [[Chuvash]], [[Sakha]], [[Tuvan]]
** Karluk: [[Uzbek]], [[Uyghur]]
** Other: [[Chuvash]], [[Sakha]], [[Tuvan]]
* [[Indo-European languages|Indo-European]]
* [[Indo-European languages|Indo-European]]
** [[Slavic languages|Slavic]]: Russian, Serbo-Croatian, Macedonian, Czech, Bulgarian, Ukrainian, Polish, [[Slovenian]]
** [[Slavic languages|Slavic]]: Russian, Serbo-Croatian, Macedonian, Czech, Bulgarian, Ukrainian, Polish, [[Slovenian]]
** [[Celtic languages|Celtic]]: Scottish Gaelic, Breton, Welsh, Manx
** [[Celtic languages|Celtic]]: Scottish Gaelic, Breton, Welsh, Manx
** [[Germanic languages|Germanic]]
** [[Germanic languages|Germanic]]
*** [[West Germanic languages|West Germanic]]: Dutch, Afrikaans, English, German, Luxembourgish
*** [[West Germanic languages|West Germanic]]: Dutch, Afrikaans, English, German, Luxembourgish, [[Yiddish]]
*** [[North Germanic languages|North Germanic]]: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese
*** [[North Germanic languages|North Germanic]]: Danish, Icelandic, Norwegian (nno, nob), Swedish, Faroese
** [[Romance languages|Romance]]: [[Aragonese]], [[Asturian]], [[Catalan]], [[Spanish]], [[French]], [[Galician]], [[Italian]], [[Portuguese]], [[Sardinian]], [[Romanian]], [[Corsican]]
** [[Romance languages|Romance]]: [[Aragonese]], [[Asturian]], [[Catalan]], [[Spanish]], [[French]], [[Galician]], [[Italian]], [[Portuguese]], [[Sardinian]], [[Romanian]], [[Corsican]]
** [[Indic languages|Indic]]: [[Urdu]], [[Bengali]], [[Hindi]], [[Sanskrit]]
** [[Indic languages|Indic]]: [[Urdu]], [[Bengali]], [[Hindi]], [[Sanskrit]], [[Marathi]]
** [[Baltic languages|Baltic]]: Latvian
** [[Baltic languages|Baltic]]: Latvian
** Other: [[Albanian]], Armenian, Greek
** Other: [[Albanian]], Armenian, Greek
Line 175: Line 181:
* '''[[Languages of the Caucasus|Caucasus]]''': [[Kumyk]], [[Nogay]], [[Armenian]], [[Avar]]
* '''[[Languages of the Caucasus|Caucasus]]''': [[Kumyk]], [[Nogay]], [[Armenian]], [[Avar]]
* '''[[Languages of Central Asia|Central Asia]]''': [[Kazakh]], [[Kyrgyz]], [[Turkmen]], [[Uzbek]], [[Karakalpak]]
* '''[[Languages of Central Asia|Central Asia]]''': [[Kazakh]], [[Kyrgyz]], [[Turkmen]], [[Uzbek]], [[Karakalpak]]
* [[Languages of the former Soviet Union|former Soviet Union]]: [[Kyrgyz]], [[Kazakh]], [[Azeri]], [[Turkmen]], [[Tatar]], [[Bashqort]], [[Chuvash]], [[Armenian]], [[Tajik]], [[Avar]], [[Uyghur]], [[Karakalpak]], [[Uzbek]], [[Kumyk]], [[Sakha]], [[Tuvan]]
* [[Languages of the former Soviet Union|former Soviet Union]]: [[Kyrgyz]], [[Kazakh]], [[Azeri]], [[Turkmen]], [[Tatar]], [[Bashqort]], [[Chuvash]], [[Armenian]], [[Tajik]], [[Avar]], [[Uyghur]], [[Karakalpak]], [[Uzbek]], [[Kumyk]], [[Sakha]], [[Tuvan]], [[Latvian]], [[Gagauz]]
* Languages of Spain: [[Spanish]], [[Basque]], [[Catalan]], [[Asturian]], [[Galician]], [[Aragonese]]
* Languages of Spain: [[Spanish]], [[Basque]], [[Catalan]], [[Asturian]], [[Galician]], [[Aragonese]]
* Languages of the Baltics: [[Latvian]], [[Estonian]]


== Language family pages ==
== Language family pages ==
Line 247: Line 254:
Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.
Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.


==Requiring a monolingual package as a dependency of a pair==
Say you want apertium-[http://www.ethnologue.com/language/fie fie]-[http://www.ethnologue.com/language/bar bar] to depend on some monolingual data from the apertium-bar package, e.g. <code>apertium-bar/bar.rlx</code> and maybe other such files.

This requires a recent version of apertium (-r51152 or later), and that you've exported <code>PKG_CONFIG_PATH</code> as described at [[Minimal_installation_from_SVN]].

Assuming apertium-bar is set up correctly (see next section), you can put the following line into the <code>configure.ac</code> of apertium-fie-bar:
<pre>
AP_CHECK_LING([2], [apertium-bar])
</pre>

and in the <code>Makefile.am</code>, you can write rules like this:
<pre>
bar-fie.rlx.bin: $(AP_SRC2)/bar.rlx
	cg-comp $< $@

bar-fie.automorf.bin: $(AP_LIB2)/bar.automorf.bin
	cp $< $@
</pre>

Similarly for apertium-fie (with <code>AP_CHECK_LING([1], [apertium-fie])</code>). By convention, a language pair called apertium-fie-bar should use the number 1 for fie and 2 for bar (though variants like 1b are possible too). Also by convention, AP_SRC should point to source files and AP_LIB to compiled binaries (this is the responsibility of the monolingual package, e.g. apertium-bar).

Now if you've typed "make install" in apertium-bar before running autogen.sh in apertium-fie-bar, apertium-fie-bar will use the bar.rlx and bar.automorf.bin which are installed by apertium-bar.


If you often make a lot of changes to apertium-bar and want to avoid having to "make install" for each and every change, you can do this in apertium-fie-bar:
<pre>
./autogen.sh --with-lang2=/path/to/apertium-bar
</pre>
Now each time you make, the "AP_SRC2" and "AP_LIB2" variables will both point to /path/to/apertium-bar instead of the "make install"-ed files. You can set it back to default by just running plain <code>autogen.sh</code> (or <code>./configure</code>) again.


See [[Installation_troubleshooting#AP_CHECK_LING_not_found_when_running_configure_or_autogen.sh]] if you run into errors about AP_CHECK_LING.

===Build the monolingual dependencies when you type 'make' in the language pair===
Add
<pre>
SUBDIRS=$(AP_SUBDIRS)
</pre>
to <code>Makefile.am</code> in the language pair, and point the pair at the monolingual source directory with a --with-lang option when configuring (e.g. <code>./autogen.sh --with-lang2=/path/to/apertium-bar</code>).

Now when you type 'make' in the language pair, it'll first go into apertium-bar and do a make there, then continue with make in apertium-fie-bar.

===Making a monolingual package dependable for pairs===
In apertium-bar, there should be a file <code>apertium-bar.pc.in</code>. This has to have the following lines:
<pre>
dir=@libdir@/apertium/apertium-bar
srcdir=@datarootdir@/apertium/apertium-bar
</pre>
These should correspond to where the binaries and source files respectively are installed by the <code>Makefile.am</code> in the monolingual package (typically the makefile names these directories <code>apertium_bardir</code> and <code>apertium_bar_srcdir</code>).

The <code>configure.ac</code> should have a line saying something like <code>AC_OUTPUT([Makefile apertium-bar.pc])</code>. See https://svn.code.sf.net/p/apertium/svn/languages/apertium-nob for a working example.

Compiled / binary files should be listed in TARGETS_COMMON as usual, while any source files can be installed using install-data-local, e.g.:

<pre>
apertium_bar_srcdir=$(prefix)/share/apertium/$(BASENAME)/
install-data-local:
	test -d $(DESTDIR)$(apertium_bar_srcdir) || mkdir -p $(DESTDIR)$(apertium_bar_srcdir)
	$(INSTALL_DATA) $(BASENAME).$(LANG1).dix $(DESTDIR)$(apertium_bar_srcdir)
</pre>

Now if the apertium-fie-bar pair depends on apertium-bar as its lang2, it can refer to binaries (apertium-bar's TARGETS_COMMON) using $(AP_LIB2) and source files using $(AP_SRC2), e.g. <code>$(AP_SRC2)/apertium-$(LANG2).$(LANG2).dix</code> for the dix file in the install-data-local example above.
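
The way the two variables are chosen can be modelled in a few lines of Python. This is only a sketch of the behaviour described on this page, not the actual <code>AP_CHECK_LING</code> macro: the function name <code>resolve_lang_dirs</code> and the example install prefix are hypothetical, and the <code>dir</code>/<code>srcdir</code> keys stand for the two variables in <code>apertium-bar.pc.in</code>.

```python
def resolve_lang_dirs(with_lang=None, pc_vars=None):
    """Return (AP_SRC, AP_LIB) for one monolingual dependency.

    with_lang: path given via --with-langN, or None if not given.
    pc_vars:   the 'dir' and 'srcdir' variables from the installed .pc file.
    """
    if with_lang is not None:
        # With --with-langN, sources and binaries both come from the checkout.
        return with_lang, with_lang
    # Otherwise fall back to the "make install"-ed locations.
    return pc_vars["srcdir"], pc_vars["dir"]


# Hypothetical install prefix, for illustration only.
installed = {
    "dir": "/usr/local/lib/apertium/apertium-bar",
    "srcdir": "/usr/local/share/apertium/apertium-bar",
}

print(resolve_lang_dirs(pc_vars=installed))
print(resolve_lang_dirs(with_lang="/path/to/apertium-bar", pc_vars=installed))
```

The first call corresponds to running plain <code>autogen.sh</code>, the second to <code>./autogen.sh --with-lang2=/path/to/apertium-bar</code>.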


==See also==
==See also==
Line 319: Line 264:
[[Category:Terminology]]
[[Category:Terminology]]
[[Category:Documentation in English]]
[[Category:Documentation in English]]
[[Category:Languages|*]]

Latest revision as of 22:33, 1 February 2019

If you are looking for the category, click here

Monolingual language data lives in apertium-languages. Monolingual language data in Apertium is slowly being moved to this new repository scheme. (Originally, all monolingual language data was found in language pairs, meaning that there was a lot of duplication.) If you feel something is missing, please feel free to contact us.

New monolingual packages should be developed as incubator languages until they're minimally useful, at which point they can go in apertium-languages. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful.
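
The rule of thumb above can be written down as a small check. This is a rough sketch rather than an official criterion: the function name is hypothetical, and requiring every corpus to exceed 60% is one assumed reading of "a variety of corpora" (an average would be another).

```python
def is_minimally_useful(stems, coverages, min_stems=2500, min_coverage=60.0):
    """Apply the informal guideline: at least 2500 stems, over 60% coverage.

    coverages: coverage percentages measured on several corpora.
    Assumption: every corpus must be above the threshold.
    """
    if not coverages:
        # No coverage figures means the guideline cannot be checked.
        return False
    return stems >= min_stems and min(coverages) > min_coverage


print(is_minimally_useful(2988, [70.7, 68.0]))
print(is_minimally_useful(3000, [75.0, 55.0]))
```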

Apertium-languages can be found in GitHub at apertium-languages. You may also want to browse all the languages and pairs using the Apertium source browser.

Contents

Languages by coverage

{|class="wikitable sortable"
! Module !! Language !! Entries !! Coverage
|-
| apertium-afr || Afrikaans || 7577 || -
|-
| apertium-ara || Arabic || 6,127 || -
|-
| apertium-arg || Aragonese || 26,068 || -
|-
| apertium-ast || Asturian || 498 || -
|-
| apertium-ava || Avar || 4,904 || ~86.5%
|-
| apertium-bak || Bashkir || 46,501 || ~66%
|-
| apertium-ben || Bengali || 8,230 || ~74%
|-
| apertium-bre || Breton || 18,249 || -
|-
| apertium-bul || Bulgarian || 8,578 || -
|-
| apertium-cat || Catalan || 95604 || -
|-
| apertium-ces || Czech || 41,199 || ~90.5%
|-
| apertium-chv || Chuvash || 10,267 || ~85%
|-
| apertium-crh || Crimean Tatar || 11,757 || ~85.4%
|-
| apertium-cym || Welsh || 11,015 || -
|-
| apertium-dan || Danish || 52,133 || -
|-
| apertium-deu || German || 74,339 || -
|-
| apertium-ell || Greek || 2,460 || -
|-
| apertium-eng || English || 62,609 || -
|-
| apertium-eus || Basque || 11,471 || -
|-
| apertium-fao || Faroese || 2,318 || -
|-
| apertium-fin || Finnish || 408,216 || -
|-
| apertium-fra || French || || -
|-
| apertium-gla || Scottish Gaelic || 117 || -
|-
| apertium-glg || Galician || 31,916 || -
|-
| apertium-glv || Manx || 11,353 || -
|-
| apertium-hbs || Serbo-Croatian || 58,004 || -
|-
| apertium-heb || Hebrew || 20,932 || -
|-
| apertium-hin || Hindi || 37,833 || ~83.1%
|-
| apertium-hye || Armenian || 8,247 || -
|-
| apertium-ind || Indonesian || 12,264 || -
|-
| apertium-isl || Icelandic || 8,770 || -
|-
| apertium-ita || Italian || 25,609 || -
|-
| apertium-kaa || Karakalpak || 25,545 || ~86.1%
|-
| apertium-kaz || Kazakh || 36,595 || ~94.5%
|-
| apertium-kir || Kyrgyz || 14,424 || ~90.4%
|-
| apertium-kmr || Kurmanji || 17,771 || -
|-
| apertium-kum || Kumyk || 4,918 || ~90.2%
|-
| apertium-ltz || Luxembourgish || 11,882 || -
|-
| apertium-lvs || Latvian || 6,756 || -
|-
| apertium-mar || Marathi || 14,886 || -
|-
| apertium-mkd || Macedonian || 30,686 || ~90.5%
|-
| apertium-mlt || Maltese || 7,371 || -
|-
| apertium-nld || Dutch || 25,079 || -
|-
| apertium-nno || Norwegian Nynorsk || 182,497 || -
|-
| apertium-nob || Norwegian Bokmål || 246,281 || -
|-
| apertium-nog || Nogay || 1,385 || ~81.4%
|-
| apertium-pol || Polish || 13,972 || -
|-
| apertium-por || Portuguese || 14,796 || -
|-
| apertium-ron || Romanian || 18,878 || -
|-
| apertium-rus || Russian || 126,833 || ~89.6%
|-
| apertium-sah || Sakha || 11,531 || ~89.6%
|-
| apertium-san || Sanskrit || 123,373 || -
|-
| apertium-slv || Slovenian || 20,596 || -
|-
| apertium-spa || Spanish || 46,003 || -
|-
| apertium-sqi || Albanian || 3,312 || ~80.2%
|-
| apertium-srd || Sardinian || 46,642 || -
|-
| apertium-swe || Swedish || 138,490 || -
|-
| apertium-tat || Tatar || 55,702 || ~91%
|-
| apertium-tuk || Turkmen || 2,988 || ~70.7%
|-
| apertium-tur || Turkish || 17,221 || ~87.3%
|-
| apertium-tyv || Tuvan || 11,695 || ~92.7%
|-
| apertium-ukr || Ukrainian || 10,709 || -
|-
| apertium-urd || Urdu || 14,943 || ~64.6%
|-
| apertium-uzb || Uzbek || 34,470 || ~82.9%
|-
| apertium-yid || Yiddish || 378 || ~62.5%
|-
| apertium-zho || Chinese || 8,521 || -
|-
| apertium-zlm || Malay || 11,894 || -
|}

Languages by family

Languages by region

Language family pages

Language family pages exist to show the overall progress of monolingual language modules for regions and language families of interest. Currently the following pages are (or should soon be) available on this wiki (family names in bold use the format described below):

The language family pages should represent the following data in a standardised format:

  • The languages of that group with apertium data (whether in languages, incubator, part of a pair in trunk, etc.)
  • 2- and 3-letter ISO codes for each language
  • The formalism the module is written in
  • Links to the pages for each language on the apertium wiki
  • The location in the apertium repository (whether in languages, incubator, part of a pair in trunk, etc.)
  • Development status, which should be one of the following:
    • production - for language modules used in a released pair, usually over 90% coverage and/or over 10,000 stems
    • working - for language modules with near-production-quality performance, usually over 80% coverage and/or over 8,000 stems
    • development - for language modules under development, usually over 60% coverage and/or over 1,000 stems
    • prototype - for language modules that have not received heavy development, usually less than 60% coverage or under 1,000 stems

Here are status guidelines summarised in a table:

{|class="wikitable"
! status !! description !! stems !! coverage !! bidix table
|-
| prototype || language module that has not received heavy development || <1,000 || <60% ||
|-
| development || language module under development || ≥1,000 || ≥60% ||
|-
| working || language module with near-production-quality performance || ≥8,000 || ≥80% ||
|-
| production || language module used in a released pair || ≥10,000 || ≥90% ||
|}
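
As an illustration, the thresholds above could be applied mechanically as in the sketch below. The function name and the <code>require_both</code> switch are hypothetical; the switch exists because the guidelines say "and/or" and leave the combination open. The sketch also ignores the requirement that a production module be used in a released pair, so its result is only a suggestion.

```python
# (status, minimum stems, minimum coverage in percent), strictest first.
THRESHOLDS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]


def suggest_status(stems, coverage, require_both=True):
    """Suggest a development status from stem count and coverage.

    require_both=True reads "and/or" as "and"; False reads it as "or".
    """
    for status, min_stems, min_coverage in THRESHOLDS:
        stems_ok = stems >= min_stems
        coverage_ok = coverage >= min_coverage
        met = (stems_ok and coverage_ok) if require_both else (stems_ok or coverage_ok)
        if met:
            return status
    # Nothing matched: below every threshold.
    return "prototype"


print(suggest_status(36_595, 94.5))                     # Kazakh figures from the coverage table
print(suggest_status(4_904, 86.5))                      # Avar, both criteria required
print(suggest_status(4_904, 86.5, require_both=False))  # Avar, either criterion
```

The two readings can differ noticeably for modules with high coverage but few stems, which is why the choice is left explicit.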


Additionally, the following data is put on apertium-xxx/stats pages, and is included on the language family page and other places as relevant:

  • The number of stems (and paradigms if relevant) in that language module
  • The coverage of the transducer on a variety of corpora

There should also be a table of language pairs available with these languages, with number of stems from apertium-xxx-yyy/stats pages on the wiki. Guidelines for font semantics for the pairs follow:

  • production / trunk = bold
  • working / staging = bold+italics
  • development / nursery = normal
  • prototype / incubator = italics
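
For anyone generating such a table with a script, the font guidelines map directly onto standard MediaWiki apostrophe markup (two apostrophes for italics, three for bold, five for both). A minimal sketch, with a hypothetical helper name and an illustrative pair name:

```python
# Wiki markup template for each status, following the font guidelines above.
WIKI_MARKUP = {
    "production": "'''{}'''",     # bold
    "working": "'''''{}'''''",    # bold + italics
    "development": "{}",          # normal
    "prototype": "''{}''",        # italics
}


def format_pair(name, status):
    """Wrap a pair name in the wiki markup for its status."""
    return WIKI_MARKUP[status].format(name)


print(format_pair("apertium-fie-bar", "production"))
print(format_pair("apertium-fie-bar", "prototype"))
```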

Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.


See also