Difference between revisions of "Languages"
Firespeaker (talk | contribs) (→Language family pages: rotated version) |
Firespeaker (talk | contribs) |
Revision as of 20:58, 1 January 2014
Languages is a module in the Apertium SVN repository where monolingual language data lives. Monolingual language data in Apertium is gradually being moved to this new repository scheme. (Originally, all monolingual language data lived inside language pairs, which led to a lot of duplication.) If you feel something is missing, please feel free to contact us.
New monolingual packages should be developed in incubator until they're minimally useful, at which point they can go in languages. There is no fixed criterion for what constitutes a minimally-useful language package; generally, however, a language package should have over 60% coverage on a variety of corpora and should probably have at least 2500 stems to be considered minimally useful.
The languages module can be found in svn at https://svn.code.sf.net/p/apertium/svn/languages .
Languages by coverage
Module | Language | Entries | Coverage |
---|---|---|---|
apertium-bak | Bashkir | 46,501 | ~66% |
apertium-ben | Bengali | 8,230 | ~74% |
apertium-bul | Bulgarian | 8,578 | - |
apertium-ces | Czech | 41,199 | ~90.5% |
apertium-chv | Chuvash | 10,267 | ~85% |
apertium-dan | Danish | 52,133 | - |
apertium-ell | Greek | 2,460 | - |
apertium-fao | Faroese | - | - |
apertium-fin | Finnish | 408,216 | - |
apertium-gla | Scottish Gaelic | - | - |
apertium-hbs | Serbo-Croatian | 58,004 | - |
apertium-hin | Hindi | 37,833 | ~83.1% |
apertium-hye | Armenian | 8,247 | - |
apertium-isl | Icelandic | - | - |
apertium-kaz | Kazakh | 36,595 | ~94.5% |
apertium-kir | Kyrgyz | 14,424 | ~90.4% |
apertium-kum | Kumyk | 4,918 | ~90.2% |
apertium-lvs | Latvian | - | - |
apertium-mkd | Macedonian | 30,686 | ~90.5% |
apertium-mlt | Maltese | 7,371 | - |
apertium-nld | Dutch | - | - |
apertium-nno | Norwegian Nynorsk | 182,497 | - |
apertium-nob | Norwegian Bokmål | 246,281 | - |
apertium-nog | Nogay | 1,385 | ~81.4% |
apertium-rus | Russian | 126,833 | ~89.6% |
apertium-san | Sanskrit | 123,373 | - |
apertium-slv | Slovenian | 20,596 | - |
apertium-sqi | Albanian | 3,312 | ~80.2% |
apertium-swe | Swedish | - | - |
apertium-tat | Tatar | 55,702 | ~91% |
apertium-tuk | Turkmen | 2,988 | ~70.7% |
apertium-tur | Turkish | 17,221 | ~87.3% |
apertium-ukr | Ukrainian | 10,709 | - |
apertium-urd | Urdu | 14,943 | ~64.6% |
apertium-uzb | Uzbek | 34,470 | ~82.9% |
Languages by family
- Turkic:
- Indo-European
- Semitic: Maltese
- Uralic: Finnish
Languages by region
- Volga-Kama: Tatar, Bashkir, Chuvash
- Balkans: Albanian, Bulgarian, Greek, Macedonian, Serbo-Croatian
- Caucasus: Kumyk, Nogay
- Central Asia: Kazakh, Kyrgyz, Turkmen, Uzbek
Language family pages
Language family pages exist to show the overall progress of monolingual language modules for regions and language families of interest. Currently the following pages are (or should soon be) available on this wiki (family names in bold use the format described below):
- Turkic languages
- Uralic languages
- Indic languages
- Iranian languages
- Celtic languages
- Slavic languages
- Semitic languages
- Germanic languages
- Languages of the Volga-Kama region
- Balkan languages
- Languages of the Caucasus
- Languages of Central Asia
The language family pages should represent the following data in a standardised format:
- The languages of that group with apertium data (whether in languages, incubator, part of a pair in trunk, etc.)
- 2- and 3-letter ISO codes for each language
- The formalism the module is written in
- Links to the pages for each language on the apertium wiki
- The location in the apertium repository (whether in languages, incubator, part of a pair in trunk, etc.)
- Development status, which should be one of the following:
- production - for language modules used in a released pair, usually over 90% coverage and/or over 10,000 stems
- working - for language modules with near-production-quality performance, usually over 80% coverage and/or over 8,000 stems
- development - for language modules under development, usually over 60% coverage and/or over 1,000 stems
- prototype - for language modules that have not received heavy development, usually less than 60% coverage or under 1,000 stems
Here are the status guidelines summarised in a table, in two orientations:
status | description | stems | coverage |
---|---|---|---|
prototype | language module that has not received heavy development | <1,000 | <60% |
development | language module under development | ≥1,000 | ≥60% |
working | language module with near-production-quality performance | ≥8,000 | ≥80% |
production | language module used in a released pair | ≥10,000 | ≥90% |
status | prototype | development | working | production |
---|---|---|---|---|
description | language module that has not received heavy development | language module under development | language module with near-production-quality performance | language module used in a released pair |
stems | <1,000 | ≥1,000 | ≥8,000 | ≥10,000 |
coverage | <60% | ≥60% | ≥80% | ≥90% |
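The thresholds in these guidelines can be read as a simple decision rule. The Python sketch below is one possible reading, not an official Apertium tool: the function name `classify_status` is hypothetical, and because the guidelines say "and/or", it assumes the stricter interpretation in which a module reaches a status only if it meets both the stem and the coverage threshold. It also ignores the non-numeric criterion that a production module is used in a released pair, and treats the "Entries" figures from the coverage table as stem counts.

```python
# Thresholds from the status guidelines: (status, minimum stems, minimum coverage in %).
# Ordered from the most demanding level to the least demanding one.
THRESHOLDS = [
    ("production", 10_000, 90.0),
    ("working", 8_000, 80.0),
    ("development", 1_000, 60.0),
]


def classify_status(stems: int, coverage: float) -> str:
    """Return the highest status whose stem AND coverage thresholds are both met.

    Assumption: "and/or" in the guidelines is read strictly (both must hold).
    Anything below the "development" thresholds counts as a prototype.
    """
    for status, min_stems, min_coverage in THRESHOLDS:
        if stems >= min_stems and coverage >= min_coverage:
            return status
    return "prototype"


print(classify_status(36_595, 94.5))  # Kazakh figures from the coverage table
print(classify_status(1_385, 81.4))   # Nogay figures from the coverage table
```

Under this strict reading the Nogay figures land in "development" despite coverage above 80%, because the stem count is below 8,000; a looser "or" reading would rank it higher.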
Additionally, the following data is put on apertium-xxx/stats pages, and is included on the language family page and other places as relevant:
- The number of stems (and paradigms if relevant) in that language module
- The coverage of the transducer on a variety of corpora
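Coverage in this sense is commonly taken to mean the share of corpus tokens for which the analyser returns at least one analysis. The sketch below shows one way to compute it, assuming the corpus has already been run through the analyser and the output is in Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis starts with `*`. The function name `coverage`, the sample words and their tags are illustrative, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis[/analysis...]$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis beginning with "*" marks a token the analyser did not recognise.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Illustrative analyser output: two known tokens and one unknown token.
sample = "^houses/house<n><pl>$ ^are/be<vbser><pres>$ ^blarg/*blarg$"
print(f"{coverage(sample):.1f}%")
```

Running this over several corpora and reporting each figure separately matches the "variety of corpora" wording better than a single pooled number.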
There should also be a table of language pairs available with these languages, with number of stems from apertium-xxx-yyy/stats pages on the wiki.
Optional information includes samples of a single text in these languages and UNESCO-provided vulnerability data.
Requiring a monolingual package as a dependency of a pair
Say you want apertium-fie-bar to depend on some monolingual data from the apertium-bar package, e.g. apertium-bar/bar.rlx and maybe other such files.
This requires a recent version of apertium (-r48374 or later), and that you've exported PKG_CONFIG_PATH as described at Minimal_installation_from_SVN.
Assuming apertium-bar is set up correctly (see next section), you can put the following line into the configure.ac of apertium-fie-bar:
AP_CHECK_LING([2], [apertium-bar])
and in the Makefile.am, you can write rules like this:
bar-fie.rlx.bin: $(AP_SRC2)/bar.rlx
	cg-comp $< $@

bar-fie.automorf.bin: $(AP_LIB2)/bar.automorf.bin
	cp $< $@
Similarly for apertium-fie (with AP_CHECK_LING([1], [apertium-fie])). By convention, a language pair called apertium-fie-bar should use the number 1 for fie and 2 for bar (though variants like 1b are possible too). Also by convention, AP_SRC should point to source files and AP_LIB to compiled binaries (this is the responsibility of the monolingual package, e.g. apertium-bar).
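The numbering convention can be spelled out mechanically. The helper below is a hypothetical illustration (the name `check_ling_lines` is not part of Apertium) that derives the two conventional configure.ac lines from a pair name, assuming the pair is named apertium-xxx-yyy and each monolingual package is named apertium-xxx.

```python
def check_ling_lines(pair: str) -> list:
    """Build the conventional AP_CHECK_LING lines for a pair named apertium-xxx-yyy."""
    parts = pair.split("-")
    if len(parts) != 3 or parts[0] != "apertium":
        raise ValueError(f"unexpected pair name: {pair!r}")
    _prefix, lang1, lang2 = parts
    # Convention from the text: the first language gets number 1, the second number 2.
    return [
        f"AP_CHECK_LING([{number}], [apertium-{lang}])"
        for number, lang in enumerate((lang1, lang2), start=1)
    ]


for line in check_ling_lines("apertium-fie-bar"):
    print(line)
```

Variants such as 1b are deliberately left out of this sketch, since the text does not say how they are assigned.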
Now if you've typed "make install" in apertium-bar before running autogen.sh in apertium-fie-bar, apertium-fie-bar will use the bar.rlx and bar.automorf.bin which are installed by apertium-bar.
If you often make a lot of changes to apertium-bar and want to avoid having to "make install" for each and every change, you can do this in apertium-fie-bar:
./autogen.sh --with-lang2=/path/to/apertium-bar
Now each time you make, the "AP_SRC2" and "AP_LIB2" variables will both point to /path/to/apertium-bar instead of the "make install"-ed files. You can set it back to default by just running plain autogen.sh (or ./configure) again.
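The override behaviour described here amounts to a two-way choice. The following sketch models it in Python purely for illustration; it is not how the autoconf macro is implemented, and the function name `resolve_lang_dirs` and the installed paths in the example are hypothetical.

```python
def resolve_lang_dirs(pkg_dir: str, pkg_srcdir: str, with_lang: str = "") -> dict:
    """Return the directories a pair would use for one monolingual dependency.

    pkg_dir and pkg_srcdir stand for the installed locations reported by the
    monolingual package; with_lang stands for the value given to --with-langN.
    """
    if with_lang:
        # A local checkout overrides both variables with the same path.
        return {"AP_LIB": with_lang, "AP_SRC": with_lang}
    return {"AP_LIB": pkg_dir, "AP_SRC": pkg_srcdir}


# Hypothetical installed locations, used only for this example.
installed = ("/usr/local/lib/apertium/apertium-bar",
             "/usr/local/share/apertium/apertium-bar")

print(resolve_lang_dirs(*installed))
print(resolve_lang_dirs(*installed, with_lang="/path/to/apertium-bar"))
```

The point to notice is that with the override both variables name the same directory, so source and compiled files are expected side by side in the checkout.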
See Installation_troubleshooting#AP_CHECK_LING_not_found_when_running_configure_or_autogen.sh if you run into errors about AP_CHECK_LING.
Making a monolingual package dependable for pairs
In apertium-bar, there should be a file apertium-bar.pc.in. This has to have the following lines:
dir=@libdir@/apertium/apertium-bar
srcdir=@datarootdir@/apertium/apertium-bar
These should correspond to where the binaries and source files respectively are installed by the Makefile.am in the monolingual package (typically the makefile names these directories apertium_bardir and apertium_bar_srcdir).
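To make the relationship between the two template lines and the installed directories concrete, here is a rough Python sketch of the placeholder substitution. It is a simplification under stated assumptions: the prefix /usr/local and the fully expanded values for libdir and datarootdir are chosen for the example, whereas a real configure run typically substitutes unexpanded values such as ${exec_prefix}/lib that pkg-config expands later. The helper names `substitute` and `parse_variables` are hypothetical.

```python
import re

# The two lines the text requires in apertium-bar.pc.in.
PC_IN = """\
dir=@libdir@/apertium/apertium-bar
srcdir=@datarootdir@/apertium/apertium-bar
"""


def substitute(template: str, values: dict) -> str:
    """Replace each @name@ placeholder with the corresponding value."""
    return re.sub(r"@(\w+)@", lambda match: values[match.group(1)], template)


def parse_variables(pc_text: str) -> dict:
    """Collect simple name=value lines from the generated .pc text."""
    variables = {}
    for line in pc_text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            variables[name.strip()] = value.strip()
    return variables


# Assumed, already-expanded directory values for a /usr/local prefix.
values = {"libdir": "/usr/local/lib", "datarootdir": "/usr/local/share"}
variables = parse_variables(substitute(PC_IN, values))
print(variables["dir"])     # where compiled binaries are installed
print(variables["srcdir"])  # where source files are installed
```

If the paths printed for a real package do not match the install directories in its Makefile.am, dependent pairs will look in the wrong place.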
The configure.ac should have a line saying something like AC_OUTPUT([Makefile apertium-bar.pc]). See https://svn.code.sf.net/p/apertium/svn/languages/apertium-nob for a working example.
Compiled / binary files should be listed in TARGETS_COMMON as usual, while any source files can be installed using install-data-local, e.g.:
apertium_bar_srcdir=$(prefix)/share/apertium/$(BASENAME)/

install-data-local:
	test -d $(DESTDIR)$(apertium_bar_srcdir) || mkdir -p $(DESTDIR)$(apertium_bar_srcdir)
	$(INSTALL_DATA) $(BASENAME).$(LANG1).dix $(DESTDIR)$(apertium_bar_srcdir)
Now if the apertium-fie-bar pair depends on apertium-bar as its lang2, it can refer to binaries (apertium-bar's TARGETS_COMMON) using $(AP_LIB2) and source files using $(AP_SRC2), e.g. $(AP_SRC2)/apertium-$(LANG2).$(LANG2).dix for the dix file in the install-data-local example above.