Languages of the Volga-Kama region

From Apertium
Revision as of 22:36, 27 December 2013 by Sushain (talk | contribs) (→‎Transducers)

The languages of the Volga-Kama region include several Turkic and Uralic languages spoken along the Volga and Kama rivers in Russia. These include [varieties of] Tatar, Bashqort, Chuvash, Mari, Komi, Mordvin, and Udmurt (and, linguistically, to some extent Russian).

The master plan involves generating independent finite-state transducers for each language, and then making individual dictionaries and transfer rules for every pair. The current status of these goals is listed below.


The ultimate goal is to have multi-purpose transducers for a variety of Volga-Kama languages. These can then be paired for X→Y translation by adding a CG for language X and transfer rules and a dictionary for the pair X→Y. Development progress for each language's transducer and for the dictionary pairs is listed below.


Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".
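The thresholds above can be sketched in a few lines of Python. This is only an illustration, not the project's own tooling: it assumes the analyser output is in Apertium stream format, where each token is written as `^surface/analysis$` and unknown words are marked with a leading `*` (as in `^xyz/*xyz$`). The function names `coverage` and `status` are hypothetical, and the "development" label for anything below 80% is an assumption borrowed from the state column of the table below.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words carry an analysis that starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def status(cov: float) -> str:
    """Map a coverage value (0.0 to 1.0) to the labels used on this page."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    # Assumption: everything below the "working" threshold is "development".
    return "development"


if __name__ == "__main__":
    sample = "^бу/бу<prn><dem>$ ^китап/китап<n><nom>$ ^xyz/*xyz$ ^./.<sent>$"
    cov = coverage(sample)
    print(f"{cov:.0%} -> {status(cov)}")
```

This counts naive token coverage only: a token counts as covered when it has any analysis, whether or not that analysis is correct. A real evaluation would run it over several corpora rather than a single sentence.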

| name | Language | ISO 639-1 | ISO 639-3 | speakers | UNESCO | formalism | state | stems | coverage | location | primary authors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| apertium-tat | Tatar | tt | tat | 6500K | 0 (none) | HFST (lexc+twol) | production | 55,702 | ~91% | apertium-tat (languages) | Ilnar, Fran, Jonathan, Milli |
| apertium-chv | Chuvash | cv | chv | 1325K | 1 (vulnerable) | HFST (lexc+twol) | development | 8,579 | ~85% | apertium-chv (languages) | Hèctor |
| apertium-bak | Bashkir | ba | bak | 1379K | 1 (vulnerable) | HFST (lexc+twol) | development | 2,827 | ~66% | apertium-bak (languages) | Fran, Jonathan, Ilnar, Milli |
| apertium-udm | Udmurt | | udm | 464K | 2 (definitely endangered) | HFST (lexc+twol) | development | ? | ? | apertium-fin-udm (incubator), apertium-udm-rus (nursery) | Fran, Trond, Andrey, Лукерья, Алексей |
| apertium-myv | Erzya | | myv | 400K | 2 (definitely endangered) | HFST (lexc+twol) | development | ? | | apertium-myv-fin (incubator) | Fran, Jack Rueter |
| apertium-mhr | Eastern Mari | | mhr | 414K | 2 (definitely endangered) | HFST (lexc+twol) | development | ? | | apertium-kpv-mhr (incubator) | Fran, Fedina, chemyshev |
| apertium-kpv | Komi-Zyrian | | kpv | 217K | 2 (definitely endangered) | HFST (lexc+twol) | development | ? | ? | apertium-kpv-mhr (incubator), apertium-kpv-fin (incubator) | Fran, Trond, Fedina, chemyshev |
| apertium-mdf | Moksha | | mdf | 200K | 2 (definitely endangered) | | | | | | |
| apertium-koi | Komi-Permyak | | koi | 94K | 2 (definitely endangered) | | | | | | |
| apertium-mrj | Western Mari | | mrj | 37K | 3 (severely endangered) | | | | | | |
| apertium-kpvyaz | Komi-Yazva | | | 0K | 3 (severely endangered) | | | | | | |

Existing language pairs

Volga-Kama–Volga-Kama pairs

Text in italic denotes language pairs under development / in the incubator. Regular text denotes a functioning language pair in staging, while text in bold denotes a stable well-working language pair in trunk.

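| | tat | chv | bak | udm | mhr | kpv |
|---|---|---|---|---|---|---|
| tat | - | cv-tt | tat-bak | | | |
| chv | cv-tt | - | | | | |
| bak | tat-bak | | - | | | |
| udm | | | | - | | |
| mhr | | | | | - | kpv-mhr |
| kpv | | | | | kpv-mhr | - |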

Pairs with non–Volga-Kama languages

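| | tat | chv | bak | udm | mhr | kpv |
|---|---|---|---|---|---|---|
| ru | tt-ru | cv-ru | | udm-rus | | |
| ky | tt-ky | | | | | |
| tr | | cv-tr | | | | |
| fin | | | | fin-udm | | kpv-fin |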

Table of dix progress

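| | tat | chv | bak | udm | mhr | kpv |
|---|---|---|---|---|---|---|
| tat | - | ? | ? | | | |
| chv | ? | - | | | | |
| bak | ? | | - | | | |
| udm | | | | - | | |
| mhr | | | | | - | ? |
| kpv | | | | | ? | - |
| ru | ? | ? | | ? | | |
| ky | ? | | | | | |
| tr | | ? | | | | |
| fin | | | | ? | | ? |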

The languages

The following table lists the Volga-Kama varieties together with the Apertium projects related to each language.

| language | ISO | number of speakers | UNESCO classification | Apertium support |
|---|---|---|---|---|
| Tatar | tat | 6500K | 0. none | incubator/apertium-tr-tt/, incubator/apertium-tt-kk/, incubator/apertium-tt-ky/, incubator/apertium-tt-ru/, incubator/apertium-cv-tt/, nursery/apertium-tt-ba/ |
| Bashqort | bak | 1379K | 1. vulnerable | nursery/apertium-tt-ba/ |
| Chuvash | chv | 1325K | 1. vulnerable | incubator/apertium-cv-ru/, incubator/apertium-cv-tr/, incubator/apertium-cv-tt/ |
| Udmurt | udm | 464K | 2. definitely endangered | incubator/apertium-fin-udm/ |
| Mari - Eastern | mhr | 414K | 2. definitely endangered | incubator/apertium-kpv-mhr/ |
| Mordvin - Erzya | myv | 400K | 2. definitely endangered | |
| Komi - Zyryan | kpv | 217K | 2. definitely endangered | incubator/apertium-kpv-mhr/ |
| Mordvin - Moksha | mdf | 200K | 2. definitely endangered | |
| Komi - Permyak | koi | 94K | 2. definitely endangered | |
| Mari - Western | mrj | 37K | 3. severely endangered | |
| Komi - Yazva | koi | 0K | 3. severely endangered | |

Existing general resources



Existing computational resources

Corpora and corpora projects


Text-to-speech and speech-to-text systems


  • Xkb includes keyboards for the following languages:
    • Tatar
    • Chuvash
    • ...?

Morphological Transducers