User:Sushain/BalkanLangsConvert

Revision as of 08:56, 24 December 2013 by Sushain

The Balkan languages are the languages spoken in the Balkans, some of which may form part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, Serbo-Croatian, and a number of others.

The master plan is to build an independent finite-state transducer for each language, and then to write a dictionary and transfer rules for every pair. The current status of these goals is listed below.

Status

The ultimate goal is to have reusable transducers for a variety of Balkan languages. Any two of them can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and a dictionary and transfer rules for the pair X→Y. Development progress for each language's transducer and for the dictionary pairs is listed below.
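The division of labour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resources_for` and the tuple labels are hypothetical, and the dictionary is keyed by direction simply because the text speaks of "the pair X→Y". The point it shows is that transducers and CGs are shared across pairs, while dictionaries and transfer rules are written once per pair.

```python
def resources_for(directed_pairs):
    """Split the resources needed for a set of X->Y pairs into
    per-language (reusable) and per-pair resources.

    The labels below are illustrative only, not Apertium file names.
    """
    monolingual = set()
    per_pair = set()
    for src, tgt in directed_pairs:
        # Each language needs one transducer, however many pairs use it.
        monolingual.add(("transducer", src))
        monolingual.add(("transducer", tgt))
        # The CG is only needed on the source side of a pair.
        monolingual.add(("cg", src))
        # Dictionary and transfer rules belong to the pair itself.
        per_pair.add(("dictionary", src, tgt))
        per_pair.add(("transfer rules", src, tgt))
    return monolingual, per_pair


if __name__ == "__main__":
    mono, pair = resources_for([("mkd", "bul"), ("mkd", "sqi")])
    print(sorted(mono))
    print(sorted(pair))
```

With two pairs that share Macedonian as the source, the Macedonian transducer and CG appear only once in the per-language set.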

Transducers

A transducer is considered "working" once it reaches ~80% coverage on a range of medium-to-large corpora, and "production" once it exceeds 90%.
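As a rough illustration of these thresholds, the sketch below computes naive coverage from analyser output and maps it to a state. It assumes the usual Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*`. The function names are hypothetical, escaped characters are ignored, "~80%" is read as "at least 80", and the "development" label for anything lower is borrowed from the table below rather than defined in the text.

```python
import re

# One lexical unit in the stream format: ^surface/analysis1/analysis2$
# (simplified: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed: str) -> float:
    """Return the percentage of lexical units that received an analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words carry an analysis that starts with '*'.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


def state(coverage_percent: float) -> str:
    """Map a coverage figure to the states used on this page (assumed cut-offs)."""
    if coverage_percent > 90.0:
        return "production"
    if coverage_percent >= 80.0:
        return "working"
    return "development"


if __name__ == "__main__":
    sample = "^the/the<det><def><sp>$ ^qwxz/*qwxz$"
    print(coverage(sample), state(coverage(sample)))
```

Real coverage figures should be measured over several large corpora, as the text says. A two-word sample like the one above only shows the mechanics.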

| name | Language | ISO 639-1 | ISO 639-3 | formalism | state | stems | paradigms | coverage | location | primary authors |
|---|---|---|---|---|---|---|---|---|---|---|
| apertium-mkd | Macedonian | mk | mkd | lttoolbox | production | 30,686 | 260 | ~90.5% | apertium-mkd (languages) | Fran, Tihomir, Petkovski |
| apertium-hbs | Serbo-Croatian | sh | hbs | lttoolbox | working | 58,004 | 1,092 | ~90.5% | apertium-hbs (languages) | Fran |
| apertium-slv | Slovenian | sl | slv | lttoolbox | production | 20,596 | 1,435 | ~90.5% | apertium-hbs-slv (trunk); apertium-slv-pol (incubator); apertium-sl-mk (incubator) | Fran, Petkovski, Peradin, Horvat, Čabrilo, Dimitrijev |
| apertium-tur | Turkish | tr | tur | HFST | working | 17,221 | 1 | ~87.3% | apertium-tur (languages) | Fran, Gianluca, Sezgi Aydın |
| apertium-bul | Bulgarian | bg | bul | lttoolbox | production | 8,578 | 317 | ~80% | apertium-bul (languages) | Fran, Tihomir |
| apertium-sqi | Albanian | sq | sqi | lttoolbox | development | 3,312 | 138 | ~80.2% | apertium-sqi (languages) | Fran |
| apertium-ell | Greek | el | ell | lttoolbox | ? | 2,460 | 951 | - | apertium-ell (languages) | Fran |
| apertium-rup | Aromanian | - | rup | lttoolbox | ? | 312,005 | 26193 | - | apertium-rup (incubator) | Fran, shopskasalata |
| apertium-ron | Romanian | ro | ron | lttoolbox | ? | ? | ? | - | apertium-ron-rup (incubator) | Fran |

Balkan Language Classification [1]

Existing language pairs

Balkan-Balkan pairs

Text in italic denotes language pairs under development / in the incubator. Regular text denotes a functioning language pair in staging, while text in bold denotes a stable well-working language pair in trunk.

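|  | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| bul | - | mk-bg |  |  |  | bg-el |  |  |  |
| mkd |  | - |  |  | mk-sq |  |  |  |  |
| ron |  |  | - | ron-rup |  |  |  |  |  |
| rup |  |  |  | - |  |  |  |  |  |
| sqi |  |  |  |  | - |  |  |  |  |
| ell |  |  |  |  |  | - |  |  |  |
| hbs |  | sh-mk |  |  |  |  | - | hbs-slv |  |
| slv |  | sl-mk |  |  |  |  |  | - |  |
| tur |  |  |  |  |  |  |  |  | - |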
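To see how much of the Balkan-Balkan matrix is still empty, the pairs above can be treated as unordered sets and compared against all combinations of the nine languages. This is a minimal sketch: the names `LANGUAGES`, `EXISTING`, and `missing_pairs` are hypothetical, the two-letter pair names are rewritten with the three-letter codes from the column headers, and translation direction and development status are ignored.

```python
from itertools import combinations

LANGUAGES = ["bul", "mkd", "ron", "rup", "sqi", "ell", "hbs", "slv", "tur"]

# Pairs listed in the matrix, as unordered sets of three-letter codes.
EXISTING = {
    frozenset({"mkd", "bul"}),  # mk-bg
    frozenset({"bul", "ell"}),  # bg-el
    frozenset({"mkd", "sqi"}),  # mk-sq
    frozenset({"ron", "rup"}),  # ron-rup
    frozenset({"hbs", "mkd"}),  # sh-mk
    frozenset({"hbs", "slv"}),  # hbs-slv
    frozenset({"slv", "mkd"}),  # sl-mk
}


def missing_pairs(languages, existing):
    """Return every unordered language pair that has no entry in `existing`."""
    return [
        (a, b)
        for a, b in combinations(languages, 2)
        if frozenset({a, b}) not in existing
    ]


if __name__ == "__main__":
    todo = missing_pairs(LANGUAGES, EXISTING)
    print(len(todo), "pairs without any existing work")
    for a, b in todo:
        print(f"{a}-{b}")
```

Nine languages give 36 unordered pairs, so with the seven listed above the script reports 29 pairs without any existing work.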

Pairs with non-Balkan languages

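|  | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| el | bg-el |  |  |  |  |  |  |  |  |
| ru | bg-ru |  |  |  |  |  | hbs-rus |  |  |
| en | bg-en | mk-en |  |  | en-sq | ell-eng | sh-en |  | tr-en |
| it |  |  | ro-it |  |  |  |  | sl-it |  |
| spa |  |  |  |  |  |  |  | slv-spa |  |
| pol |  |  |  |  |  |  |  | slv-pol |  |
| eo | eo-bg |  |  |  |  | eo-el |  |  |  |
| fr |  |  | fr-ro |  |  |  |  |  |  |
| ca |  |  | ca-ro |  |  |  |  |  |  |
| es |  |  | es-ro |  |  |  |  |  |  |
| cs |  |  |  |  |  |  |  | cs-sl |  |
| kir |  |  |  |  |  |  |  |  | tur-kir |
| tat |  |  |  |  |  |  |  |  | tur-tat |
| uzb |  |  |  |  |  |  |  |  | tur-uzb |
| aze |  |  |  |  |  |  |  |  | tur-aze |
| cv |  |  |  |  |  |  |  |  | cv-tr |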

Existing

Monolingual

| Language | Module | Paradigms | Lemmata | Coverage (SETimes) | Coverage (Wikipedia) |
|---|---|---|---|---|---|
| Bulgarian | Macedonian and Bulgarian | 305 | 7873 | 88.1% | 77.15% |
| Macedonian | Macedonian and Bulgarian | 225 | 8094 | 92.1% |  |
| Romanian | Spanish and Romanian | 997 | 18719 | 89.7% | 83.62% |
| Aromanian | Incubator | 17 | 28 | - |  |
| Albanian | Incubator | 127 | 3302 | 80.2% | 65.62% |
| Greek | Incubator | 377 | 859 | 49.4% | 49.75% |
| Serbo-Croatian | Incubator | 85 | 660 | - |  |
| Slovenian | Incubator | 1128 | 20385 | - |  |
| Turkish | (external: TRMorph) | - | 37101 |  |  |

Languages missing: Romani

Bilingual

Language pairs

See also

External links