User:Sushain/BalkanLangsConvert
The Balkan languages are those languages spoken in the Balkans, and possibly forming a part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, Serbo-Croatian, and a number of others.
The master plan involves generating independent finite-state transducers for each language, and then making individual dictionaries and transfer rules for every pair. The current status of these goals is listed below.
Status
The ultimate goal is to have multi-purpose transducers for a variety of Balkan languages. These can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and transfer rules and a dictionary for the pair X→Y. Development progress for each language's transducers and dictionary pairs is listed below.
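The pairing requirement described above can be sketched as a small check. This is only an illustration: the names `Resources` and `missing_for_pair` and the example inventory are hypothetical, and they do not correspond to any Apertium tool or to the actual state of the modules.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resources:
    """Hypothetical inventory of available resources, keyed by language code."""
    transducers: set[str] = field(default_factory=set)             # one per language
    constraint_grammars: set[str] = field(default_factory=set)     # one per source language
    pair_data: set[tuple[str, str]] = field(default_factory=set)   # transfer rules + dictionary, per directed pair


def missing_for_pair(src: str, tgt: str, res: Resources) -> list[str]:
    """List what is still needed before src -> tgt translation can be assembled."""
    missing = []
    # Both languages need their own independent transducer.
    for lang in (src, tgt):
        if lang not in res.transducers:
            missing.append(f"transducer for {lang}")
    # The source language additionally needs a constraint grammar.
    if src not in res.constraint_grammars:
        missing.append(f"CG for {src}")
    # Transfer rules and the dictionary are specific to the directed pair.
    if (src, tgt) not in res.pair_data:
        missing.append(f"transfer rules / dictionary for {src}->{tgt}")
    return missing


# Illustrative inventory only, not the real project status.
inventory = Resources(
    transducers={"bul", "mkd"},
    constraint_grammars={"mkd"},
    pair_data={("mkd", "bul")},
)
print(missing_for_pair("mkd", "bul", inventory))
print(missing_for_pair("bul", "mkd", inventory))
```

Treating the pair data as directed follows the X→Y wording above; a project that shares one dictionary for both directions would key `pair_data` on an unordered pair instead.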
Transducers
Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".
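The thresholds above can be written down as a short sketch. It assumes that coverage means the share of corpus tokens that receive at least one analysis, that "~80%" is read as "at least 80", and that "in development" is a placeholder label for anything lower; none of these details is stated on this page, and the function names are hypothetical.

```python
def coverage_percent(analysed_tokens: int, total_tokens: int) -> float:
    """Share of corpus tokens that received at least one analysis, in percent."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return 100.0 * analysed_tokens / total_tokens


def status(coverage: float) -> str:
    """Map a coverage percentage to the labels used on this page.

    Assumptions: "~80%" is read as >= 80, "over 90%" as > 90, and
    "in development" is a placeholder label for everything below 80.
    """
    if coverage > 90.0:
        return "production"
    if coverage >= 80.0:
        return "working"
    return "in development"


# SETimes coverage figures taken from the Monolingual table on this page.
for lang, cov in [("Bulgarian", 88.1), ("Macedonian", 92.1), ("Greek", 49.4)]:
    print(lang, status(cov))
```

Because the text asks for coverage "on a range of" corpora, a stricter variant would apply `status` to the lowest coverage measured across all corpora rather than to a single figure.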
Balkan Language Classification [1]
Existing language pairs
Balkan-Balkan pairs
Text in italic denotes language pairs under development or in the incubator. Regular text denotes a functioning language pair in staging, while bold text denotes a stable, well-working language pair in trunk.
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| bul | - | mk-bg | | | | bg-el | | | |
| mkd | | - | | | mk-sq | | | | |
| ron | | | - | ron-rup | | | | | |
| rup | | | | - | | | | | |
| sqi | | | | | - | | | | |
| ell | | | | | | - | | | |
| hbs | | sh-mk | | | | | - | hbs-slv | |
| slv | | sl-mk | | | | | | - | |
| tur | | | | | | | | | - |
Pairs with non-Balkan languages
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| el | bg-el | | | | | | | | |
| ru | bg-ru | | | | | | hbs-rus | | |
| en | bg-en | mk-en | | | en-sq | ell-eng | sh-en | | tr-en |
| it | | | ro-it | | | | | sl-it | |
| spa | | | | | | | | slv-spa | |
| pol | | | | | | | | slv-pol | |
| eo | eo-bg | | | | | eo-el | | | |
| fr | | | fr-ro | | | | | | |
| ca | | | ca-ro | | | | | | |
| es | | | es-ro | | | | | | |
| cs | | | | | | | | cs-sl | |
| kir | | | | | | | | | tur-kir |
| tat | | | | | | | | | tur-tat |
| uzb | | | | | | | | | tur-uzb |
| aze | | | | | | | | | tur-aze |
| cv | | | | | | | | | cv-tr |
Existing
Monolingual
Language | Module | Paradigms | Lemmata | Coverage (SETimes) | Coverage (Wikipedia) |
---|---|---|---|---|---|
Bulgarian | Macedonian and Bulgarian | 305 | 7873 | 88.1% | 77.15% |
Macedonian | Macedonian and Bulgarian | 225 | 8094 | 92.1% | |
Romanian | Spanish and Romanian | 997 | 18719 | 89.7% | 83.62% |
Aromanian | Incubator | 17 | 28 | - | |
Albanian | Incubator | 127 | 3302 | 80.2% | 65.62% |
Greek | Incubator | 377 | 859 | 49.4% | 49.75% |
Serbo-Croatian | Incubator | 85 | 660 | - | |
Slovenian | Incubator | 1128 | 20385 | - | |
Turkish | (external: TRMorph) | - | 37101 | | |
Languages missing: Romani