Difference between revisions of "User:Sushain/BalkanLangsConvert"
Latest revision as of 08:56, 24 December 2013
The Balkan languages are the languages spoken in the Balkans, possibly forming part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, and Serbo-Croatian, among others.
The master plan is to build an independent finite-state transducer for each language, and then to write a dictionary and transfer rules for every language pair. The current status of these goals is listed below.
Status
The ultimate goal is to have reusable, multi-purpose transducers for a variety of Balkan languages. These can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and a dictionary and transfer rules for the pair X→Y. Development progress for each language's transducer and for each dictionary pair is listed below.
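As a rough illustration of how much work the pairing scheme implies, the sketch below enumerates every directed pair X→Y over the nine language codes used in the tables on this page and lists the resources each pair would need. The function name `resources_for` and the resource labels are hypothetical and chosen only for this example; in particular, using the target-language transducer for generation is an assumption, not something stated above.

```python
from itertools import permutations

# Language codes taken from the table headers on this page.
LANGUAGES = ["bul", "mkd", "ron", "rup", "sqi", "ell", "hbs", "slv", "tur"]


def resources_for(source: str, target: str) -> dict:
    """Return illustrative labels for what an X->Y pair needs (hypothetical names)."""
    return {
        "analyser": f"{source} transducer",
        # Assumption: the target transducer is reused for generation.
        "generator": f"{target} transducer",
        "constraint_grammar": f"{source} CG",
        "bilingual": f"{source}-{target} dictionary and transfer rules",
    }


# Every ordered pair of distinct languages, i.e. every translation direction.
directed_pairs = list(permutations(LANGUAGES, 2))

if __name__ == "__main__":
    print(len(directed_pairs))  # 9 * 8 = 72 directions
    print(resources_for("bul", "mkd"))
```

Each transducer and constraint grammar is written once per language, while only the bilingual resources grow with the number of pairs, which is the reason for keeping the transducers independent.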
Transducers
Once a transducer reaches roughly 80% coverage on a range of medium-to-large corpora, we consider it "working". Above 90% coverage, it can be considered "production".
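The thresholds above can be checked with a naive coverage count. The following is a minimal sketch, assuming the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token is marked with a leading `*` in its analysis. The function names and the "in development" label are assumptions for this example, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# (simplified: escaped "/" or "$" inside a token are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown words are output as ^word/*word$, so the analysis starts with "*".
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return known / len(units)


def status(cov: float) -> str:
    """Map a coverage fraction to the labels used on this page."""
    if cov > 0.90:
        return "production"
    if cov >= 0.80:
        return "working"
    return "in development"  # hypothetical label for anything below ~80%


if __name__ == "__main__":
    sample = "^kuća/kuća<n><f><sg><nom>$ ^je/biti<vbser><pres><p3><sg>$ ^xyzzy/*xyzzy$"
    cov = coverage(sample)
    print(f"{cov:.1%}", status(cov))
```

Running the same count over several corpora, rather than one, matches the "range of medium-to-large corpora" condition; a single corpus can overstate coverage if the dictionary was built from it.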
Balkan Language Classification [1]
Existing language pairs
Balkan-Balkan pairs
Italic text denotes language pairs under development (in the incubator). Regular text denotes a functioning language pair in staging, and bold text denotes a stable, well-working language pair in trunk.
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| bul | - | mk-bg | | | | bg-el | | | |
| mkd | | - | | | mk-sq | | | | |
| ron | | | - | ron-rup | | | | | |
| rup | | | | - | | | | | |
| sqi | | | | | - | | | | |
| ell | | | | | | - | | | |
| hbs | | sh-mk | | | | | - | hbs-slv | |
| slv | | sl-mk | | | | | | - | |
| tur | | | | | | | | | - |
Pairs with non-Balkan languages
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| el | *bg-el* | | | | | | | | |
| ru | *bg-ru* | | | | | | *hbs-rus* | | |
| en | *bg-en* | **mk-en** | | | *en-sq* | *ell-eng* | *sh-en* | | *tr-en* |
| it | | | *ro-it* | | | | | *sl-it* | |
| spa | | | | | | | | *slv-spa* | |
| pol | | | | | | | | *slv-pol* | |
| eo | *eo-bg* | | | | | *eo-el* | | | |
| fr | | | *fr-ro* | | | | | | |
| ca | | | *ca-ro* | | | | | | |
| es | | | **es-ro** | | | | | | |
| cs | | | | | | | | *cs-sl* | |
| kir | | | | | | | | | *tur-kir* |
| tat | | | | | | | | | *tur-tat* |
| uzb | | | | | | | | | *tur-uzb* |
| aze | | | | | | | | | tur-aze |
| cv | | | | | | | | | *cv-tr* |
Existing
Monolingual
| Language | Module | Paradigms | Lemmata | Coverage (SETimes) | Coverage (Wikipedia) |
|---|---|---|---|---|---|
| Bulgarian | Macedonian and Bulgarian | 305 | 7873 | 88.1% | 77.15% |
| Macedonian | Macedonian and Bulgarian | 225 | 8094 | 92.1% | |
| Romanian | Spanish and Romanian | 997 | 18719 | 89.7% | 83.62% |
| Aromanian | Incubator | 17 | 28 | - | |
| Albanian | Incubator | 127 | 3302 | 80.2% | 65.62% |
| Greek | Incubator | 377 | 859 | 49.4% | 49.75% |
| Serbo-Croatian | Incubator | 85 | 660 | - | |
| Slovenian | Incubator | 1128 | 20385 | - | |
| Turkish | (external: TRMorph) | - | 37101 | | |
Languages missing: Roma