Difference between revisions of "User:Sushain/BalkanLangsConvert"
(12 intermediate revisions by the same user not shown) | |||
Line 18: | Line 18: | ||
!rowspan=2| state |
!rowspan=2| state |
||
!rowspan=2| stems |
!rowspan=2| stems |
||
!rowspan=2| paradigms |
|||
!rowspan=2| coverage |
!rowspan=2| coverage |
||
!rowspan=2| location |
!rowspan=2| location |
||
Line 24: | Line 25: | ||
! -2 |
! -2 |
||
! -3 |
! -3 |
||
|- |
|||
|| <code>[[apertium-bul]]</code> |
|||
|| [[Bulgarian]] |
|||
|| <code>bg</code> |
|||
|| <code>bul</code> |
|||
|| [[lttoolbox]] |
|||
|| production |
|||
|align="right"| {{#lst:Apertium-bul/stats|stems}} |
|||
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]] |
|||
|| [[apertium-bul]] ([[languages]]) |
|||
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]] |
|||
|- |
|- |
||
|| <code>[[apertium-mkd]]</code> |
|| <code>[[apertium-mkd]]</code> |
||
Line 43: | Line 33: | ||
|| production |
|| production |
||
|align="right"| {{#lst:Apertium-mkd/stats|stems}} |
|align="right"| {{#lst:Apertium-mkd/stats|stems}} |
||
|align="right"| {{#lst:Apertium-mkd/stats|paradigms}} |
|||
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]] |
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]] |
||
|| [[apertium-mkd]] ([[languages]]) |
|| [[apertium-mkd]] ([[languages]]) |
||
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]] |
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]] |
||
|- |
|- |
||
|| <code>[[apertium- |
|| <code>[[apertium-hbs]]</code> |
||
|| [[ |
|| [[Serbo-Croatian]] |
||
|| <code> |
|| <code>sh</code> |
||
|| <code> |
|| <code>hbs</code> |
||
|| [[lttoolbox]] |
|| [[lttoolbox]] |
||
|| |
|| working |
||
|align="right"| |
|align="right"| {{#lst:Apertium-hbs/stats|stems}} |
||
|align=" |
|align="right"| {{#lst:Apertium-hbs/stats|paradigms}} |
||
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]] |
|||
|| [[apertium-ron-rup]] ([[incubator]]) |
|||
|| [[apertium-hbs]] ([[languages]]) |
|||
|| ? |
|||
|| [[User: Francis Tyers|Fran]] |
|||
|- |
|- |
||
|| <code>[[apertium- |
|| <code>[[apertium-slv]]</code> |
||
|| [[ |
|| [[Slovenian]] |
||
|| <code> |
|| <code>sl</code> |
||
|| <code> |
|| <code>slv</code> |
||
|| [[lttoolbox]] |
|||
|| production |
|||
|align="right"| {{#lst:Apertium-slv/stats|stems}} |
|||
|align="right"| {{#lst:Apertium-slv/stats|paradigms}} |
|||
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]] |
|||
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]]) |
|||
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Horvat, Čabrilo, Dimitrijev |
|||
|- |
|||
|| <code>[[apertium-tur]]</code> |
|||
|| [[Turkish]] |
|||
|| <code>tr</code> |
|||
|| <code>tur</code> |
|||
|| [[HFST]] |
|||
|| working |
|||
|align="right"| {{#lst:Apertium-tur/stats|stems}} |
|||
|align="right"| {{#lst:Apertium-tur/stats|paradigms}} |
|||
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]] |
|||
|| [[apertium-tur]] ([[languages]]) |
|||
|| [[User:Francis Tyers|Fran]], [[User:Zfe|Gianluca]], Sezgi Aydın |
|||
|- |
|||
|| <code>[[apertium-bul]]</code> |
|||
|| [[Bulgarian]] |
|||
|| <code>bg</code> |
|||
|| <code>bul</code> |
|||
|| [[lttoolbox]] |
|| [[lttoolbox]] |
||
|| |
|| production |
||
|align="right"| {{#lst:Apertium- |
|align="right"| {{#lst:Apertium-bul/stats|stems}} |
||
|align=" |
|align="right"| {{#lst:Apertium-bul/stats|paradigms}} |
||
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]] |
|||
|| [[apertium-rup]] ([[incubator]]) |
|||
|| [[apertium-bul]] ([[languages]]) |
|||
|| ? |
|||
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]] |
|||
|- |
|- |
||
|| <code>[[apertium-sqi]]</code> |
|| <code>[[apertium-sqi]]</code> |
||
Line 76: | Line 93: | ||
|| development |
|| development |
||
|align="right"| {{#lst:Apertium-sqi/stats|stems}} |
|align="right"| {{#lst:Apertium-sqi/stats|stems}} |
||
|align="right"| {{#lst:Apertium-sqi/stats|paradigms}} |
|||
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]] |
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]] |
||
|| [[apertium-sqi]] ([[languages]]) |
|| [[apertium-sqi]] ([[languages]]) |
||
|| [[User: Francis Tyers|Fran]] |
|||
|| ? |
|||
|- |
|- |
||
|| <code>[[apertium-ell]]</code> |
|| <code>[[apertium-ell]]</code> |
||
Line 87: | Line 105: | ||
|| ? |
|| ? |
||
|align="right"| {{#lst:Apertium-ell/stats|stems}} |
|align="right"| {{#lst:Apertium-ell/stats|stems}} |
||
|align="right"| {{#lst:Apertium-ell/stats|paradigms}} |
|||
|align="center"| - |
|align="center"| - |
||
|| [[apertium-ell]] ([[languages]]) |
|| [[apertium-ell]] ([[languages]]) |
||
|| [[User:Francis Tyers|Fran]] |
|||
|| ? |
|||
|- |
|- |
||
|| <code>[[apertium- |
|| <code>[[apertium-rup]]</code> |
||
|| [[ |
|| [[Aromanian]] |
||
|| <code> |
|| <code>-</code> |
||
|| <code> |
|| <code>rup</code> |
||
|| [[lttoolbox]] |
|| [[lttoolbox]] |
||
|| working |
|||
|align="right"| {{#lst:Apertium-hbs/stats|stems}} |
|||
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]] |
|||
|| [[apertium-hbs]] ([[languages]]) |
|||
|| ? |
|| ? |
||
|align="right"| {{#lst:Apertium-rup/stats|stems}} |
|||
|align="right"| {{#lst:Apertium-rup/stats|paradigms}} |
|||
|align="center"| - |
|||
|| [[apertium-rup]] ([[incubator]]) |
|||
|| [[User: Francis Tyers|Fran]], shopskasalata |
|||
|- |
|- |
||
|| <code>[[apertium- |
|| <code>[[apertium-ron]]</code> |
||
|| [[ |
|| [[Romanian]] |
||
|| <code> |
|| <code>ro</code> |
||
|| <code> |
|| <code>ron</code> |
||
|| [[ |
|| [[lttoolbox]] |
||
|| |
|| ? |
||
|align="right"| |
|align="right"| ? |
||
|align="right"| ? |
|||
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]] |
|||
|align="center"| - |
|||
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]]) |
|||
|| [[apertium-ron-rup]] ([[incubator]]) |
|||
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Aleš Horvat, Dejan Čabrilo, Ivica Dimitrijev |
|||
|| [[User: Francis Tyers|Fran]] |
|||
|- |
|||
|| <code>[[apertium-tur]]</code> |
|||
|| [[Turkish]] |
|||
|| <code>tr</code> |
|||
|| <code>tur</code> |
|||
|| [[?]] |
|||
|| working |
|||
|align="right"| {{#lst:Apertium-tur/stats|stems}} |
|||
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]] |
|||
|| [[apertium-tur]] ([[languages]]) |
|||
|| [[User:Francis Tyers|Fran]], [[Users:Zfe|Grossi]], Sezgi Aydın |
|||
|} |
|} |
||
=== Balkan Language Classification === |
=== Balkan Language Classification <sup><small>[https://en.wikipedia.org/wiki/Languages_of_the_Balkans]</small></sup> === |
||
* [[Albanian]] |
* [[Albanian]] |
||
** [[Arvanitika]] |
** [[Arvanitika]] |
||
Line 137: | Line 147: | ||
* [[Romance languages]] |
* [[Romance languages]] |
||
** [[Aromanian]] |
** [[Aromanian]] |
||
** [[Istriot]] |
** [[Istriot]] |
||
** [[Istro-Romanian]] |
** [[Istro-Romanian]] |
||
** [[Italian]] |
** [[Italian]] |
||
** [[Ladino]] |
|||
** [[Ladino]] (in Greece,Turkey,Bosnia,Serbia,Macedonia,Bulgaria) |
|||
** [[Megleno-Romanian]] |
** [[Megleno-Romanian]] |
||
** [[Romanian]] |
** [[Romanian]] |
||
** [[Moldovan]] |
** [[Moldovan]] |
||
* [[Slavic languages]] |
|||
** [[Western South Slavic]] |
|||
*** [[Serbo-Croatian]] |
|||
*** [[Slovenian]] |
|||
** [[Eastern South Slavic]] |
|||
*** [[Bulgarian]] |
|||
*** [[Macedonian]] |
|||
* [[Indo-Aryan languages]] |
|||
** [[Romani]] |
|||
=== Existing language pairs === |
=== Existing language pairs === |
||
Line 177: | Line 196: | ||
{| style="text-align: center;" class="wikitable" |
{| style="text-align: center;" class="wikitable" |
||
|- style="background: #ececec" |
|- style="background: #ececec" |
||
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur |
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur |
||
|- |
|||
| '''el''' || ''[[bg-el]]'' || || || || || || || || |
|||
|- |
|||
| '''ru''' || ''[[bg-ru]]'' || || || || || || ''[[hbs-rus]]'' || || |
|||
|- |
|||
| '''en''' || ''[[bg-en]]'' || '''[[mk-en]]''' || || || ''[[en-sq]]'' || ''[[ell-eng]]'' || ''[[sh-en]]'' || || ''[[tr-en]]'' |
|||
|- |
|||
| '''it''' || || || ''[[ro-it]]'' || || || || || ''[[sl-it]]'' || |
|||
|- |
|||
| '''spa''' || || || || || || || || ''[[slv-spa]]'' || |
|||
|- |
|||
| '''pol''' || || || || || || || || ''[[slv-pol]]'' || |
|||
|- |
|||
| '''eo''' || ''[[eo-bg]]'' || || || || || ''[[eo-el]]'' || || || |
|||
|- |
|||
| '''fr''' || || || ''[[fr-ro]]'' || || || || || || |
|||
|- |
|||
| '''ca''' || || || ''[[ca-ro]]'' || || || || || || |
|||
|- |
|||
| '''es''' || || || '''[[es-ro]]''' || || || || || || |
|||
|- |
|- |
||
| ''' |
| '''cs''' || || || || || || || || ''[[cs-sl]]'' || |
||
|- |
|- |
||
| ''' |
| '''kir''' || || || || || || || || || ''[[tur-kir]]'' |
||
|- |
|- |
||
| ''' |
| '''tat''' || || || || || || || || || ''[[tur-tat]]'' |
||
|- |
|- |
||
| ''' |
| '''uzb''' || || || || || || || || || ''[[tur-uzb]]'' |
||
|- |
|- |
||
| ''' |
| '''aze''' || || || || || || || || || [[tur-aze]] |
||
|- |
|- |
||
| ''' |
| '''cv''' || || || || || || || || || ''[[cv-tr]]'' |
||
|} |
|} |
||
Latest revision as of 08:56, 24 December 2013
The Balkan languages are the languages spoken in the Balkans, possibly forming part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, Serbo-Croatian, and a number of others.
The master plan involves generating independent finite-state transducers for each language, and then making individual dictionaries and transfer rules for every pair. The current status of these goals is listed below.
Status
The ultimate goal is to have multi-purpose transducers for a variety of Balkan languages. These can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and a dictionary and transfer rules for the pair X→Y. Development progress for each language's transducers and dictionary pairs is listed below.
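As a rough illustration of the plan above, the sketch below checks which components are still missing before a pair X→Y can be assembled. The `Resources` class, the `missing_for_pair` function, and the example inventory are hypothetical and do not describe the actual state of any Apertium module.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resources:
    """Hypothetical inventory of available components, keyed by ISO 639-3 code."""
    transducers: set[str] = field(default_factory=set)
    constraint_grammars: set[str] = field(default_factory=set)
    # (source, target) pairs that have a bilingual dictionary and transfer rules
    pair_data: set[tuple[str, str]] = field(default_factory=set)


def missing_for_pair(res: Resources, source: str, target: str) -> list[str]:
    """List what is still needed for source→target, following the plan in the text."""
    missing = []
    for lang in (source, target):
        if lang not in res.transducers:
            missing.append(f"transducer for {lang}")
    if source not in res.constraint_grammars:
        missing.append(f"constraint grammar for {source}")
    if (source, target) not in res.pair_data:
        missing.append(f"dictionary and transfer rules for {source}→{target}")
    return missing


# Made-up example inventory, for illustration only.
example = Resources(
    transducers={"bul", "mkd"},
    constraint_grammars={"mkd"},
    pair_data={("mkd", "bul")},
)
print(missing_for_pair(example, "mkd", "bul"))
print(missing_for_pair(example, "bul", "mkd"))
```

The point of the design is that the transducers are shared: only the constraint grammar and the pair data are specific to a translation direction.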
Transducers
Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can say it is "working". Above 90%, it can be considered "production".
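The thresholds above can be sketched in a few lines. This is a minimal sketch, assuming that coverage means the share of tokens receiving at least one analysis and that the analyser output is in Apertium stream format, where unknown words appear as `^word/*word$`. The function names, the exact cut-off at 80, and the "development" label for lower values are assumptions, not part of the original page.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")


def coverage(analysed_text: str) -> float:
    """Percentage of tokens that received at least one analysis."""
    total = known = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        analyses = match.group(2)
        # Unknown words carry a single pseudo-analysis starting with '*'.
        if analyses and not analyses.startswith("/*"):
            known += 1
    return 100.0 * known / total if total else 0.0


def state(coverage_percent: float) -> str:
    """Map a coverage figure to a state label (thresholds taken from the text)."""
    if coverage_percent > 90:
        return "production"
    if coverage_percent >= 80:  # the text says "~80%"; the exact cut-off is assumed
        return "working"
    return "development"  # assumed label for anything below "working"


sample = "^kuća/kuća<n><f><sg><nom>$ ^xyz/*xyz$ ^je/biti<vbser>$ ^velika/velik<adj>$"
print(coverage(sample), state(coverage(sample)))
```

In practice the figure should be averaged over several corpora, as the text requires, rather than taken from a single sample.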
Balkan Language Classification [1]
Existing language pairs
Balkan-Balkan pairs
Text in italic denotes language pairs under development / in the incubator. Regular text denotes a functioning language pair in staging, while text in bold denotes a stable, well-working language pair in trunk.
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| bul | - | mk-bg | | | | bg-el | | | |
| mkd | | - | | | mk-sq | | | | |
| ron | | | - | ron-rup | | | | | |
| rup | | | | - | | | | | |
| sqi | | | | | - | | | | |
| ell | | | | | | - | | | |
| hbs | | sh-mk | | | | | - | hbs-slv | |
| slv | | sl-mk | | | | | | - | |
| tur | | | | | | | | | - |
Pairs with non-Balkan languages
| | bul | mkd | ron | rup | sqi | ell | hbs | slv | tur |
|---|---|---|---|---|---|---|---|---|---|
| el | bg-el | | | | | | | | |
| ru | bg-ru | | | | | | hbs-rus | | |
| en | bg-en | mk-en | | | en-sq | ell-eng | sh-en | | tr-en |
| it | | | ro-it | | | | | sl-it | |
| spa | | | | | | | | slv-spa | |
| pol | | | | | | | | slv-pol | |
| eo | eo-bg | | | | | eo-el | | | |
| fr | | | fr-ro | | | | | | |
| ca | | | ca-ro | | | | | | |
| es | | | es-ro | | | | | | |
| cs | | | | | | | | cs-sl | |
| kir | | | | | | | | | tur-kir |
| tat | | | | | | | | | tur-tat |
| uzb | | | | | | | | | tur-uzb |
| aze | | | | | | | | | tur-aze |
| cv | | | | | | | | | cv-tr |
Existing
Monolingual
Language | Module | Paradigms | Lemmata | Coverage (SETimes) | Coverage (Wikipedia) |
---|---|---|---|---|---|
Bulgarian | Macedonian and Bulgarian | 305 | 7873 | 88.1% | 77.15% |
Macedonian | Macedonian and Bulgarian | 225 | 8094 | 92.1% | |
Romanian | Spanish and Romanian | 997 | 18719 | 89.7% | 83.62% |
Aromanian | Incubator | 17 | 28 | - | |
Albanian | Incubator | 127 | 3302 | 80.2% | 65.62% |
Greek | Incubator | 377 | 859 | 49.4% | 49.75% |
Serbo-Croatian | Incubator | 85 | 660 | - | |
Slovenian | Incubator | 1128 | 20385 | - | |
Turkish | (external: TRMorph) | - | 37101 |
Languages missing: Romani