Difference between revisions of "User:Sushain/BalkanLangsConvert"

From Apertium
 
(22 intermediate revisions by the same user not shown)
Line 18: Line 18:
!rowspan=2| state
!rowspan=2| state
!rowspan=2| stems
!rowspan=2| stems
!rowspan=2| paradigms
!rowspan=2| coverage
!rowspan=2| coverage
!rowspan=2| location
!rowspan=2| location
Line 24: Line 25:
! -2
! -2
! -3
! -3
|-
|| <code>[[apertium-bul]]</code>
|| [[Bulgarian]]
|| <code>bg</code>
|| <code>bul</code>
|| [[?]]
|| production
|align="right"| {{#lst:Apertium-bul/stats|stems}}
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]]
|| [[apertium-bul]] ([[languages]])
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]]
|-
|-
|| <code>[[apertium-mkd]]</code>
|| <code>[[apertium-mkd]]</code>
Line 40: Line 30:
|| <code>mk</code>
|| <code>mk</code>
|| <code>mkd</code>
|| <code>mkd</code>
|| [[?]]
|| [[lttoolbox]]
|| production
|| production
|align="right"| {{#lst:Apertium-mkd/stats|stems}}
|align="right"| {{#lst:Apertium-mkd/stats|stems}}
|align="right"| {{#lst:Apertium-mkd/stats|paradigms}}
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]]
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]]
|| [[apertium-mkd]] ([[languages]])
|| [[apertium-mkd]] ([[languages]])
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]]
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]]
|-
|| <code>[[apertium-rup]]</code>
|| [[Aromanian]]
|| <code>-</code>
|| <code>rup</code>
|| [[?]]
|| ?
|align="right"| {{#lst:Apertium-rup/stats|stems}}
|align="center"| -
|| [[apertium-rup]] ([[incubator]])
|| ?
|-
|| <code>[[apertium-sqi]]</code>
|| [[Albanian]]
|| <code>sq</code>
|| <code>sqi</code>
|| [[?]]
|| development
|align="right"| {{#lst:Apertium-sqi/stats|stems}}
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]]
|| [[apertium-sqi]] ([[languages]])
|| ?
|-
|| <code>[[apertium-ell]]</code>
|| [[Greek]]
|| <code>el</code>
|| <code>ell</code>
|| [[?]]
|| ?
|align="right"| {{#lst:Apertium-ell/stats|stems}}
|align="center"| -
|| [[apertium-ell]] ([[languages]])
|| ?
|-
|-
|| <code>[[apertium-hbs]]</code>
|| <code>[[apertium-hbs]]</code>
Line 84: Line 42:
|| <code>sh</code>
|| <code>sh</code>
|| <code>hbs</code>
|| <code>hbs</code>
|| [[?]]
|| [[lttoolbox]]
|| working
|| working
|align="right"| {{#lst:Apertium-hbs/stats|stems}}
|align="right"| {{#lst:Apertium-hbs/stats|stems}}
|align="right"| {{#lst:Apertium-hbs/stats|paradigms}}
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]]
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]]
|| [[apertium-hbs]] ([[languages]])
|| [[apertium-hbs]] ([[languages]])
|| [[User: Francis Tyers|Fran]]
|| ?
|-
|-
|| <code>[[apertium-slv]]</code>
|| <code>[[apertium-slv]]</code>
|| [[Slovenian]]
|| [[Slovenian]]
|| <code>sl</code>
|| <code>sl</code>
|| <code>slv</code>
|| <code>slv</code>
|| [[?]]
|| [[lttoolbox]]
|| production
|| production
|align="right"| {{#lst:Apertium-slv/stats|stems}}
|align="right"| {{#lst:Apertium-slv/stats|stems}}
|align="right"| {{#lst:Apertium-slv/stats|paradigms}}
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]]
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]]
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]])
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]])
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Aleš Horvat, Dejan Čabrilo, Ivica Dimitrijev
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Horvat, Čabrilo, Dimitrijev
|-
|-
|| <code>[[apertium-tur]]</code>
|| <code>[[apertium-tur]]</code>
Line 106: Line 66:
|| <code>tr</code>
|| <code>tr</code>
|| <code>tur</code>
|| <code>tur</code>
|| [[?]]
|| [[HFST]]
|| working
|| working
|align="right"| {{#lst:Apertium-tur/stats|stems}}
|align="right"| {{#lst:Apertium-tur/stats|stems}}
|align="right"| {{#lst:Apertium-tur/stats|paradigms}}
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]]
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]]
|| [[apertium-tur]] ([[languages]])
|| [[apertium-tur]] ([[languages]])
|| [[User:Francis Tyers|Fran]], [[Users:Zfe|Grossi]], Sezgi Aydın
|| [[User:Francis Tyers|Fran]], [[User:Zfe|Gianluca]], Sezgi Aydın
|-
|| <code>[[apertium-bul]]</code>
|| [[Bulgarian]]
|| <code>bg</code>
|| <code>bul</code>
|| [[lttoolbox]]
|| production
|align="right"| {{#lst:Apertium-bul/stats|stems}}
|align="right"| {{#lst:Apertium-bul/stats|paradigms}}
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]]
|| [[apertium-bul]] ([[languages]])
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]]
|-
|| <code>[[apertium-sqi]]</code>
|| [[Albanian]]
|| <code>sq</code>
|| <code>sqi</code>
|| [[lttoolbox]]
|| development
|align="right"| {{#lst:Apertium-sqi/stats|stems}}
|align="right"| {{#lst:Apertium-sqi/stats|paradigms}}
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]]
|| [[apertium-sqi]] ([[languages]])
|| [[User: Francis Tyers|Fran]]
|-
|| <code>[[apertium-ell]]</code>
|| [[Greek]]
|| <code>el</code>
|| <code>ell</code>
|| [[lttoolbox]]
|| ?
|align="right"| {{#lst:Apertium-ell/stats|stems}}
|align="right"| {{#lst:Apertium-ell/stats|paradigms}}
|align="center"| -
|| [[apertium-ell]] ([[languages]])
|| [[User:Francis Tyers|Fran]]
|-
|| <code>[[apertium-rup]]</code>
|| [[Aromanian]]
|| <code>-</code>
|| <code>rup</code>
|| [[lttoolbox]]
|| ?
|align="right"| {{#lst:Apertium-rup/stats|stems}}
|align="right"| {{#lst:Apertium-rup/stats|paradigms}}
|align="center"| -
|| [[apertium-rup]] ([[incubator]])
|| [[User: Francis Tyers|Fran]], shopskasalata
|-
|| <code>[[apertium-ron]]</code>
|| [[Romanian]]
|| <code>ro</code>
|| <code>ron</code>
|| [[lttoolbox]]
|| ?
|align="right"| ?
|align="right"| ?
|align="center"| -
|| [[apertium-ron-rup]] ([[incubator]])
|| [[User: Francis Tyers|Fran]]
|}
|}


=== Balkan Language Classification ===
=== Balkan Language Classification <sup><small>[https://en.wikipedia.org/wiki/Languages_of_the_Balkans]</small></sup> ===
* [[Albanian]]
???
** [[Arvanitika]]

** [[Gheg]]
** [[Tosk]]
* [[Hellenic languages]]
** [[Cappadocian Greek]]
** [[Greek|Standard Greek]]
** [[Pontic Greek]]
** [[Tsakonian]]
* [[Romance languages]]
** [[Aromanian]]
** [[Istriot]]
** [[Istro-Romanian]]
** [[Italian]]
** [[Ladino]]
** [[Megleno-Romanian]]
** [[Romanian]]
** [[Moldovan]]
* [[Slavic languages]]
** [[Western South Slavic]]
*** [[Serbo-Croatian]]
*** [[Slovenian]]
** [[Eastern South Slavic]]
*** [[Bulgarian]]
*** [[Macedonian]]
* [[Indo-Aryan languages]]
** [[Romani]]


=== Existing language pairs ===
=== Existing language pairs ===
Line 150: Line 196:
{| style="text-align: center;" class="wikitable"
{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
|- style="background: #ececec"
! !! eng !! asm !! epo !! pes
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
|-
|-
| '''hin''' || [[eng-hin]] || [[as-hi]] || ||
| '''el''' || ''[[bg-el]]'' || || || || || || || ||
|-
|-
| '''ben''' || [[bn-en]] || || ||
| '''ru''' || ''[[bg-ru]]'' || || || || || || ''[[hbs-rus]]'' || ||
|-
|-
| '''urd''' || || || || [[ur-fa]]
| '''en''' || ''[[bg-en]]'' || '''[[mk-en]]''' || || || ''[[en-sq]]'' || ''[[ell-eng]]'' || ''[[sh-en]]'' || || ''[[tr-en]]''
|-
|-
| '''san''' || || || ||
| '''it''' || || || ''[[ro-it]]'' || || || || || ''[[sl-it]]'' ||
|-
|-
| '''nep''' || [[ne-en]] || || [[eo-ne]] ||
| '''spa''' || || || || || || || || ''[[slv-spa]]'' ||
|-
|-
| '''mar''' || [[mar-eng]] || || ||
| '''pol''' || || || || || || || || ''[[slv-pol]]'' ||
|-
|-
| '''pan''' || || || ||
| '''eo''' || ''[[eo-bg]]'' || || || || || ''[[eo-el]]'' || || ||
|-
| '''fr''' || || || ''[[fr-ro]]'' || || || || || ||
|-
| '''ca''' || || || ''[[ca-ro]]'' || || || || || ||
|-
| '''es''' || || || '''[[es-ro]]''' || || || || || ||
|-
| '''cs''' || || || || || || || || ''[[cs-sl]]'' ||
|-
| '''kir''' || || || || || || || || || ''[[tur-kir]]''
|-
| '''tat''' || || || || || || || || || ''[[tur-tat]]''
|-
| '''uzb''' || || || || || || || || || ''[[tur-uzb]]''
|-
| '''aze''' || || || || || || || || || [[tur-aze]]
|-
| '''cv''' || || || || || || || || || ''[[cv-tr]]''
|}
|}

==Existing==
==Existing==



Latest revision as of 08:56, 24 December 2013

The Balkan languages are the languages spoken in the Balkans, some of which may form part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, Serbo-Croatian, and a number of others.

The master plan is to build an independent finite-state transducer for each language, and then to write a bilingual dictionary and transfer rules for every language pair. The current status of these goals is listed below.
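As a rough illustration of the scale this implies, the sketch below counts the unordered pairs among the nine language codes used in the tables on this page and subtracts the Balkan-Balkan pairs already listed there. The names `LANGS` and `EXISTING` are hypothetical, and the pair names (for example mk-bg or sh-mk) have been normalised by hand to ISO 639-3 codes, which is an assumption of this sketch rather than anything Apertium provides.

```python
from itertools import combinations

# ISO 639-3 codes of the nine languages in the tables on this page.
LANGS = ["bul", "mkd", "ron", "rup", "sqi", "ell", "hbs", "slv", "tur"]

# Balkan-Balkan pairs listed on this page, normalised by hand to
# ISO 639-3 codes (mk-bg, bg-el, mk-sq, ron-rup, sh-mk, hbs-slv, sl-mk).
EXISTING = {
    frozenset(pair)
    for pair in [
        ("mkd", "bul"),
        ("bul", "ell"),
        ("mkd", "sqi"),
        ("ron", "rup"),
        ("hbs", "mkd"),
        ("hbs", "slv"),
        ("slv", "mkd"),
    ]
}

# Every unordered pair of distinct languages needs its own
# bilingual dictionary and transfer rules.
all_pairs = {frozenset(pair) for pair in combinations(LANGS, 2)}

# Pairs for which nothing is listed yet.
missing = sorted(tuple(sorted(pair)) for pair in all_pairs - EXISTING)

print(len(all_pairs))  # 36 unordered pairs in total
print(len(EXISTING))   # 7 pairs already listed
print(len(missing))    # 29 pairs with no listed work
```

Counting unordered pairs treats X→Y and Y→X as one unit of work; if each direction is counted separately, the totals double.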

Status

The ultimate goal is to have reusable, multi-purpose transducers for a variety of Balkan languages. Any two of them can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and a bilingual dictionary plus transfer rules for the pair X→Y. Development progress for each language's transducer and for the language pairs is listed below.

Transducers

Once a transducer reaches roughly 80% coverage on a range of medium-to-large corpora, we consider it "working". Above 90%, it can be considered "production".
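The thresholds above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, coverage is assumed to mean the percentage of corpus tokens that receive at least one analysis, and the label "development" for anything below 80% is an assumption borrowed from the state column of the table. The states recorded in the table do not always follow these thresholds (apertium-hbs, for example, is listed as working at ~90.5%).

```python
def naive_coverage(analysed_tokens: int, total_tokens: int) -> float:
    """Return coverage as a percentage of tokens with at least one analysis.

    Assumption: this is the coverage measure meant on this page.
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return 100.0 * analysed_tokens / total_tokens


def classify_state(coverage_percent: float) -> str:
    """Map a coverage percentage to a state label.

    Over 90% is "production", roughly 80% and up is "working";
    "development" for anything lower is an assumed label.
    """
    if coverage_percent > 90.0:
        return "production"
    if coverage_percent >= 80.0:
        return "working"
    return "development"


# Coverage figures taken from the table on this page.
for name, coverage in [("apertium-mkd", 90.5), ("apertium-tur", 87.3)]:
    print(name, classify_state(coverage))
```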

{| class="wikitable"
!rowspan=2| name
!rowspan=2| Language
!colspan=2| ISO 639
!rowspan=2| formalism
!rowspan=2| state
!rowspan=2| stems
!rowspan=2| paradigms
!rowspan=2| coverage
!rowspan=2| location
!rowspan=2| primary authors
|-
! -2
! -3
|-
| apertium-mkd || Macedonian || mk || mkd || lttoolbox || production || 30,686 || 260 || ~90.5% || apertium-mkd (languages) || Fran, Tihomir, Petkovski
|-
| apertium-hbs || Serbo-Croatian || sh || hbs || lttoolbox || working || 58,004 || 1,092 || ~90.5% || apertium-hbs (languages) || Fran
|-
| apertium-slv || Slovenian || sl || slv || lttoolbox || production || 20,596 || 1,435 || ~90.5% || apertium-hbs-slv (trunk)<br />apertium-slv-pol (incubator)<br />apertium-sl-mk (incubator) || Fran, Petkovski, Peradin, Horvat, Čabrilo, Dimitrijev
|-
| apertium-tur || Turkish || tr || tur || HFST || working || 17,221 || 1 || ~87.3% || apertium-tur (languages) || Fran, Gianluca, Sezgi Aydın
|-
| apertium-bul || Bulgarian || bg || bul || lttoolbox || production || 8,578 || 317 || ~80% || apertium-bul (languages) || Fran, Tihomir
|-
| apertium-sqi || Albanian || sq || sqi || lttoolbox || development || 3,312 || 138 || ~80.2% || apertium-sqi (languages) || Fran
|-
| apertium-ell || Greek || el || ell || lttoolbox || ? || 2,460 || 951 || - || apertium-ell (languages) || Fran
|-
| apertium-rup || Aromanian || - || rup || lttoolbox || ? || 312,005 || 26,193 || - || apertium-rup (incubator) || Fran, shopskasalata
|-
| apertium-ron || Romanian || ro || ron || lttoolbox || ? || ? || ? || - || apertium-ron-rup (incubator) || Fran
|}

Balkan Language Classification [1]

Existing language pairs

Balkan-Balkan pairs

Text in italics denotes a language pair under development or in the incubator. Regular text denotes a functioning language pair in staging, and text in bold denotes a stable, well-functioning language pair in trunk.

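{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
|-
| '''bul''' || - || mk-bg || || || || bg-el || || ||
|-
| '''mkd''' || || - || || || mk-sq || || || ||
|-
| '''ron''' || || || - || ron-rup || || || || ||
|-
| '''rup''' || || || || - || || || || ||
|-
| '''sqi''' || || || || || - || || || ||
|-
| '''ell''' || || || || || || - || || ||
|-
| '''hbs''' || || sh-mk || || || || || - || hbs-slv ||
|-
| '''slv''' || || sl-mk || || || || || || - ||
|-
| '''tur''' || || || || || || || || || -
|}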

Pairs with non-Balkan languages

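{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
|-
| '''el''' || ''bg-el'' || || || || || || || ||
|-
| '''ru''' || ''bg-ru'' || || || || || || ''hbs-rus'' || ||
|-
| '''en''' || ''bg-en'' || '''mk-en''' || || || ''en-sq'' || ''ell-eng'' || ''sh-en'' || || ''tr-en''
|-
| '''it''' || || || ''ro-it'' || || || || || ''sl-it'' ||
|-
| '''spa''' || || || || || || || || ''slv-spa'' ||
|-
| '''pol''' || || || || || || || || ''slv-pol'' ||
|-
| '''eo''' || ''eo-bg'' || || || || || ''eo-el'' || || ||
|-
| '''fr''' || || || ''fr-ro'' || || || || || ||
|-
| '''ca''' || || || ''ca-ro'' || || || || || ||
|-
| '''es''' || || || '''es-ro''' || || || || || ||
|-
| '''cs''' || || || || || || || || ''cs-sl'' ||
|-
| '''kir''' || || || || || || || || || ''tur-kir''
|-
| '''tat''' || || || || || || || || || ''tur-tat''
|-
| '''uzb''' || || || || || || || || || ''tur-uzb''
|-
| '''aze''' || || || || || || || || || tur-aze
|-
| '''cv''' || || || || || || || || || ''cv-tr''
|}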

Existing

Monolingual

{| class="wikitable"
! Language !! Module !! Paradigms !! Lemmata !! Coverage (SETimes) !! Coverage (Wikipedia)
|-
| Bulgarian || Macedonian and Bulgarian || 305 || 7873 || 88.1% || 77.15%
|-
| Macedonian || Macedonian and Bulgarian || 225 || 8094 || 92.1% ||
|-
| Romanian || Spanish and Romanian || 997 || 18719 || 89.7% || 83.62%
|-
| Aromanian || Incubator || 17 || 28 || - ||
|-
| Albanian || Incubator || 127 || 3302 || 80.2% || 65.62%
|-
| Greek || Incubator || 377 || 859 || 49.4% || 49.75%
|-
| Serbo-Croatian || Incubator || 85 || 660 || - ||
|-
| Slovenian || Incubator || 1128 || 20385 || - ||
|-
| Turkish || (external: TRMorph) || - || 37101 || ||
|}

Languages missing: Romani

Bilingual

Language pairs

See also

External links