Difference between revisions of "User:Sushain/BalkanLangsConvert"

From Apertium
 
(19 intermediate revisions by the same user not shown)
Line 18: Line 18:
 
!rowspan=2| state
 
!rowspan=2| state
 
!rowspan=2| stems
 
!rowspan=2| stems
  +
!rowspan=2| paradigms
 
!rowspan=2| coverage
 
!rowspan=2| coverage
 
!rowspan=2| location
 
!rowspan=2| location
Line 24: Line 25:
 
! -2
 
! -2
 
! -3
 
! -3
|-
 
|| <code>[[apertium-bul]]</code>
 
|| [[Bulgarian]]
 
|| <code>bg</code>
 
|| <code>bul</code>
 
|| [[?]]
 
|| production
 
|align="right"| {{#lst:Apertium-bul/stats|stems}}
 
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]]
 
|| [[apertium-bul]] ([[languages]])
 
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]]
 
 
|-
 
|-
 
|| <code>[[apertium-mkd]]</code>
 
|| <code>[[apertium-mkd]]</code>
Line 40: Line 30:
 
|| <code>mk</code>
 
|| <code>mk</code>
 
|| <code>mkd</code>
 
|| <code>mkd</code>
|| [[?]]
+
|| [[lttoolbox]]
 
|| production
 
|| production
 
|align="right"| {{#lst:Apertium-mkd/stats|stems}}
 
|align="right"| {{#lst:Apertium-mkd/stats|stems}}
  +
|align="right"| {{#lst:Apertium-mkd/stats|paradigms}}
 
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]]
 
|align="center"| [[Apertium-mkd#Current_State|~{{:Apertium-mkd/stats/average}}%]]
 
|| [[apertium-mkd]] ([[languages]])
 
|| [[apertium-mkd]] ([[languages]])
 
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]]
 
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]], [[User:Fpetkovski|Petkovski]]
|-
 
|| <code>[[apertium-ron]]</code>
 
|| [[Romanian]]
 
|| <code>ro</code>
 
|| <code>ron</code>
 
|| [[?]]
 
|| ?
 
|align="right"| ?
 
|align="center"| ?
 
|| [[apertium-ron-rup]] ([[incubator]])
 
|| ?
 
|-
 
|| <code>[[apertium-rup]]</code>
 
|| [[Aromanian]]
 
|| <code>-</code>
 
|| <code>rup</code>
 
|| [[?]]
 
|| ?
 
|align="right"| {{#lst:Apertium-rup/stats|stems}}
 
|align="center"| -
 
|| [[apertium-rup]] ([[incubator]])
 
|| ?
 
|-
 
|| <code>[[apertium-sqi]]</code>
 
|| [[Albanian]]
 
|| <code>sq</code>
 
|| <code>sqi</code>
 
|| [[?]]
 
|| development
 
|align="right"| {{#lst:Apertium-sqi/stats|stems}}
 
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]]
 
|| [[apertium-sqi]] ([[languages]])
 
|| ?
 
|-
 
|| <code>[[apertium-ell]]</code>
 
|| [[Greek]]
 
|| <code>el</code>
 
|| <code>ell</code>
 
|| [[?]]
 
|| ?
 
|align="right"| {{#lst:Apertium-ell/stats|stems}}
 
|align="center"| -
 
|| [[apertium-ell]] ([[languages]])
 
|| ?
 
 
|-
 
|-
 
|| <code>[[apertium-hbs]]</code>
 
|| <code>[[apertium-hbs]]</code>
Line 95: Line 42:
 
|| <code>sh</code>
 
|| <code>sh</code>
 
|| <code>hbs</code>
 
|| <code>hbs</code>
|| [[?]]
+
|| [[lttoolbox]]
 
|| working
 
|| working
 
|align="right"| {{#lst:Apertium-hbs/stats|stems}}
 
|align="right"| {{#lst:Apertium-hbs/stats|stems}}
  +
|align="right"| {{#lst:Apertium-hbs/stats|paradigms}}
 
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]]
 
|align="center"| [[Apertium-hbs#Current_State|~{{:Apertium-hbs/stats/average}}%]]
 
|| [[apertium-hbs]] ([[languages]])
 
|| [[apertium-hbs]] ([[languages]])
  +
|| [[User: Francis Tyers|Fran]]
|| ?
 
 
|-
 
|-
 
|| <code>[[apertium-slv]]</code>
 
|| <code>[[apertium-slv]]</code>
 
|| [[Slovenian]]
 
|| [[Slovenian]]
|| <code>sl</code>
+
|| <code>sl</code>
|| <code>slv</code>
+
|| <code>slv</code>
|| [[?]]
+
|| [[lttoolbox]]
 
|| production
 
|| production
 
|align="right"| {{#lst:Apertium-slv/stats|stems}}
 
|align="right"| {{#lst:Apertium-slv/stats|stems}}
  +
|align="right"| {{#lst:Apertium-slv/stats|paradigms}}
 
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]]
 
|align="center"| [[Apertium-slv#Current_State|~{{:Apertium-slv/stats/average}}%]]
 
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]])
 
|| [[apertium-hbs-slv]] ([[trunk]])<br />[[apertium-slv-pol]] ([[incubator]])<br />[[apertium-sl-mk]] ([[incubator]])
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Aleš Horvat, Dejan Čabrilo, Ivica Dimitrijev
+
|| [[User:Francis Tyers|Fran]], [[User:Fpetkovski|Petkovski]], [[User:Krvoje|Peradin]], Horvat, Čabrilo, Dimitrijev
 
|-
 
|-
 
|| <code>[[apertium-tur]]</code>
 
|| <code>[[apertium-tur]]</code>
Line 117: Line 66:
 
|| <code>tr</code>
 
|| <code>tr</code>
 
|| <code>tur</code>
 
|| <code>tur</code>
|| [[?]]
+
|| [[HFST]]
 
|| working
 
|| working
 
|align="right"| {{#lst:Apertium-tur/stats|stems}}
 
|align="right"| {{#lst:Apertium-tur/stats|stems}}
  +
|align="right"| {{#lst:Apertium-tur/stats|paradigms}}
 
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]]
 
|align="center"| [[Apertium-tur#Current_State|~{{:Apertium-tur/stats/average}}%]]
 
|| [[apertium-tur]] ([[languages]])
 
|| [[apertium-tur]] ([[languages]])
|| [[User:Francis Tyers|Fran]], [[Users:Zfe|Grossi]], Sezgi Aydın
+
|| [[User:Francis Tyers|Fran]], [[User:Zfe|Gianluca]], Sezgi Aydın
  +
|-
  +
|| <code>[[apertium-bul]]</code>
  +
|| [[Bulgarian]]
  +
|| <code>bg</code>
  +
|| <code>bul</code>
  +
|| [[lttoolbox]]
  +
|| production
  +
|align="right"| {{#lst:Apertium-bul/stats|stems}}
  +
|align="right"| {{#lst:Apertium-bul/stats|paradigms}}
  +
|align="center"| [[Apertium-bul#Current_State|~{{:Apertium-bul/stats/average}}%]]
  +
|| [[apertium-bul]] ([[languages]])
  +
|| [[User:Francis Tyers|Fran]], [[User:Tihomir|Tihomir]]
  +
|-
  +
|| <code>[[apertium-sqi]]</code>
  +
|| [[Albanian]]
  +
|| <code>sq</code>
  +
|| <code>sqi</code>
  +
|| [[lttoolbox]]
  +
|| development
  +
|align="right"| {{#lst:Apertium-sqi/stats|stems}}
  +
|align="right"| {{#lst:Apertium-sqi/stats|paradigms}}
  +
|align="center"| [[Apertium-sqi#Current_State|~{{:Apertium-sqi/stats/average}}%]]
  +
|| [[apertium-sqi]] ([[languages]])
  +
|| [[User: Francis Tyers|Fran]]
  +
|-
  +
|| <code>[[apertium-ell]]</code>
  +
|| [[Greek]]
  +
|| <code>el</code>
  +
|| <code>ell</code>
  +
|| [[lttoolbox]]
  +
|| ?
  +
|align="right"| {{#lst:Apertium-ell/stats|stems}}
  +
|align="right"| {{#lst:Apertium-ell/stats|paradigms}}
  +
|align="center"| -
  +
|| [[apertium-ell]] ([[languages]])
  +
|| [[User:Francis Tyers|Fran]]
  +
|-
  +
|| <code>[[apertium-rup]]</code>
  +
|| [[Aromanian]]
  +
|| <code>-</code>
  +
|| <code>rup</code>
  +
|| [[lttoolbox]]
  +
|| ?
  +
|align="right"| {{#lst:Apertium-rup/stats|stems}}
  +
|align="right"| {{#lst:Apertium-rup/stats|paradigms}}
  +
|align="center"| -
  +
|| [[apertium-rup]] ([[incubator]])
  +
|| [[User: Francis Tyers|Fran]], shopskasalata
  +
|-
  +
|| <code>[[apertium-ron]]</code>
  +
|| [[Romanian]]
  +
|| <code>ro</code>
  +
|| <code>ron</code>
  +
|| [[lttoolbox]]
  +
|| ?
  +
|align="right"| ?
  +
|align="right"| ?
  +
|align="center"| -
  +
|| [[apertium-ron-rup]] ([[incubator]])
  +
|| [[User: Francis Tyers|Fran]]
 
|}
 
|}
   
=== Balkan Language Classification ===
+
=== Balkan Language Classification <sup><small>[https://en.wikipedia.org/wiki/Languages_of_the_Balkans]</small></sup> ===
  +
* [[Albanian]]
???
 
  +
** [[Arvanitika]]
 
  +
** [[Gheg]]
  +
** [[Tosk]]
  +
* [[Hellenic languages]]
  +
** [[Cappadocian Greek]]
  +
** [[Greek|Standard Greek]]
  +
** [[Pontic Greek]]
  +
** [[Tsakonian]]
  +
* [[Romance languages]]
  +
** [[Aromanian]]
  +
** [[Istriot]]
  +
** [[Istro-Romanian]]
  +
** [[Italian]]
  +
** [[Ladino]]
  +
** [[Megleno-Romanian]]
  +
** [[Romanian]]
  +
** [[Moldovan]]
  +
* [[Slavic languages]]
  +
** [[Western South Slavic]]
  +
*** [[Serbo-Croatian]]
  +
*** [[Slovenian]]
  +
** [[Eastern South Slavic]]
  +
*** [[Bulgarian]]
  +
*** [[Macedonian]]
  +
* [[Indo-Aryan languages]]
  +
** [[Romani]]
   
 
=== Existing language pairs ===
 
=== Existing language pairs ===
Line 161: Line 196:
 
{| style="text-align: center;" class="wikitable"
 
{| style="text-align: center;" class="wikitable"
 
|- style="background: #ececec"
 
|- style="background: #ececec"
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
+
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
  +
|-
  +
| '''el''' || ''[[bg-el]]'' || || || || || || || ||
  +
|-
  +
| '''ru''' || ''[[bg-ru]]'' || || || || || || ''[[hbs-rus]]'' || ||
  +
|-
  +
| '''en''' || ''[[bg-en]]'' || '''[[mk-en]]''' || || || ''[[en-sq]]'' || ''[[ell-eng]]'' || ''[[sh-en]]'' || || ''[[tr-en]]''
  +
|-
  +
| '''it''' || || || ''[[ro-it]]'' || || || || || ''[[sl-it]]'' ||
  +
|-
  +
| '''spa''' || || || || || || || || ''[[slv-spa]]'' ||
  +
|-
  +
| '''pol''' || || || || || || || || ''[[slv-pol]]'' ||
  +
|-
  +
| '''eo''' || ''[[eo-bg]]'' || || || || || ''[[eo-el]]'' || || ||
  +
|-
  +
| '''fr''' || || || ''[[fr-ro]]'' || || || || || ||
  +
|-
  +
| '''ca''' || || || ''[[ca-ro]]'' || || || || || ||
  +
|-
  +
| '''es''' || || || '''[[es-ro]]''' || || || || || ||
  +
|-
  +
| '''cs''' || || || || || || || || ''[[cs-sl]]'' ||
  +
|-
  +
| '''kir''' || || || || || || || || || ''[[tur-kir]]''
  +
|-
  +
| '''tat''' || || || || || || || || || ''[[tur-tat]]''
 
|-
 
|-
| '''el''' || ''[[bg-el]]'' || || || || || || || ||
+
| '''uzb''' || || || || || || || || || ''[[tur-uzb]]''
 
|-
 
|-
| '''ru''' || ''[[bg-ru]]'' || || || || || || || ||
+
| '''aze''' || || || || || || || || || [[tur-aze]]
 
|-
 
|-
| '''en''' || ''[[bg-en]]'' || '''[[mk-en]]''' || || || || || || ||
+
| '''cv''' || || || || || || || || || ''[[cv-tr]]''
 
|}
 
|}
   

Latest revision as of 08:56, 24 December 2013

The Balkan languages are the languages spoken in the Balkans, some of which may form part of the Balkan Sprachbund. They include Bulgarian, Macedonian, Romanian, Aromanian, Albanian, Greek, Serbo-Croatian, and a number of others.

The plan is to build an independent finite-state transducer for each language, and then to write a bilingual dictionary and transfer rules for every language pair. The current status of these goals is listed below.

Status

The ultimate goal is a set of reusable transducers for a range of Balkan languages. Any two of them can then be paired for X→Y translation by adding a constraint grammar (CG) for language X and transfer rules and a bilingual dictionary for the pair X→Y. Development progress for each language's transducer and for each pair's dictionary is listed below.
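The component list in the paragraph above can be read as a checklist for each translation direction. The following is a minimal sketch of that checklist, not part of any Apertium tool: the `Resources` class, the `missing_for` function and the label strings are hypothetical names, and treating the bilingual dictionary as shared between both directions while transfer rules are per direction is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    """Hypothetical inventory of what has been built so far."""
    transducers: set = field(default_factory=set)     # language codes, e.g. {"bul", "mkd"}
    cgs: set = field(default_factory=set)             # languages that have a constraint grammar
    dictionaries: set = field(default_factory=set)    # unordered pairs: frozenset({"bul", "mkd"})
    transfer_rules: set = field(default_factory=set)  # ordered pairs: ("bul", "mkd")


def missing_for(src, tgt, res):
    """Return the components still needed before src -> tgt can be attempted."""
    missing = []
    # Both languages need their own transducer.
    for lang in (src, tgt):
        if lang not in res.transducers:
            missing.append(f"transducer:{lang}")
    # Only the source language needs a CG.
    if src not in res.cgs:
        missing.append(f"cg:{src}")
    # Assumption: one dictionary serves both directions of a pair.
    if frozenset((src, tgt)) not in res.dictionaries:
        missing.append(f"dictionary:{src}-{tgt}")
    # Assumption: transfer rules are written per direction.
    if (src, tgt) not in res.transfer_rules:
        missing.append(f"transfer:{src}-{tgt}")
    return missing
```

An empty list from `missing_for("mkd", "bul", res)` would mean that every component named above exists for that direction; it says nothing about translation quality.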

Transducers

Once a transducer reaches ~80% coverage on a range of medium-to-large corpora, we can call it "working". Above 90%, it can be considered "production".
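As a rough illustration of the thresholds above, the sketch below computes coverage as the share of tokens that received at least one analysis and maps the result to a state label. It assumes the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token's analysis starts with `*`. The function names are hypothetical, the "development" label for anything under 80% is an assumption (the text only defines "working" and "production"), and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in the stream: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown token is marked with a leading "*" on its only analysis.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return 100.0 * known / len(units)


def state(coverage_percent):
    """Map a coverage figure to the state labels used on this page."""
    if coverage_percent > 90.0:
        return "production"
    if coverage_percent >= 80.0:
        return "working"
    # Assumption: the text does not name a state below ~80%.
    return "development"
```

The states in the table are not necessarily derived this way; several rows pair a state with a coverage figure that this simple rule would classify differently.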

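{| class="wikitable"
|-
!rowspan=2| name
!rowspan=2| Language
!colspan=2| ISO 639
!rowspan=2| formalism
!rowspan=2| state
!rowspan=2| stems
!rowspan=2| paradigms
!rowspan=2| coverage
!rowspan=2| location
!rowspan=2| primary authors
|-
! -2
! -3
|-
| <code>apertium-mkd</code> || Macedonian || <code>mk</code> || <code>mkd</code> || lttoolbox || production ||align="right"| 30,686 ||align="right"| 260 ||align="center"| ~90.5% || apertium-mkd (languages) || Fran, Tihomir, Petkovski
|-
| <code>apertium-hbs</code> || Serbo-Croatian || <code>sh</code> || <code>hbs</code> || lttoolbox || working ||align="right"| 58,004 ||align="right"| 1,092 ||align="center"| ~90.5% || apertium-hbs (languages) || Fran
|-
| <code>apertium-slv</code> || Slovenian || <code>sl</code> || <code>slv</code> || lttoolbox || production ||align="right"| 20,596 ||align="right"| 1,435 ||align="center"| ~90.5% || apertium-hbs-slv (trunk)<br />apertium-slv-pol (incubator)<br />apertium-sl-mk (incubator) || Fran, Petkovski, Peradin, Horvat, Čabrilo, Dimitrijev
|-
| <code>apertium-tur</code> || Turkish || <code>tr</code> || <code>tur</code> || HFST || working ||align="right"| 17,221 ||align="right"| 1 ||align="center"| ~87.3% || apertium-tur (languages) || Fran, Gianluca, Sezgi Aydın
|-
| <code>apertium-bul</code> || Bulgarian || <code>bg</code> || <code>bul</code> || lttoolbox || production ||align="right"| 8,578 ||align="right"| 317 ||align="center"| ~80% || apertium-bul (languages) || Fran, Tihomir
|-
| <code>apertium-sqi</code> || Albanian || <code>sq</code> || <code>sqi</code> || lttoolbox || development ||align="right"| 3,312 ||align="right"| 138 ||align="center"| ~80.2% || apertium-sqi (languages) || Fran
|-
| <code>apertium-ell</code> || Greek || <code>el</code> || <code>ell</code> || lttoolbox || ? ||align="right"| 2,460 ||align="right"| 951 ||align="center"| - || apertium-ell (languages) || Fran
|-
| <code>apertium-rup</code> || Aromanian || <code>-</code> || <code>rup</code> || lttoolbox || ? ||align="right"| 312,005 ||align="right"| 26,193 ||align="center"| - || apertium-rup (incubator) || Fran, shopskasalata
|-
| <code>apertium-ron</code> || Romanian || <code>ro</code> || <code>ron</code> || lttoolbox || ? ||align="right"| ? ||align="right"| ? ||align="center"| - || apertium-ron-rup (incubator) || Fran
|}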

Balkan Language Classification [1]

Existing language pairs

Balkan-Balkan pairs

Italic text denotes a language pair that is under development or in the incubator. Regular text denotes a functioning language pair in staging, and bold text denotes a stable, well-working language pair in trunk.

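{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
|-
| '''bul''' || - || [[mk-bg]] || || || || [[bg-el]] || || ||
|-
| '''mkd''' || || - || || || [[mk-sq]] || || || ||
|-
| '''ron''' || || || - || [[ron-rup]] || || || || ||
|-
| '''rup''' || || || || - || || || || ||
|-
| '''sqi''' || || || || || - || || || ||
|-
| '''ell''' || || || || || || - || || ||
|-
| '''hbs''' || || [[sh-mk]] || || || || || - || [[hbs-slv]] ||
|-
| '''slv''' || || [[sl-mk]] || || || || || || - ||
|-
| '''tur''' || || || || || || || || || -
|}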
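Since the plan calls for resources for every pair, it can help to see how many Balkan-Balkan pairs remain untouched. The sketch below enumerates all unordered pairs of the nine languages and subtracts the seven pairs named in the table above (mk-bg, bg-el, mk-sq, ron-rup, sh-mk, hbs-slv, sl-mk), normalised here to ISO 639-3 codes. The names `LANGS`, `EXISTING` and `missing_pairs` are hypothetical, and a pair counts as existing regardless of how far along it is.

```python
from itertools import combinations

# The nine languages in the table, by ISO 639-3 code.
LANGS = ["bul", "mkd", "ron", "rup", "sqi", "ell", "hbs", "slv", "tur"]

# Pairs named in the Balkan-Balkan table, direction ignored.
EXISTING = {
    frozenset(pair)
    for pair in [
        ("mkd", "bul"),  # mk-bg
        ("bul", "ell"),  # bg-el
        ("mkd", "sqi"),  # mk-sq
        ("ron", "rup"),  # ron-rup
        ("hbs", "mkd"),  # sh-mk
        ("hbs", "slv"),  # hbs-slv
        ("slv", "mkd"),  # sl-mk
    ]
}


def missing_pairs(langs, existing):
    """Unordered language pairs that have no entry in the table yet."""
    return [pair for pair in combinations(langs, 2)
            if frozenset(pair) not in existing]
```

Nine languages give 36 unordered pairs, so with seven started, `missing_pairs(LANGS, EXISTING)` returns 29. Each unordered pair stands for two translation directions.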

Pairs with non-Balkan languages

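{| style="text-align: center;" class="wikitable"
|- style="background: #ececec"
! !! bul !! mkd !! ron !! rup !! sqi !! ell !! hbs !! slv !! tur
|-
| '''el''' || ''[[bg-el]]'' || || || || || || || ||
|-
| '''ru''' || ''[[bg-ru]]'' || || || || || || ''[[hbs-rus]]'' || ||
|-
| '''en''' || ''[[bg-en]]'' || '''[[mk-en]]''' || || || ''[[en-sq]]'' || ''[[ell-eng]]'' || ''[[sh-en]]'' || || ''[[tr-en]]''
|-
| '''it''' || || || ''[[ro-it]]'' || || || || || ''[[sl-it]]'' ||
|-
| '''spa''' || || || || || || || || ''[[slv-spa]]'' ||
|-
| '''pol''' || || || || || || || || ''[[slv-pol]]'' ||
|-
| '''eo''' || ''[[eo-bg]]'' || || || || || ''[[eo-el]]'' || || ||
|-
| '''fr''' || || || ''[[fr-ro]]'' || || || || || ||
|-
| '''ca''' || || || ''[[ca-ro]]'' || || || || || ||
|-
| '''es''' || || || '''[[es-ro]]''' || || || || || ||
|-
| '''cs''' || || || || || || || || ''[[cs-sl]]'' ||
|-
| '''kir''' || || || || || || || || || ''[[tur-kir]]''
|-
| '''tat''' || || || || || || || || || ''[[tur-tat]]''
|-
| '''uzb''' || || || || || || || || || ''[[tur-uzb]]''
|-
| '''aze''' || || || || || || || || || [[tur-aze]]
|-
| '''cv''' || || || || || || || || || ''[[cv-tr]]''
|}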
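The italic / regular / bold convention used in the pair tables maps directly onto wikitext emphasis markup. A small helper like the one below could generate a table cell from a pair name and its status. This is only a sketch: the function name `pair_cell` and the status strings `"incubator"`, `"staging"` and `"trunk"` are hypothetical labels chosen to match the wording of the legend.

```python
def pair_cell(name, status):
    """Render a language-pair link with the emphasis used in the pair tables."""
    link = f"[[{name}]]"
    if status == "incubator":
        # Under development / in the incubator: italic.
        return f"''{link}''"
    if status == "staging":
        # Functioning pair in staging: regular text.
        return link
    if status == "trunk":
        # Stable, well-working pair in trunk: bold.
        return f"'''{link}'''"
    raise ValueError(f"unknown status: {status!r}")
```

For example, `pair_cell("xx-yy", "trunk")` for a made-up pair `xx-yy` yields `'''[[xx-yy]]'''`.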

Existing

Monolingual

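{| class="wikitable"
|-
! Language !! Module !! Paradigms !! Lemmata !! Coverage (SETimes) !! Coverage (Wikipedia)
|-
| Bulgarian || Macedonian and Bulgarian || 305 || 7873 || 88.1% || 77.15%
|-
| Macedonian || Macedonian and Bulgarian || 225 || 8094 || 92.1% ||
|-
| Romanian || Spanish and Romanian || 997 || 18719 || 89.7% || 83.62%
|-
| Aromanian || Incubator || 17 || 28 || - ||
|-
| Albanian || Incubator || 127 || 3302 || 80.2% || 65.62%
|-
| Greek || Incubator || 377 || 859 || 49.4% || 49.75%
|-
| Serbo-Croatian || Incubator || 85 || 660 || - ||
|-
| Slovenian || Incubator || 1128 || 20385 || - ||
|-
| Turkish || (external: TRMorph) || - || 37101 || ||
|}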

Missing languages: Romani

Bilingual

Language pairs

See also

External links