Serbo-Croatian and Macedonian/Final report

Description

The goal of the project was to implement the language pair sh→mk (Serbo-Croatian to Macedonian) in the Apertium framework. During the project we developed a morphological analyser and a constraint grammar for Serbo-Croatian, a bilingual dictionary, and a set of transfer rules. Since Serbo-Croatian encompasses several standard languages, we added corresponding modules for analysis and translation.

Online language resources for both languages are sparse and generally hard to obtain, so a good part of the project consisted of manual work. I'm quite happy that, as a result, there will be a publicly available resource for Bosnian, Serbian and Croatian.

Overall, the work on the project was great fun. It provided me with a chance to go through different stages of NLP and MT, and I learned a lot both software-wise and linguistically.

The Morphological Analyser

sh

The Serbo-Croatian morphological analyser was written almost from scratch, using the current grammar of Croatian and hjp.srce.hr as the main resources. The analysis has two modes, one for the ekavian and one for the ijekavian reflex of yat.
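To illustrate what the two modes mean, here is a minimal sketch of how a single stem can surface differently depending on the reflex of yat. The placeholder notation (`{YAT_LONG}`, `{YAT_SHORT}`) and the function name are hypothetical and chosen only for this example; they do not reflect how the analyser's dictionary files are actually written.

```python
# Hypothetical placeholders marking where historical yat occurs in a stem.
# Ekavian realises both long and short yat as "e";
# ijekavian realises long yat as "ije" and short yat as "je".
REFLEXES = {
    "ekavian": ("e", "e"),
    "ijekavian": ("ije", "je"),
}


def realise(stem: str, mode: str) -> str:
    """Expand the yat placeholders in a stem for the chosen mode."""
    long_reflex, short_reflex = REFLEXES[mode]
    return (stem.replace("{YAT_LONG}", long_reflex)
                .replace("{YAT_SHORT}", short_reflex))


if __name__ == "__main__":
    for stem in ("ml{YAT_LONG}ko", "p{YAT_SHORT}sma", "d{YAT_LONG}te"):
        print(realise(stem, "ekavian"), realise(stem, "ijekavian"))
```

With these stems the sketch yields the pairs mleko/mlijeko, pesma/pjesma and dete/dijete.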

The lexicon was constructed to match the Macedonian dictionary, which contains words from a frequency list made from SETimes.

mk

For generation we used the morphological analyser from Apertium's mk-bg language pair, written by Tihomir Rangelov.

We added 142 new entries and an absolute superlative degree for adjectives. Macedonian and Serbo-Croatian share the feature that every neuter (or masculine) adjective can be used as an adverb, so we added this adverbial reading to the analysis.

Bidix

The bidix is divided into four sections: one common section for analysis, and three for generation (words specific to Bosnian, Croatian or Serbian).

The Constraint Grammar

The CG for Serbo-Croatian is a very good start and has good potential for reuse (e.g. with some modifications it could be used to analyse Slovene). It gives quite good results, but has room for restructuring and improvement. Because it is quite important for the quality of the translation, it is high on my priority list.


Transfer rules

This was the most challenging part of the project. While the languages in the pair are closely related in vocabulary, their grammatical properties are sometimes quite different. The usage of the definite article is quite difficult to infer, and in some cases the grammatical cases of Serbo-Croatian do not translate cleanly into prepositional constructions in Macedonian (for example, the problem of the partitive vs. the possessive genitive). While our transfer rules cover some cases quite well, there is still work to be done here.

Statistics

Dictionaries
  • sh morphological analyser lexicon: 7564 lemmata, 170787 surface forms (including ekavian/ijekavian)
  • apertium-sh-mk.sh-mk.dix (unique: 9985, total: 13032)
Coverage
  • (bs|hr|sr|sh) Wikipedia ( , std. dev.: )
  • (sr|hr) SETimes ( 73.624631444, std. dev.: 0.488418931215)
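As a rough illustration of how a coverage figure with a standard deviation can be obtained, the sketch below counts analysed tokens in analyser output written in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with `*`. It is only a sketch under assumptions: the function names are hypothetical, escaped characters are ignored, and splitting the corpus into parts and using the sample standard deviation are guesses about the procedure, not a description of the script actually used.

```python
import re
from statistics import mean, stdev

# One lexical unit in the Apertium stream format: ^surface/analysis...$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def coverage(analysed: str) -> float:
    """Percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words appear as ^word/*word$ in analyser output.
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)


def coverage_stats(parts):
    """Mean coverage and sample standard deviation over corpus parts."""
    scores = [coverage(part) for part in parts]
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    sample = ("^kuća/kuća<n><f><sg><nom>$ ^xyz/*xyz$ "
              "^je/biti<vbser><pres><p3><sg>$ ^i/i<cnjcoo>$")
    print(coverage(sample))
```

In the sample, three of the four lexical units are analysed, so the coverage is 75%.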
Testvoc
POS      Total    Clean    With @  With #  Clean %
adj      2072699  2072699  0       0       100
vblex    246713   246713   0       0       100
vbmod    96020    96020    0       0       100
np       54190    54190    0       0       100
n        41477    41477    0       0       100
adv      7808     7808     0       0       100
prn      7662     7662     0       0       100
num      4284     4284     0       0       100
vbhaver  224      224      0       0       100
vbser    170      170      0       0       100
pr       77       77       0       0       100
abbr     33       33       0       0       100
cnjsub   30       30       0       0       100
cnjcoo   20       20       0       0       100
vaux     0        0        0       0       100
rel      0        0        0       0       100
preadv   0        0        0       0       100
ij       0        0        0       0       100
guio     0        0        0       0       100
det      0        0        0       0       100
cnjadv   0        0        0       0       100
cm       0        0        0       0       100
Rules

apertium-sh-mk.sh-mk.t1x: 51

apertium-sh-mk.sh-mk.t2x: 11

apertium-sh-mk.sh-mk.t3x: 1

Error rate (realistic results are so far available only for setimes.pilots.txt; the other files were only preliminarily post-edited)
File                  Num. Words  % OOV   WER (Sur)  PER (Sur)  WER (Lem)  PER (Lem)
setimes.pilots.txt    454         0.44%   29.96%     20.48%     -          -
setimes.tablice.txt   466         0.43%   12.23%     9.23%      -          -
setimes.klupa.txt     477         18.12%  14.68%     12.37%     -          -
setimes.povijest.txt  519         14.18%  11.95%     9.25%      -          -
wikipedia.txt         -           -       -          -          -          -
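For readers unfamiliar with the two metrics, the sketch below computes word error rate (WER) and position-independent error rate (PER) for a machine-translated sentence against its post-edited reference. It uses one common definition of each metric and hypothetical function names; the evaluation tool actually used for the figures above may differ in details such as tokenisation.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "a b c d".split()
    hypothesis = "b a c d".split()
    print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, a translation with the right words in the wrong order is penalised by WER but not by PER, which is consistent with the PER figures in the table being lower than the WER figures.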

Future work

Working on Apertium has opened my mind greatly regarding MT and NLP, and given me some new ideas. As I said earlier, the CG is an invaluable module, so I would like to improve it. The lexicon for Serbo-Croatian is relatively small and does not cover some quite common words, so I would like to expand it further, using a frequency list based on Bosnian, Serbian or Croatian texts. The transfer rules for this language pair are a work in progress, so I would also like to continue working on them.