Serbo-Croatian and Macedonian/Final report
The goal of the project was to implement the language pair sh→mk (Serbo-Croatian to Macedonian) in the Apertium framework. During the project we developed a morphological analyser and a constraint grammar for Serbo-Croatian, a bilingual dictionary, and a set of transfer rules. Since Serbo-Croatian encompasses several standard languages, we added corresponding modules for analysis and translation.
Online language resources for both languages are sparse and generally hard to reach, so a good part of the project was manual work, with no software resources to speed it up. I'm quite happy that after this there will be a publicly available resource for Bosnian, Serbian and Croatian.
The translation is not yet top quality; a native speaker has described it as 'overall fine', meaning that there are errors, but the output is easy to postedit.
Overall, the work on the project was great fun. It provided me with a chance to go through different stages of NLP and MT, and I learned a lot both software-wise and linguistically.
The Morphological Analyser
The Serbo-Croatian morphological analyser was written almost from scratch, using the current grammar of Croatian and hjp.srce.hr as the main resources. The analyser has two modes, one for the ekavian and one for the ijekavian reflex of yat. The lexicon was constructed to match the Macedonian dictionary, which contains words from a frequency list made from SETimes. To my knowledge, it is currently the only open-source morphological analyser available for Croatian.
For generation we used the morphological analyser from Apertium's mk-bg language pair, written by Tihomir Rangelov. We made some additions to cover the new words in the text we used for testing (142 new entries) and added an absolute superlative degree for adjectives. Also, Macedonian and Serbo-Croatian share the feature that every neuter (or masculine) adjective can be used as an adverb, so we added this to the analysis.
Since the dictionary is intended to cover the Bosnian, Croatian and Serbian standards, the bidix is divided into corresponding sections. One common section is used for analysis, and four are meant for generation mk→sh: they contain words that are either common to all standards or specific to Bosnian, Croatian or Serbian. Typically, if a term varies between standards, all variants are in the analysis section, and the local variants are sorted individually into their respective sections. This was the least fun part of the work: since no word-by-word list was available, most of it was done manually.
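The section layout described above can be pictured with a small Python sketch. This is only an illustration of the idea, not the actual bidix format: the dictionary names, the section keys (`common`, `bs`, `hr`, `sr`) and the sample entries are assumptions chosen for the example.

```python
# Illustrative data only: section names and entries are assumptions,
# not copied from apertium-sh-mk.sh-mk.dix.

# Analysis direction (sh → mk): every variant is listed, whatever the standard.
ANALYSIS = {
    "voda": "вода",
    "hleb": "леб",
    "hljeb": "леб",
    "kruh": "леб",
}

# Generation direction (mk → sh): one common section plus one per standard.
GENERATION = {
    "common": {"вода": "voda"},
    "bs": {"леб": "hljeb"},
    "hr": {"леб": "kruh"},
    "sr": {"леб": "hleb"},
}


def sh_to_mk(word):
    """Look a Serbo-Croatian word up in the shared analysis section."""
    return ANALYSIS.get(word)


def mk_to_sh(word, standard):
    """Prefer the entry specific to the requested standard, else the common one."""
    specific = GENERATION[standard].get(word)
    if specific is not None:
        return specific
    return GENERATION["common"].get(word)


if __name__ == "__main__":
    print(sh_to_mk("kruh"))
    print(mk_to_sh("леб", "hr"))
    print(mk_to_sh("вода", "sr"))
```

The point of the split is that analysis accepts any variant, while generation produces only the variant that belongs to the chosen standard.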
The Constraint Grammar
The CG for Serbo-Croatian currently has 124 rules. It is a very good start and has good potential for reuse (e.g. with some modifications it could be used to disambiguate Slovene). It gives quite good results, but I believe its performance can be greatly improved by restructuring and reformulating some of the rules. We have also barely tapped its power regarding functional tags, so I think that comes next. The CG is quite important for the quality of the translation, so it's high on my priority list.
Writing the transfer rules was the most challenging part of the project. While the languages in the pair are closely related in vocabulary, their grammatical properties are sometimes quite different. The usage of the definite article is quite difficult to infer, and in some cases the grammatical cases of Serbo-Croatian do not translate clearly to prepositional constructions in Macedonian (the problem with partitive vs. possessive genitive). Since the word order in both languages is very free, problems also appear with long-distance relationships between words. While our transfer rules cover some cases quite well, there is still work to be done.
apertium-sh-mk.sh.metadix: 7564 lemmata, 170787 surface forms
apertium-sh-mk.sh-mk.dix: (unique: 9985, total: 13032)
apertium-sh-mk.mk.dix: 8672 lemmata, 157734 surface forms
- (bs|hr|sr|sh) Wikipedia (73.1289554019, std. dev.: 0.365491364947)
- (sr|hr) SETimes (77.9652051928, std. dev.: 3.42750034988)
(The 1st quarter of the SETimes corpus contains a lot of text in English, as well as text without proper diacritics)
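As a rough illustration of how a coverage figure with a standard deviation like the ones above can be obtained, here is a minimal Python sketch. It assumes naive coverage (the percentage of tokens that receive at least one analysis), computed separately on equal-sized parts of the corpus and then averaged. The `known` set is a stand-in for the morphological analyser, and the function names, the number of parts and the choice of population standard deviation are all assumptions, not the procedure actually used for this report.

```python
from math import ceil
from statistics import mean, pstdev


def naive_coverage(tokens, known):
    """Percentage of tokens that the (stand-in) analyser recognises."""
    if not tokens:
        raise ValueError("cannot compute coverage of an empty token list")
    hits = sum(1 for token in tokens if token in known)
    return 100.0 * hits / len(tokens)


def coverage_over_parts(tokens, known, parts=4):
    """Split the corpus into consecutive parts; return mean and std. dev. of coverage."""
    size = ceil(len(tokens) / parts)
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    scores = [naive_coverage(chunk, known) for chunk in chunks]
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    corpus = ["a", "b", "c", "d", "a", "x", "y", "a"]
    lexicon = {"a", "b", "c", "d"}
    print(coverage_over_parts(corpus, lexicon, parts=2))
```

A part of the corpus that contains a lot of foreign or badly encoded text lowers the score for that part and raises the standard deviation, which matches the remark about the first quarter of SETimes.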
|POS||Total||Clean||With @||With #||Clean %|
- Error rate (realistic results for now only for setimes.pilots.txt; the rest is only preliminarily postedited)
|File||Num. Words||% OOV||WER (Sur)||PER (Sur)||WER (Lem)||PER (Lem)|
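For readers unfamiliar with the WER and PER columns, the sketch below shows one common way to compute them in Python. WER is taken here as the word-level edit distance between the machine output and the postedited reference, divided by the reference length; PER is the same idea with word order ignored. The exact formulas used by the evaluation script behind this report may differ, the function names are my own, and the (Sur) and (Lem) columns presumably correspond to running the same computation over surface forms and over lemmata.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # drop a hypothesis word
                           cur[j - 1] + 1,           # add a missing reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate in percent (word order ignored)."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    errors = max(len(hyp), len(ref)) - matches
    return 100.0 * errors / len(ref)


if __name__ == "__main__":
    mt_output = "the cat sat on mat".split()
    postedited = "the cat sat on the mat".split()
    print(wer(mt_output, postedited), per(mt_output, postedited))
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, and the gap between the two gives a rough idea of how many errors are due to reordering.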
Working on Apertium has opened my mind greatly regarding MT and NLP, and given me some new ideas. As I said earlier, the CG is an invaluable module, so I would like to improve it. The lexicon for Serbo-Croatian is relatively small and doesn't cover some quite common words, so I would like to expand it further. The transfer rules for this language pair are a work in progress, so that is what I will naturally continue to work on.
Many thanks to: my mentor Francis Tyers, whose omnipresence on the IRC channel (availability), great patience and willingness to help are hard to match, Dime Mitrevski for telling me about the Macedonian language in the detail I needed, Tihomir Rangelov for advice in the beginning stages of this work, Tino Didriksen and Eckhard Bick for CG, and the rest of you guys from the IRC for answering questions when I needed it, and for filling my XChat window with fun stuff while I was working :)