Difference between revisions of "Serbo-Croatian and Macedonian/Final report"

Revision as of 17:09, 27 August 2011

Description

The goal of the project was to implement the language pair sh→mk (Serbo-Croatian to Macedonian) in the Apertium framework. During the project we developed a morphological analyser and a constraint grammar for Serbo-Croatian, a bilingual dictionary, and a set of transfer rules. Since Serbo-Croatian encompasses several standard languages, we added corresponding modules for analysis and translation.

Online language resources for both languages are sparse and generally hard to reach, so a good part of the project was manual work, as there were no software resources to speed it up. I'm quite happy that after this there will be a publicly available resource for Bosnian, Serbian and Croatian.

The translation is not yet top quality; a native speaker described it as 'overall fine', meaning that there are errors but the output is easily post-editable.

Overall, the work on the project was great fun. It provided me with a chance to go through different stages of NLP and MT, and I learned a lot both software-wise and linguistically.

The Morphological Analyser

sh

The Serbo-Croatian morphological analyser was written almost from scratch, using the current grammar of Croatian and hjp.srce.hr as the main resources. The analysis has two modes, one for the ekavian and one for the ijekavian reflex of yat, plus an additional 'scientific' mode that renders yat as 'ě'. The lexicon was constructed to match the Macedonian dictionary, which contains words from a frequency list made from SETimes. To my knowledge it is currently the only open-source morphological analyser available for Croatian, and probably also for Bosnian and Serbian. I am quite happy with it, and after some proofing I'll continue to work on expanding the lexicon.
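The idea of the three yat modes can be illustrated with a minimal Python sketch. The notation is purely hypothetical and is not the analyser's actual format: here "ě" stands for a short yat and "ě:" for a long one, and the well-known exceptions (for example yat after r, or before j and o) are ignored.

```python
# Hypothetical notation for illustration only: "ě" = short yat, "ě:" = long yat.
# This is NOT how the Apertium dictionary encodes yat; exceptions are ignored.
REFLEXES = {
    "ekavian": {"long": "e", "short": "e"},
    "ijekavian": {"long": "ije", "short": "je"},
    "scientific": {"long": "ě", "short": "ě"},
}


def render_yat(form: str, mode: str) -> str:
    """Render an abstract form in one of the three yat modes."""
    reflex = REFLEXES[mode]
    # Replace the long yat first so that "ě:" is not consumed by the short rule.
    return form.replace("ě:", reflex["long"]).replace("ě", reflex["short"])
```

For instance, `render_yat("mlě:ko", "ijekavian")` gives "mlijeko" and `render_yat("pěsma", "ekavian")` gives "pesma".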

mk

For generation we used the morphological analyser from Apertium's mk-bg language pair, written by Tihomir Rangelov. We made some additions to cover the new words in the text we used for testing (142 new entries) and added an absolute superlative degree for adjectives. Also, Macedonian and Serbo-Croatian share the feature that every neuter (or masculine) adjective can be used as an adverb, so we added this to the analysis.

Bidix

Since the intent is for the dictionary to cover the Bosnian, Croatian and Serbian standards, the bidix is divided into corresponding sections. One common section is used for analysis, and four are meant for mk→sh generation. They contain words that are either common to all standards or specific to Bosnian, Croatian or Serbian. Typically, if a term varies between standards, all variants are in the analysis section, and the local variants are sorted individually into their respective sections. This was the least fun part of the work: since there was no word-by-word list available, most of it was done manually.
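The sorting of variants into sections described above could look roughly like the following Python sketch. The section names ("analysis", "common", "bs", "hr", "sr"), the input shape and the example words are assumptions made for illustration; the real bidix is an XML file and was compiled largely by hand.

```python
from collections import defaultdict


def build_sections(entries):
    """Sort entries into one analysis section and four generation sections.

    entries: list of (mk_lemma, {"bs": ..., "hr": ..., "sr": ...}) pairs.
    Section names are hypothetical, not the identifiers used in the bidix.
    """
    sections = defaultdict(list)
    for mk, variants in entries:
        forms = set(variants.values())
        # Analysis (sh→mk): every variant must be recognised.
        for sh in sorted(forms):
            sections["analysis"].append((sh, mk))
        if len(forms) == 1:
            # Generation (mk→sh): the term is the same in all standards.
            sections["common"].append((mk, next(iter(forms))))
        else:
            # Generation (mk→sh): each standard gets its own local variant.
            for standard, sh in sorted(variants.items()):
                sections[standard].append((mk, sh))
    return dict(sections)


example = [
    ("вода", {"bs": "voda", "hr": "voda", "sr": "voda"}),
    ("леб", {"bs": "hljeb", "hr": "kruh", "sr": "hleb"}),
]
sections = build_sections(example)
```

With this input, "voda" lands in the common section, while "hljeb", "kruh" and "hleb" all appear in the analysis section and each goes into its own standard's generation section.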

The Constraint Grammar

The CG for Serbo-Croatian currently has 124 rules. It's a very good start and has good potential for reuse (e.g. with some modifications it could be used to disambiguate Slovene). It gives quite good results, but its performance can, I believe, be greatly improved by restructuring and reformulating some of the rules. We have also barely tapped its power regarding functional tags, so I think that comes next. The CG is quite important for the quality of the translation, so it's high on my priority list.

Transfer rules

This was the most challenging part of the project. While the languages in the pair are very closely related in vocabulary, their grammatical properties are sometimes quite different. The usage of the definite article is quite difficult to infer, and in some cases the grammatical cases of Serbo-Croatian do not translate clearly to prepositional constructions in Macedonian (the problem of partitive vs. possessive genitive). Since the word order in both languages is very free, problems also appear with long-distance relationships between words. While our transfer rules cover some cases quite well, there is still work to be done.

Statistics

Dictionaries
  • apertium-sh-mk.sh.metadix: 7564 lemmata, 170787 surface forms
  • apertium-sh-mk.sh-mk.dix: (unique: 9985, total: 13032)
  • apertium-sh-mk.mk.dix: 8672 lemmata, 157734 surface forms
Coverage
  • (bs|hr|sr|sh) Wikipedia: 73.1289554019 (std. dev.: 0.365491364947)
  • (sr|hr) SETimes: 77.9652051928 (std. dev.: 3.42750034988)

(The first quarter of the SETimes corpus contains a lot of English text, as well as text without proper diacritics.)
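As a rough illustration of what a coverage figure measures, here is a minimal Python sketch of naive coverage: the percentage of tokens that receive at least one analysis. It assumes input in the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis. Whether the figures above were computed exactly this way is an assumption, and the tags in the example are illustrative.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def naive_coverage(analysed: str) -> float:
    """Return the percentage of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # An analysis starting with "*" marks a word the analyser does not know.
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


sample = "^kuća/kuća<n><f><sg><nom>$ ^xyz/*xyz$"
coverage = naive_coverage(sample)
```

In the sample, one of the two tokens is unknown, so the naive coverage is 50.0.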

Testvoc
POS      Total    Clean    With @  With #  Clean %
adj      2072699  2072699  0       0       100
vblex    246713   246713   0       0       100
vbmod    96020    96020    0       0       100
np       54190    54190    0       0       100
n        41477    41477    0       0       100
adv      7808     7808     0       0       100
prn      7662     7662     0       0       100
num      4284     4284     0       0       100
vbhaver  224      224      0       0       100
vbser    170      170      0       0       100
pr       77       77       0       0       100
abbr     33       33       0       0       100
cnjsub   30       30       0       0       100
cnjcoo   20       20       0       0       100
vaux     0        0        0       0       100
rel      0        0        0       0       100
preadv   0        0        0       0       100
ij       0        0        0       0       100
guio     0        0        0       0       100
det      0        0        0       0       100
cnjadv   0        0        0       0       100
cm       0        0        0       0       100
Rules

apertium-sh-mk.sh-mk.rlx: 124

apertium-sh-mk.sh-mk.t1x: 51

apertium-sh-mk.sh-mk.t2x: 11

apertium-sh-mk.sh-mk.t3x: 1

Error rate
File                  Num. Words  % OOV   WER (Sur)  PER (Sur)  WER (Lem)  PER (Lem)
setimes.pilots.txt    454         0.44%   29.96%     20.48%     -          -
setimes.tablice.txt   470         0.43%   48.12%     34.58%     -          -
setimes.klupa.txt     480         18.12%  60.36%     46.81%     -          -
setimes.povijest.txt  529         14.18%  53.35%     40.52%     -          -

(The results were obtained by comparison with post-edited articles; however, besides correcting the grammar, the [human] translator also modified the style and vocabulary.)
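For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them. WER is the word-level edit distance divided by the reference length. PER is its position-independent counterpart, which ignores word order. The PER formula used here is one of several variants in the literature, and this is not necessarily the tool or definition that produced the numbers above. The reference is assumed to be non-empty.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent (reference must be non-empty)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (one common variant)."""
    # Bag-of-words overlap: word order is ignored.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "a b c d".split()
hypothesis = "b a c d".split()
scores = (wer(reference, hypothesis), per(reference, hypothesis))
```

In the example the hypothesis contains the right words in the wrong order, so WER is 50.0 while PER is 0.0. This is why PER is consistently lower than WER in the table for a pair with free word order.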

Future work

Working on Apertium has opened my mind greatly regarding MT and NLP, and given me some new ideas. As I said earlier, the CG is an invaluable module, so I would like to improve it. The lexicon for Serbo-Croatian is relatively small and doesn't cover some quite common words, so I would like to expand it further. Transfer rules for this language pair are a work in progress, so I will naturally continue to work on them.

Thanks

Many thanks to: my mentor Francis Tyers, whose omnipresence on the IRC channel (availability), great patience and willingness to help are hard to match; Dime Mitrevski, for telling me about the Macedonian language in the detail I needed; Tihomir Rangelov, for advice in the beginning stages of this work; Tino Didriksen and Eckhard Bick, for CG; and the rest of you guys from the IRC, for answering questions when I needed it and for filling my XChat window with fun stuff while I was working :)