Difference between revisions of "Serbo-Croatian and Macedonian/Final report"

Revision as of 07:12, 26 August 2011

Description

The goal of the project was to implement the language pair sh→mk (Serbo-Croatian to Macedonian) in the Apertium framework. During the project we developed a morphological analyser and a constraint grammar for Serbo-Croatian, a bilingual dictionary, and a set of transfer rules. Since Serbo-Croatian encompasses several standard languages, we added corresponding modules for analysis and translation.

Online language resources for both languages are sparse and generally hard to obtain, so a good part of the project was manual work, as there were no software resources to speed it up. I'm quite happy that after this there will be a publicly available resource for Bosnian, Serbian and Croatian.

The translation is not yet top quality; a native speaker described it as 'overall fine', meaning that there are errors, but the output is easy to post-edit.

Overall, the work on the project was great fun. It provided me with a chance to go through different stages of NLP and MT, and I learned a lot both software-wise and linguistically.

The Morphological Analyser

sh

The Serbo-Croatian morphological analyser was written almost from scratch, using the current grammar of Croatian and hjp.srce.hr as the main resources. The analysis has two modes, one for the ekavian and one for the ijekavian reflex of yat. The lexicon was constructed to match the Macedonian dictionary, which contains words from a frequency list made from SETimes. To my knowledge it is currently the only open-source morphological analyser available for Croatian.

mk

For generation we used the morphological analyser from Apertium's mk-bg language pair, written by Tihomir Rangelov. We made some additions to cover the new words in the text we used for testing (142 new entries) and added an absolute superlative degree for adjectives. Also, Macedonian and Serbo-Croatian share the feature that every neuter (or masculine) adjective can be used as an adverb, so we added this to the analysis.

Bidix

Since the dictionary is intended to cover the Bosnian, Croatian and Serbian standards, the bidix is divided into corresponding sections. One common section is used for analysis, and four are meant for generation mk→sh; they contain words that are either common to all standards or specific to Bosnian, Croatian or Serbian. Typically, if a term varies between standards, all variants are in the analysis section, and the local variants are sorted individually into their respective sections. This was the least fun part of the work: since there was no word-by-word list available, most of it was done manually.
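The sectioning described above can be pictured with a small toy model in Python. This is only a sketch of the idea, not the actual bidix format (which is XML): the dictionary names `ANALYSIS` and `GENERATION`, the section keys and the two example entries are assumptions chosen for illustration.

```python
# Toy model of the bidix sections (hypothetical names and entries).
# Analysis (sh→mk): every variant of a term is listed, whatever its standard.
ANALYSIS = {
    "kruh": "леб",    # Croatian variant
    "hleb": "леб",    # Serbian (ekavian) variant
    "hljeb": "леб",   # Bosnian (ijekavian) variant
    "voda": "вода",   # shared by all standards
}

# Generation (mk→sh): one common section plus one section per standard.
GENERATION = {
    "common": {"вода": "voda"},
    "bs": {"леб": "hljeb"},
    "hr": {"леб": "kruh"},
    "sr": {"леб": "hleb"},
}


def to_macedonian(word):
    """Look a Serbo-Croatian word up in the shared analysis section."""
    return ANALYSIS.get(word)


def from_macedonian(word, standard):
    """Prefer the entry specific to the standard, else fall back to common."""
    specific = GENERATION[standard]
    return specific.get(word, GENERATION["common"].get(word))


print(from_macedonian("леб", "hr"))
print(from_macedonian("вода", "sr"))
```

The fallback order (standard-specific first, then common) is a design choice of this sketch; it simply reflects that local variants should win when a term differs between standards.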

The Constraint Grammar

The CG for Serbo-Croatian is a very good start and has good potential for reuse (e.g. with some modifications it could be used to analyse Slovene). It gives quite good results, but has room for restructuring and improvement. We have also barely tapped its power regarding functional tags, so I think that comes next. It is quite important for the quality of the translation, so it is high on my priority list.

Transfer rules

This was the most challenging part of the project. While the languages in the pair are closely related in vocabulary, their grammatical properties are sometimes quite different. The usage of the definite article is quite difficult to infer, and in some cases the grammatical cases of Serbo-Croatian do not translate clearly to prepositional constructions in Macedonian (the problem of partitive vs. possessive genitive). Since the word order in both languages is very free, problems also appear with long-distance dependencies between words. While our transfer rules cover some cases quite well, there is still work to be done.

Statistics

Dictionaries

apertium-sh-mk.sh.metadix: 7564 lemmata, 170787 surface forms
apertium-sh-mk.sh-mk.dix: (unique: 9985, total: 13032)
apertium-sh-mk.mk.dix: 8672 lemmata, 157734 surface forms

Coverage

(bs|hr|sr|sh) Wikipedia (73.1289554019, std. dev.: 0.365491364947)
(sr|hr) SETimes (77.9652051928, std. dev.: 3.42750034988)

(The 1st quarter of the SETimes corpus contains a lot of text in English, as well as text without proper diacritics)
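The coverage figures are presumably naive coverage, i.e. the share of tokens for which the analyser returns at least one analysis. Below is a minimal Python sketch of how such a figure could be computed, assuming the analyser output is in the Apertium stream format, where an unknown word is marked with a leading asterisk. The sample sentence and its tags are illustrative only, and the report does not say whether the sample or the population standard deviation was used, so `pstdev` is an assumption.

```python
import re
import statistics

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Percentage of tokens that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is echoed back with a leading asterisk: ^foo/*foo$
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return 100.0 * known / len(units)


def coverage_summary(chunks):
    """Mean coverage and population standard deviation over corpus chunks."""
    scores = [coverage(chunk) for chunk in chunks]
    return statistics.mean(scores), statistics.pstdev(scores)


# Illustrative input; the tags are not taken from the real analyser.
sample = ("^kuća/kuća<n><f><sg><nom>$ "
          "^je/biti<vbser><pres><p3><sg>$ "
          "^xyzzy/*xyzzy$")
print(coverage(sample))
```

Splitting the corpus into chunks and reporting mean and standard deviation, as in `coverage_summary`, gives a rough idea of how much coverage varies across the corpus.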

Testvoc

POS	Total	Clean	Clean %
adj	2072699	2072699	100
vblex	246713	246713	100
vbmod	96020	96020	100
np	54190	54190	100
n	41477	41477	100
adv	7808	7808	100
prn	7662	7662	100
num	4284	4284	100
vbhaver	224	224	100
vbser	170	170	100
pr	77	77	100
abbr	33	33	100
cnjsub	30	30	100
cnjcoo	20	20	100
vaux	0	0	100
rel	0	0	100
preadv	0	0	100
ij	0	0	100
guio	0	0	100
det	0	0	100
cnjadv	0	0	100
cm	0	0	100
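A per-category tally like the one in the table can be sketched in Python as follows. This is not the actual testvoc script: it assumes each testvoc line shows the source analysis, the transferred analysis and the generated surface form separated by arrows, and that a failed translation leaves an `@` (missing bidix entry) or `#` (generation failure) mark in the final surface form. The sample lines and their tags are hypothetical.

```python
import re
from collections import Counter

FIRST_TAG = re.compile(r"<([^>]+)>")
ARROW = re.compile(r"\s-+>\s")  # stage separator such as " --------> "


def tally(lines):
    """Return {pos: (total, clean, clean_percent)} for testvoc-style lines."""
    total, clean = Counter(), Counter()
    for line in lines:
        stages = ARROW.split(line.strip())
        match = FIRST_TAG.search(stages[0])
        if len(stages) < 2 or match is None:
            continue  # not a testvoc line in the assumed format
        pos = match.group(1)  # first tag of the source analysis
        total[pos] += 1
        # Only the final stage (the generated surface form) is inspected.
        surface = stages[-1]
        if "#" not in surface and "@" not in surface:
            clean[pos] += 1
    return {pos: (total[pos], clean[pos], 100.0 * clean[pos] / total[pos])
            for pos in total}


# Hypothetical sample lines; the tags are illustrative only.
lines = [
    "^kuća<n><f><sg><nom>$ --------> ^куќа<n><f><sg><nom>$ --------> куќа",
    "^pas<n><m><sg><nom>$ --------> ^@pas<n><m><sg><nom>$ --------> @pas",
    "^lijep<adj><m><sg>$ --------> ^убав<adj><m><sg>$ --------> #убав",
    "^dobar<adj><m><sg>$ --------> ^добар<adj><m><sg>$ --------> добар",
]
print(tally(lines))
```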

Rules

apertium-sh-mk.sh-mk.t1x: 51

apertium-sh-mk.sh-mk.t2x: 11

apertium-sh-mk.sh-mk.t3x: 1

Error rate

(For now, only the results for setimes.pilots.txt are realistic; the remaining files have only been preliminarily post-edited.)

File	Num. Words	% OOV	WER (Sur)	PER (Sur)	WER (Lem)	PER (Lem)
`setimes.pilots.txt`	454	0.44%	29.96%	20.48%	-	-
`setimes.tablice.txt`	466	0.43%	12.23%	9.23%	-	-
`setimes.klupa.txt`	477	18.12%	14.68%	12.37%	-	-
`setimes.povijest.txt`	519	14.18%	11.95%	9.25%	-	-
`wikipedia.txt`	-	-	-	-	-	-
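For readers unfamiliar with the two measures, here is a minimal Python sketch of word error rate (WER) and position-independent error rate (PER), computed between a post-edited reference and the raw MT output. WER is the word-level edit distance divided by the reference length. PER has several definitions in the literature; the bag-of-words variant below is one common choice and is not necessarily the one used by the evaluation tool that produced the table.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete ref_word
                            curr[j - 1] + 1,      # insert hyp_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (bag-of-words variant)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


reference = "a b c d".split()   # post-edited text (placeholder tokens)
hypothesis = "b a c d".split()  # raw MT output (placeholder tokens)
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER for the same pair of texts under this definition, which matches the pattern in the table.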

Future work

Working on Apertium has opened my mind greatly regarding MT and NLP, and given me some new ideas. As I said earlier, the CG is an invaluable module, so I would like to improve it. The lexicon for Serbo-Croatian is relatively small and doesn't cover some quite common words, so I would like to expand it further. The transfer rules for this language pair are a work in progress, so I will naturally continue working on them.

Thanks

Many thanks to: my mentor Francis Tyers, whose omnipresence on the IRC channel (availability), great patience and willingness to help are hard to match, Dime Mitrevski for telling me about the Macedonian language in the detail I needed, Tihomir Rangelov for advice in the beginning stages of this work, Tino Didriksen and Eckhard Bick for CG, and the rest of you guys from the IRC for answering questions when I needed it, and for filling my XChat window with fun stuff while I was working :)
