Revision as of 10:33, 29 August 2011

Description

The goal of this project was to implement a new language pair, Slovenian-Spanish, in the Apertium translation system. During the project, a complete revision and an almost complete rewrite of the Slovenian monolingual dictionary proved necessary. The bilingual dictionary was created from scratch, and transfer rules were added according to the contrastive grammar (some basic rules were added; the rest is left for the future). Additional work was done on the POS tagger, which was producing erroneous results, mostly due to tagset differences.

The work is presented in more detail in this report.

Morphological analyser

Slovenian

As mentioned above, the Slovenian morphology was completely revised and almost completely rewritten. The first version of the SL morphology was taken from the sl-mk language pair and had been generated by Jernej Vičič. The revision consisted of eliminating non-existent words (lemmas) and duplicates; the rewrite consisted of sorting the existing paradigm entries, adding new lemmas with dual meaning and adding new paradigm entries for verbs and adverbs. Following Jimmy's instructions, nouns derived from verbs (in Slovene we call them "glagolniki") have been tagged as vblex.ger, which is the gerund form of the verb. The aspect of verbs has been tagged as perfective or imperfective; this required further changes in the bilingual dictionary and in the transfer rules, because Spanish does not have perfective/imperfective forms.

I have to say I was lucky in this part of the project, because several online tools helped me with the Slovenian morphology.
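The duplicate check described above can be illustrated with a short sketch. This is not the script used in the project; it assumes the usual monolingual dictionary layout, where each entry is an `<e>` element with an `lm` attribute and a `<par n="..."/>` child, and the sample entries and paradigm names are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; lemmas and paradigm names are illustrative only.
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="hiša"><i>hiš</i><par n="hiš/a__n"/></e>
    <e lm="pisanje"><i>pisanj</i><par n="pisanj/e__vblex"/></e>
    <e lm="hiša"><i>hiš</i><par n="hiš/a__n"/></e>
  </section>
</dictionary>"""


def find_duplicate_entries(dix_xml):
    """Return (lemma, paradigm) pairs that occur more than once in <section>."""
    root = ET.fromstring(dix_xml)
    keys = []
    # Only look inside <section>, so entries inside paradigm definitions are skipped.
    for section in root.iter("section"):
        for entry in section.iter("e"):
            lemma = entry.get("lm")
            if lemma is None:
                continue
            par = entry.find("par")
            paradigm = par.get("n") if par is not None else ""
            keys.append((lemma, paradigm))
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


duplicates = find_duplicate_entries(SAMPLE_DIX)
print(duplicates)
```

Two entries count as duplicates here only if both the lemma and the paradigm match, so a lemma that legitimately appears with two different paradigms is not reported.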

The final version of the Slovenian morphology contains 20724 lemmas. I think it's a nice number and I'm very happy with this result.

Spanish

The Spanish morphology was taken from the en-es language pair. During the project, additional lemmas were added, either taken from the es-ca language pair or entered manually. All proper names from the SL morphology were also added to the Spanish morphology. No major changes were made to the Spanish morphology.

The final version of the Spanish morphology contains 35683 lemmas.

Bilingual dictionary

The bilingual dictionary was built from scratch. Different approaches were used to generate bilingual entries (intersections between sl-it and it-es lemmas, intersections between sl-en and en-es lemmas, Google Translate, manual translation, etc.), but in the end all entries had to be checked manually and, where needed, fixed. During this period I wrote many scripts to generate and check translations as quickly as possible. Most of them are written in PHP, the rest in AWK. The scripts are in the dev/Scripts folder, so feel free to use them.

The main goal was to cover 10000 translations (lemmas/entries), and we reached it. The final version of the bilingual dictionary contains 10823 entries (lemmas). I'm very happy with this result too.
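The intersection approach can be sketched as follows. This is only an illustration of the idea, written in Python rather than the PHP and AWK of the project scripts; the function name `pivot_candidates` and the word pairs are made up, and the real scripts may work differently.

```python
from collections import defaultdict


def pivot_candidates(src_to_pivot, pivot_to_tgt):
    """Pair source and target lemmas that share a pivot-language lemma.

    Both arguments are iterables of (lemma, translation) pairs.
    The result maps each source lemma to a sorted list of candidates.
    """
    index = defaultdict(set)
    for pivot, tgt in pivot_to_tgt:
        index[pivot].add(tgt)

    candidates = defaultdict(set)
    for src, pivot in src_to_pivot:
        for tgt in index.get(pivot, ()):
            candidates[src].add(tgt)
    return {src: sorted(tgts) for src, tgts in candidates.items()}


# Hypothetical sl-it and it-es lemma pairs, for illustration only.
sl_it = [("hiša", "casa"), ("pes", "cane"), ("mačka", "gatto")]
it_es = [("casa", "casa"), ("cane", "perro"), ("cane", "can")]

candidates = pivot_candidates(sl_it, it_es)
print(candidates)
```

A pivot lemma with several meanings yields several candidates, some of them wrong, which is one reason every generated entry still had to be checked by hand.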

POS tagger

The POS tagger was taken from the sl-mk language pair, but it was producing erroneous results, mostly due to tagset differences. We had to regenerate it from scratch with a new corpus.

Transfer rules

Transfer rules were the last part of the project. Since I had to learn almost everything about transfer rules, this was the most challenging part. We did not have much time, so we decided to write the basic rules and macros and to extend the existing macros somewhat, in order to make the upcoming work in t2x and t3x easier. Most of the time we had to deal with the "Masculine Animate" and "Masculine Inanimate" genders, which are not present in the Spanish morphology, and with the "Dual" -> "Plural" macro. These are just examples of the "problems" we had to take care of.

This part of the project was the most complicated one, and we did our best. We did not have enough time to solve all the problems at this stage, but we at least covered the basic rules and prepared the ground for future work.
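The effect of the gender and number macros can be pictured with a small sketch. The real rules are written in the XML transfer format of the t1x file, not in Python; the code below only shows the mapping itself, and it assumes the tag names `ma`, `mi`, `m`, `du` and `pl` for masculine animate, masculine inanimate, masculine, dual and plural.

```python
# Assumed tag names; the actual t1x macros may use different symbols.
TAG_MAP = {
    "ma": "m",   # masculine animate   -> masculine
    "mi": "m",   # masculine inanimate -> masculine
    "du": "pl",  # dual                -> plural
}


def map_tags(tags):
    """Replace source-only gender/number tags; leave all other tags untouched."""
    return [TAG_MAP.get(tag, tag) for tag in tags]


print(map_tags(["n", "mi", "du", "nom"]))  # ['n', 'm', 'pl', 'nom']
```

The sketch covers gender and number only. Anything else a real rule would handle, such as agreement across a chunk, is left out.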

Statistics

Dictionaries

apertium-sl-es.sl.dix: 20724 lemmata, 549817 surface forms
apertium-sl-es.sl-es.dix: (unique: 9985, total: 13032)

Coverage

Corpus	Words	Average coverage	Standard deviation
MULTEXT-EAST (Orwell)	104482	93.08%	0.29%
OPUS (subtitles)	2562969	89.515%	0.23%
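Coverage here is the share of corpus tokens for which the analyser returns at least one analysis. A minimal sketch of that calculation is given below, assuming the analyser output is in the Apertium stream format, where each token is written as `^surface/analysis$` and an unknown word gets an analysis starting with `*`. The sample line and its tags are made up, and escaped characters in the stream are not handled.

```python
import re

# One token: ^surface/analyses$ (escaped characters are ignored in this sketch).
TOKEN_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of tokens that received at least one analysis."""
    tokens = TOKEN_RE.findall(analysed_text)
    if not tokens:
        return 0.0
    known = sum(1 for _surface, analyses in tokens if not analyses.startswith("*"))
    return 100.0 * known / len(tokens)


# Hypothetical analyser output; 3 of the 4 tokens are known.
sample = (
    "^hiša/hiša<n><f><sg><nom>$ ^je/biti<vbser><pres><p3><sg>$ "
    "^xyzzy/*xyzzy$ ^velika/velik<adj><f><sg><nom>$"
)
print(coverage(sample))  # 75.0
```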

Testvoc

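For this test the Slovenian morphology was trimmed so that it contained only lemmas which have a translation in the bilingual dictionary. In the table, "With @" counts forms whose output carries the @ mark (lemma not found in the bilingual dictionary) and "With #" counts forms whose output carries the # mark (the form could not be generated on the Spanish side).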

POS	Total	Clean	With @	With #	Clean %
adj	261140	142942	505	117693	54.8
vblex	184455	180228	307	3920	97.7
n	86656	76544	5206	4906	88.4
det	6974	6681	0	293	95.8
prn	3350	3350	0	0	100
pr	1416	1272	144	0	89.9
np	1068	1068	0	0	100
adv	544	190	2	352	35
num	362	362	0	0	100
vbser	252	86	0	166	34.2
cnjcoo	18	8	0	10	44.5
cnjsub	9	0	0	9	0
cnjadv	6	6	0	0	100
vbmod	0	0	0	0	100
vbhaver	0	0	0	0	100
vaux	0	0	0	0	100
rel	0	0	0	0	100
preadv	0	0	0	0	100
ij	0	0	0	0	100
guio	0	0	0	0	100
cm	0	0	0	0	100
abbr	0	0	0	0	100
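The columns of the table can be reproduced from a list of translated forms with a sketch like the one below. It is an illustration, not the testvoc script used in the project: the function names and sample outputs are made up, and checking `@` before `#` when both marks appear is an arbitrary choice.

```python
from collections import Counter


def classify(translation):
    """Put one translated form into exactly one testvoc category."""
    if "@" in translation:   # lemma missing from the bilingual dictionary
        return "with_at"
    if "#" in translation:   # form could not be generated
        return "with_hash"
    return "clean"


def summarise(translations):
    """Count the categories and compute the percentage of clean forms."""
    counts = Counter(classify(t) for t in translations)
    total = len(translations)
    # The table reports 100 for classes with no forms at all, so do the same here.
    clean_pct = 100.0 * counts["clean"] / total if total else 100.0
    return {
        "total": total,
        "clean": counts["clean"],
        "with_at": counts["with_at"],
        "with_hash": counts["with_hash"],
        "clean_pct": clean_pct,
    }


# Hypothetical translator outputs, one per source form.
outputs = ["casa", "@hiška", "#perro", "gato"]
summary = summarise(outputs)
print(summary)
```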

Rules

apertium-sl-es.sl-es.t1x: 20
apertium-sl-es.sl-es.t2x: Future work
apertium-sl-es.sl-es.t3x: Future work

More will be added

...

Future work

I am looking forward to continuing work on the Slovenian-Spanish language pair. I would like to cover more lemmas in the bilingual dictionary by adding missing lemmas and alternative translations, to improve the transfer rules (moving on to t2x and t3x) and, if necessary, to improve the POS tagger.

I am aiming for a release-quality language pair.

Conclusion

I would like to express my gratitude to my mentor Jernej Vičič for encouraging me to take part in this project, for his patience, for all the meetings we had, for his willingness to help and for everything he taught me during this period; to Francis Tyers for his help on the Spanish side, for all his ideas and suggestions in the first part of the project and also for his patience; to Jimmy O'Regan for all his answers, advice and instructions about Slavic languages, for teaching me everything I know about transfer rules and for his great company during the late nights while I was working on my project; and to my co-mentor Gema Ramírez-Sánchez for helping us with the pending tests' translations and prepositions. Finally, thanks to all the other GSoC students and Apertium members who helped me on IRC, answering my questions and cheering me up.

I learned a lot from all of you and I hope I will have a chance to continue this adventure.


Difference between revisions of "Slovenian and Spanish/Final report"
