Difference between revisions of "Slovenian and Spanish/Final report"

Revision as of 18:41, 26 August 2011

Description

The goal of this project was to add a new language pair, Slovenian-Spanish, to the Apertium translation system. During the project the Slovenian monolingual dictionary had to be completely revised and almost completely rebuilt. The bilingual dictionary was created from scratch, and transfer rules were added according to the contrastive grammar (only some basic rules so far; the rest is left for future work). Additional work went into the POS tagger, which performed poorly at the beginning.

The work is described in more detail in the rest of this report.

Morphological analyser

Slovenian

As mentioned above, the Slovenian morphology was completely revised and almost completely rebuilt. The first version of the SL morphology was taken from the sl-mk language pair, generated by Jernej Vičič. The revision consisted of eliminating non-existent words (lemmas) and duplicates; the rebuild consisted of sorting the existing paradigm entries, adding new lemmas with dual meaning, and adding new paradigm entries for verbs and adverbs. Following Jimmy's instructions, nouns derived from verbs (in Slovene we call them "Glagolniki") have been tagged as vblex.ger, the gerund form of the verb. Verb aspect has been tagged as perfective or imperfective; this required further changes in the bilingual dictionary and in the transfer rules, because Spanish verbs do not have separate perfective and imperfective forms.
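The duplicate check described above can be illustrated with a minimal Python sketch. It is not one of the project's scripts: the sample dictionary, its lemmas and its paradigm names are made up, and the sketch assumes the usual Apertium .dix layout in which each `<e lm="...">` entry has a direct `<par n="..."/>` child.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature monolingual dictionary in Apertium .dix style.
# The lemmas and paradigm names are invented for illustration only.
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e lm="hiša"><i>hiš</i><par n="hiš/a__n"/></e>
    <e lm="miza"><i>miz</i><par n="hiš/a__n"/></e>
    <e lm="hiša"><i>hiš</i><par n="hiš/a__n"/></e>
  </section>
</dictionary>"""


def find_duplicate_entries(dix_xml: str) -> list[tuple[str, str]]:
    """Return the (lemma, paradigm) pairs that occur more than once."""
    root = ET.fromstring(dix_xml)
    counts = Counter()
    for entry in root.iter("e"):
        lemma = entry.get("lm")
        par = entry.find("par")  # direct <par> child, if any
        if lemma is None or par is None:
            continue  # skip entry shapes this sketch does not handle
        counts[(lemma, par.get("n"))] += 1
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = find_duplicate_entries(SAMPLE_DIX)
print(duplicates)
```

Entries that share a lemma but use different paradigms are deliberately not reported, since they can be legitimate homographs.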

I have to say I was lucky in this part of the project, because various online tools helped me with the Slovenian morphology.

The final version of the Slovenian morphology contains 20718 lemmas. I think it's a nice number and I'm very happy with this result.

Spanish

The Spanish morphology was taken from the en-es language pair. During the project additional lemmas were added, either taken from the es-ca language pair or entered manually. All proper names from the SL morphology were also added to the Spanish morphology. No major changes were made to the Spanish morphology.

The final version of the Spanish morphology contains 35683 lemmas.

Bilingual dictionary

The bilingual dictionary was built from scratch. Several approaches were used to generate bilingual entries (intersections between sl-it and it-es lemmas, intersections between sl-en and en-es lemmas, Google Translate and so on), but in the end every entry had to be checked manually and fixed where needed. During this period I wrote a lot of scripts to generate and check translations as quickly as possible. Most of them are written in PHP; the rest are in AWK. The scripts are in the dev/Scripts folder, so feel free to use them.

The main goal was to cover 10000 translations (lemmas/entries), and we did it. The final version of the bilingual dictionary contains 10823 entries (lemmas). I'm very happy with this result too.
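The intersection approach can be sketched roughly as follows. This is only an illustration in Python, not one of the PHP or AWK scripts from dev/Scripts; the function name `pivot_candidates` and the sample word pairs are made up.

```python
from collections import defaultdict


def pivot_candidates(src_to_pivot, pivot_to_tgt):
    """Join two bilingual word lists through their shared pivot lemmas.

    Both arguments are iterables of (left_lemma, right_lemma) pairs.
    Returns a dict mapping each source lemma to a sorted list of
    candidate target lemmas; source lemmas without candidates are omitted.
    """
    by_pivot = defaultdict(set)
    for pivot, tgt in pivot_to_tgt:
        by_pivot[pivot].add(tgt)

    candidates = defaultdict(set)
    for src, pivot in src_to_pivot:
        candidates[src].update(by_pivot.get(pivot, ()))

    return {src: sorted(tgts) for src, tgts in candidates.items() if tgts}


# Made-up sample data, not taken from the real dictionaries.
sl_en = [("hiša", "house"), ("pes", "dog"), ("ključ", "key"), ("miza", "table")]
en_es = [("house", "casa"), ("dog", "perro"), ("key", "llave"), ("key", "tecla")]

result = pivot_candidates(sl_en, en_es)
for lemma, translations in result.items():
    print(lemma, "->", ", ".join(translations))
```

In this sample "ključ" receives two candidates because the pivot word "key" is ambiguous, and "miza" receives none because its pivot word is missing from the second list. Both cases show why the generated entries still needed a manual check.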

POS tagger

The POS tagger was taken from the sl-mk language pair and did not work very well. We had to regenerate it from scratch with a new corpus.

Transfer rules

Transfer rules were the last part of the project. Since I had to learn almost everything about transfer rules, this was the most challenging part. We did not have much time, so we decided to write the basic rules and macros and to extend the existing macros a little, to make the upcoming work in t2x and t3x easier. Most of the time went into handling the "Masculine Animate" and "Masculine Inanimate" genders, which are not present in the Spanish morphology, and into the "Dual" -> "Plural" macro. These are just examples of the "problems" we had to take care of.

This part of the project was the most complicated one, and we did our best. We did not have enough time to solve all the problems at this stage, but at least we covered the basic rules and prepared the ground for further work in the future.
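The effect of the gender and number macros can be pictured with a small Python sketch. The real macros live in the t1x file and are written in Apertium's XML transfer language, so this only illustrates the mapping logic. The tag names "ma", "mi", "m", "du" and "pl" are assumptions and should be checked against the actual dictionaries.

```python
# Assumed tag names (verify against the real dictionaries):
#   "ma" = masculine animate, "mi" = masculine inanimate, "m" = masculine,
#   "du" = dual, "pl" = plural.
TAG_MAP = {
    "ma": "m",   # Spanish has no animate/inanimate split
    "mi": "m",
    "du": "pl",  # Spanish has no dual, so dual becomes plural
}


def adapt_tags(tags):
    """Map Slovenian-only tags to their closest Spanish counterparts.

    Tags without an entry in TAG_MAP are passed through unchanged.
    """
    return [TAG_MAP.get(tag, tag) for tag in tags]


print(adapt_tags(["n", "mi", "du"]))
print(adapt_tags(["adj", "f", "sg"]))
```

A lookup table keeps the mapping in one place, which is similar in spirit to putting it in a macro that several rules can call.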

Statistics

Testvoc

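For the testvoc the Slovenian morphology was cleaned so that it contains only lemmas which have an appropriate translation in the bilingual dictionary. In the table, "With @" counts forms whose output is marked with @ (no translation found in the bilingual dictionary) and "With #" counts forms marked with # (the target form could not be generated).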

POS      Total   Clean   With @  With #  Clean %
adj      261140  142942  505     117693  54.8
vblex    184455  180228  307     3920    97.7
n        86656   76544   5206    4906    88.4
det      6974    6681    0       293     95.8
prn      3350    3350    0       0       100
pr       1416    1272    144     0       89.9
np       1068    1068    0       0       100
adv      544     190     2       352     35
num      362     362     0       0       100
vbser    252     86      0       166     34.2
cnjcoo   18      8       0       10      44.5
cnjsub   9       -2      1       10      -22
cnjadv   6       6       0       0       100
vbmod    0       0       0       0       100
vbhaver  0       0       0       0       100
vaux     0       0       0       0       100
rel      0       0       0       0       100
preadv   0       0       0       0       100
ij       0       0       0       0       100
guio     0       0       0       0       100
cm       0       0       0       0       100
abbr     0       0       0       0       100
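The columns of the table above are related by simple arithmetic: in every row, Clean equals Total minus the @ and # counts. A minimal Python sketch of that calculation follows. The function name `clean_stats` is made up, and reporting 100 % for an empty category is an assumption that merely mirrors the table.

```python
def clean_stats(total, with_at, with_hash):
    """Return (clean, clean_percent) for one part-of-speech row.

    clean is the number of forms that carry neither an @ nor a # mark.
    An empty category is reported as 100 %, as in the table.
    """
    clean = total - with_at - with_hash
    percent = 100.0 * clean / total if total else 100.0
    return clean, percent


# Row "vblex" from the table: Total 184455, With @ 307, With # 3920.
clean, percent = clean_stats(184455, 307, 3920)
print(clean, f"{percent:.1f}")
```

Percentages computed this way may differ from the table in the last digit, depending on how the original figures were rounded.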

Rules

apertium-sl-es.sl-es.t1x: 20
apertium-sl-es.sl-es.t2x: Future work
apertium-sl-es.sl-es.t3x: Future work

More will be added

...

Future work

I plan to continue working on the sl-es language pair in my free time. I hope I'll manage to carry out the plan.

Conclusion

First of all, I would like to thank my mentor Jernej for his patience and for all the meetings we had. He helped me a lot with this project, and I believe he did a great job, since I learned so many things from this adventure. I would also like to express my gratitude to the other members of the Apertium staff who helped me with this project, and to my co-mentor Gema Ramírez-Sánchez, who helped us with the pending test and some difficult translations. I want to thank Francis Tyers for all the ideas and suggestions (some of them crazy) he gave me in the first part of the project, and a big thanks goes to Jimmy O'Regan for all the patience he had with me and my "n00bish" mistakes and questions about transfer rules and everything else concerning Slavic languages. I learned a lot from all of you, and I hope I'll have a chance to continue this adventure with all of you.