Slovenian and Spanish/Final report

Revision as of 20:29, 30 August 2011 by Shraier (talk | contribs) (→‎Conclusion)

Description

The goal of this project was to implement a new language pair, Slovenian-Spanish, in the Apertium translation system. During the project, a complete revision and an almost complete remake of the Slovenian monolingual dictionary proved necessary. The bilingual dictionary was created from scratch, and transfer rules were added according to the contrastive grammar (some basic rules were added; the rest is left for future work). Additional work was done on the POS tagger, which was producing erroneous results, mostly due to tagset differences.

This report presents the work in more detail.

Morphological analyser

Slovenian

As mentioned above, the Slovenian morphology was completely revised and almost completely remade. The first version of the SL morphology was taken from the sl-mk language pair, generated by Jernej Vičič. The revision consisted of eliminating non-existent words (lemmas) and duplicates; the remake consisted of sorting existing paradigm entries, adding new lemmas with dual meaning, and adding new paradigm entries for verbs and adverbs. Following Jimmy's instructions, nouns derived from verbs (in Slovene we call them "glagolniki") have been tagged as vblex.ger, which is the gerund form of the verb. Verb aspect has been tagged as perfective or imperfective; this required further changes in the bilingual dictionary and in the transfer rules, because Spanish does not have perf/imperf forms.

I have to say I was lucky in this part of the project, because several online tools helped me with the Slovenian morphology.

The final version of the Slovenian morphology contains 20724 lemmas. I think it's a nice number and I'm very happy with this result.

Spanish

The Spanish morphology was taken from the en-es language pair. During the project, additional lemmas were added, either taken from the es-ca language pair or entered manually. All proper names from the SL morphology were also added to the Spanish morphology. No major changes were made to the Spanish morphology.

The final version of the Spanish morphology contains 35683 lemmas.

Bilingual dictionary

The bilingual dictionary was built from scratch. Different approaches were used to generate bilingual entries (intersections between sl-it and it-es lemmas, intersections between sl-en and en-es lemmas, Google Translate, manual translation, etc.), but in the end all entries had to be checked manually and, where needed, fixed. During this period I wrote many scripts of my own to generate and check translations as fast as possible. Most of them are written in PHP; the rest are in AWK. The scripts are in the dev/Scripts folder, so feel free to use them.
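The intersection approach mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity, not in the PHP/AWK of the original scripts, and it is not a reconstruction of them. The function name `pivot_candidates` and the toy word lists are hypothetical; the idea is only that a Slovenian lemma and a Spanish lemma become a candidate pair when both are linked to the same pivot lemma (here English).

```python
from collections import defaultdict


def pivot_candidates(src_to_pivot, pivot_to_tgt):
    """Compose two bilingual lemma lists through a shared pivot language.

    Both arguments are iterables of (lemma, lemma) pairs.
    Returns a dict mapping each source lemma to a sorted list of
    candidate target lemmas reached through at least one pivot lemma.
    """
    # Index the second dictionary by its pivot-side lemma.
    by_pivot = defaultdict(set)
    for pivot, tgt in pivot_to_tgt:
        by_pivot[pivot].add(tgt)

    # Follow every source -> pivot link into the indexed dictionary.
    candidates = defaultdict(set)
    for src, pivot in src_to_pivot:
        for tgt in by_pivot.get(pivot, ()):
            candidates[src].add(tgt)

    return {src: sorted(tgts) for src, tgts in candidates.items()}


# Toy data for illustration only (not taken from the real dictionaries).
sl_en = [("hiša", "house"), ("pes", "dog"), ("mačka", "cat")]
en_es = [("house", "casa"), ("dog", "perro"), ("dog", "can")]

print(pivot_candidates(sl_en, en_es))
```

Lemmas with no shared pivot entry ("mačka" in the toy data) produce no candidate, and ambiguous pivot lemmas produce several, which is one reason the generated entries still need a manual check.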

The main goal was to cover 10000 translations (lemmas/entries), and we reached it. The final version of the bilingual dictionary contains 10523 entries (lemmas). I'm very happy with this result too.

POS tagger

The POS tagger was taken from the sl-mk language pair, but it was producing erroneous results, mostly due to tagset differences. We had to regenerate it from scratch with a new corpus.

Transfer rules

Transfer rules were the last part of the project. Since I had to learn almost everything about transfer rules, it was the most challenging part of the project. We did not have much time, so we decided to write the basic rules/macros and to make the existing macros somewhat more elaborate, so that the upcoming work in t2x and t3x would be easier. Most of the time we had to deal with the "Masculine Animate" and "Masculine Inanimate" genders, which are not present in the Spanish morphology, and with the "Dual" -> "Plural" macro. These are just examples of the "problems" we had to take care of.
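The gender and number adjustments described above amount to a tag mapping from the Slovenian side to the Spanish side. The sketch below shows that idea in Python; it is not the actual t1x macro code. The tag names `ma`, `mi`, `m`, `du` and `pl` follow common Apertium symbol conventions and are an assumption about this pair's tagset, and `map_tags` is a hypothetical helper.

```python
# Assumed tag names (common Apertium conventions, not confirmed by the report):
#   ma = masculine animate, mi = masculine inanimate, m = masculine
#   du = dual, pl = plural
TAG_MAP = {
    "ma": "m",   # Spanish has a single masculine gender
    "mi": "m",
    "du": "pl",  # Spanish has no dual, so dual becomes plural
}


def map_tags(tags):
    """Return the tag list with Slovenian-only tags replaced by Spanish ones."""
    return [TAG_MAP.get(tag, tag) for tag in tags]


# A masculine animate noun in the dual, nominative case.
print(map_tags(["n", "ma", "du", "nom"]))
```

Tags without a mapping pass through unchanged, so anything else that has no Spanish counterpart (case, for example) would need its own handling in the real rules.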

This part of the project was the most complicated one, and we did our best. We did not have enough time to solve all problems at this stage, but at least we covered the basic rules and prepared the ground for additional work in the future.

Statistics

Dictionaries

  • apertium-sl-es.sl.dix: 20724 lemmata, 845334 surface forms
  • apertium-sl-es.sl-es.dix: 10523 lemmata

Coverage

Corpus                   Num. Words   % Average   STDEV
MULTEXT-EAST (Orwell)    104482       93.08%      0.29%
OPUS (subtitles)         2562969      89.515%     0.23%
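Coverage here is the share of corpus tokens that receive at least one analysis. A minimal sketch of that calculation follows, assuming the text has already been run through the morphological analyser and is in the usual Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token has an analysis starting with `*`. The sample string and its tags are made up for illustration, and escaped characters in the stream are ignored. The report does not say how the average and standard deviation were obtained, so that part is not reproduced.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return the percentage of lexical units that have a known analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An analysis starting with '*' marks a word the analyser does not know.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Illustrative analyser output: three known words and one unknown word.
sample = (
    "^hiša/hiša<n><f><sg><nom>$ "
    "^xyzzy/*xyzzy$ "
    "^je/biti<vbser><pres><p3><sg>$ "
    "^velika/velik<adj><f><sg><nom>$"
)

print(f"{coverage(sample):.2f}%")
```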

Testvoc

For the testvoc, the Slovenian morphology was cleaned so that it contained only lemmas that have an appropriate translation in the bilingual dictionary.
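The cleaning step can be sketched as a simple filter. The Python below is only an illustration under stated assumptions: entries are represented as (lemma, part-of-speech) tuples, the function name `trim_lemmas` is hypothetical, and the real work was done on the .dix files, whose handling is not shown here.

```python
def trim_lemmas(monolingual, bilingual_source_side):
    """Keep only monolingual entries that also appear in the bilingual dictionary.

    Both arguments are iterables of (lemma, pos) tuples; the original
    order of the monolingual entries is preserved.
    """
    translatable = set(bilingual_source_side)
    return [entry for entry in monolingual if entry in translatable]


# Toy entries for illustration only.
sl_entries = [("hiša", "n"), ("pes", "n"), ("teči", "vblex"), ("lep", "adj")]
bidix_sl_side = [("hiša", "n"), ("teči", "vblex")]

print(trim_lemmas(sl_entries, bidix_sl_side))
```

Matching on the lemma together with its part of speech, and not on the lemma alone, avoids keeping an entry whose only translation belongs to a different word class.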

POS       Total     Clean     With @   With #   Clean %
adj       261140    142942    505      117693   54.8
vblex     184455    180228    307      3920     97.7
n         86656     76544     5206     4906     88.4
det       6974      6681      0        293      95.8
prn       3350      3350      0        0        100
pr        1416      1272      144      0        89.9
np        1068      1068      0        0        100
adv       544       190       2        352      35
num       362       362       0        0        100
vbser     252       86        0        166      34.2
cnjcoo    18        8         0        10       44.5
cnjsub    9         0         0        9        0
cnjadv    6         6         0        0        100
vbmod     0         0         0        0        100
vbhaver   0         0         0        0        100
vaux      0         0         0        0        100
rel       0         0         0        0        100
preadv    0         0         0        0        100
ij        0         0         0        0        100
guio      0         0         0        0        100
cm        0         0         0        0        100
abbr      0         0         0        0        100

Due to a lack of time I had to leave vbser/cnjcoo/cnjsub for the future. As for adjectives, bilingual material was very hard to come by. We have to move on to t2x and t3x to generate the right translations.
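The testvoc table can be sanity-checked mechanically: in each row, Clean, With @ and With # should add up to Total, and Clean % should be close to Clean / Total. In Apertium's usual testvoc convention, @ marks a lemma that was not found in the bilingual dictionary and # marks a form that the target-language generator could not produce. The sketch below runs both checks on the non-empty rows, with the numbers copied from the table. The helper names are hypothetical, and the 0.1 tolerance is an assumption, chosen because the reported percentages differ slightly in the last digit from a plain recomputation.

```python
# (pos, total, clean, with_at, with_hash, reported_clean_percent)
ROWS = [
    ("adj",    261140, 142942, 505,  117693, 54.8),
    ("vblex",  184455, 180228, 307,  3920,   97.7),
    ("n",      86656,  76544,  5206, 4906,   88.4),
    ("det",    6974,   6681,   0,    293,    95.8),
    ("prn",    3350,   3350,   0,    0,      100.0),
    ("pr",     1416,   1272,   144,  0,      89.9),
    ("np",     1068,   1068,   0,    0,      100.0),
    ("adv",    544,    190,    2,    352,    35.0),
    ("num",    362,    362,    0,    0,      100.0),
    ("vbser",  252,    86,     0,    166,    34.2),
    ("cnjcoo", 18,     8,      0,    10,     44.5),
    ("cnjsub", 9,      0,      0,    9,      0.0),
    ("cnjadv", 6,      6,      0,    0,      100.0),
]


def counts_add_up(total, clean, with_at, with_hash):
    """True when the three outcome counts account for every tested form."""
    return clean + with_at + with_hash == total


def clean_percent(total, clean):
    """Share of clean forms; an empty category counts as 100, as in the table."""
    return 100.0 * clean / total if total else 100.0


for pos, total, clean, with_at, with_hash, reported in ROWS:
    ok = counts_add_up(total, clean, with_at, with_hash)
    recomputed = clean_percent(total, clean)
    close = abs(recomputed - reported) < 0.1  # assumed tolerance
    print(f"{pos:8} sums_ok={ok} recomputed={recomputed:.2f} "
          f"reported={reported} close={close}")
```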

Rules

apertium-sl-es.sl-es.t1x: 20
apertium-sl-es.sl-es.t2x: Future work
apertium-sl-es.sl-es.t3x: Future work

Future work

I am looking forward to continuing work on the Slovenian-Spanish language pair. I would like to cover more lemmas in the bilingual dictionary by adding missing lemmas and alternative translations, to improve the transfer rules by moving on to t2x and t3x, and, if necessary, to improve the POS tagger.

I am aiming for a release-quality language pair.

In cooperation with Jernej, we are planning to construct two new language pairs:
  • Slovenian-Serbian, in cooperation with Hrvoje Peradin
  • Slovenian-Italian

Conclusion

I would like to express my gratitude to my mentor Jernej Vičič for encouraging me to take part in this project, for his patience, for all the meetings we had, for his willingness to help and for everything he taught me in this period; to Francis Tyers for his help on the Spanish side, for all his ideas and suggestions in the first part of the project, and also for his patience; to Jimmy O'Regan for all his answers, advice and instructions about Slavic languages, for teaching me everything I know about transfer rules, and for his great company late at night while I was working on my project; and to my co-mentor Gema Ramírez-Sánchez for helping us with the translations of pending tests and with prepositions. Finally, thanks to all the other GSoC students and Apertium members who helped me on IRC, answering my questions and cheering me up in various ways.

I learned a lot from all of you and I hope I will have a chance to continue this adventure.