Slovenian and Spanish/Final report
Revision as of 10:33, 29 August 2011
Description
The goal of this project was to implement a new language pair, Slovenian-Spanish, in the Apertium translation system. During the project the Slovenian monolingual dictionary had to be completely revised and almost completely remade. The bilingual dictionary was created from scratch, and transfer rules were added according to the contrastive grammar (some basic rules were added; the rest is left for future work). Additional work was done on the POS tagger, which was producing erroneous results, mostly due to tagset differences.
The work is described in more detail in this report.
Morphological analyser
Slovenian
As mentioned above, the Slovenian morphology was completely revised and almost completely remade. The first version of the SL morphology was taken from the sl-mk language pair and had been generated by Jernej Vičič. The revision consisted of eliminating non-existent words (lemmas) and duplicates; the remake consisted of sorting the existing paradigm entries, adding new lemmas with dual meaning, and adding new paradigm entries for verbs and adverbs. Following Jimmy's instructions, nouns derived from verbs (in Slovene we call them "Glagolniki") have been tagged as vblex.ger, which is a gerund form of the verb. The aspect of verbs has been tagged as perfective or imperfective; further changes were needed in the bilingual dictionary and in the transfer rules because Spanish does not have perfective/imperfective forms.
I have to say I was lucky in this part of the project because I had several internet tools which helped me with the Slovenian morphology.
The final version of the Slovenian morphology contains 20724 lemmas. I think it's a nice number and I'm very happy with this result.
Spanish
The Spanish morphology was taken from the en-es language pair. During the project additional lemmas were added, either taken from the es-ca language pair or entered manually. All proper names from the SL morphology were also added to the Spanish morphology. No major changes were made to the Spanish morphology.
The final version of the Spanish morphology contains 35683 lemmas.
Bilingual dictionary
The bilingual dictionary was built from scratch. Several approaches were used to generate bilingual entries (intersections between sl-it and it-es lemmas, intersections between sl-en and en-es lemmas, Google Translate, manual translation, etc.), but in the end every entry had to be checked manually and fixed where needed. During this period I wrote many scripts to generate and check translations as fast as possible. Most of them are written in PHP, the rest in AWK. The scripts are placed in the dev/Scripts folder, so feel free to use them.
The main goal was to cover 10000 translations (lemmas/entries) and we did it. The final version of the bilingual dictionary contains 10823 entries (lemmas). I'm very happy with this result too.
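The intersection approach mentioned above can be illustrated with a minimal sketch. This is not one of the PHP/AWK scripts from dev/Scripts; the function name `pivot_candidates` and the sample lemma pairs are made up for illustration. It assumes each dictionary is available as a list of (source lemma, target lemma) pairs and joins them on the shared Italian lemma. The output is only a list of candidates, which would still need the manual check described above.

```python
from collections import defaultdict

def pivot_candidates(sl_it, it_es):
    """Join sl-it and it-es lemma pairs on the shared pivot (Italian) lemma.

    Both arguments are iterables of (source_lemma, target_lemma) tuples.
    Returns a dict mapping each Slovenian lemma to a sorted list of
    Spanish candidate lemmas.
    """
    it_to_es = defaultdict(set)
    for it_lemma, es_lemma in it_es:
        it_to_es[it_lemma].add(es_lemma)

    candidates = defaultdict(set)
    for sl_lemma, it_lemma in sl_it:
        candidates[sl_lemma] |= it_to_es.get(it_lemma, set())

    # Drop Slovenian lemmas for which the pivot produced nothing.
    return {sl: sorted(es) for sl, es in candidates.items() if es}

# Hypothetical sample data, for illustration only.
sl_it = [("hiša", "casa"), ("pes", "cane"), ("knjiga", "libro")]
it_es = [("casa", "casa"), ("cane", "perro"), ("gatto", "gato")]

result = pivot_candidates(sl_it, it_es)
print(result)
```

The same function works for the sl-en/en-es intersection by passing those pair lists instead.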
POS tagger
The POS tagger was initially taken from the sl-mk language pair, but it was producing erroneous results, mostly due to tagset differences. We had to regenerate it from scratch with a new corpus.
Transfer rules
Transfer rules were the last part of the project. Since I had to learn almost everything about transfer rules, it was the most challenging part of the project. We did not have much time, so we decided to write the basic rules/macros and to extend the existing macros a little, to make the upcoming work in t2x and t3x easier. Most of the time we had to deal with the "Masculine Animate" and "Masculine Inanimate" genders, which are not present in the Spanish morphology, and with the "Dual" -> "Plural" macro. These are just examples of the "problems" we had to take care of.
This part of the project was the most complicated one and we did our best. We did not have enough time to solve all problems at this stage, but at least we covered the basic rules and prepared the ground for the additional work that will be done in the future.
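The two mappings described above can be sketched as a simple tag substitution. The real work is done by macros in the t1x file, not by Python; this is only an illustration of the idea. The tag names `ma`, `mi`, `m`, `du` and `pl` are assumed here to be the symbols used for masculine animate, masculine inanimate, masculine, dual and plural, and `map_tags` is a hypothetical helper.

```python
# Assumed tag names: ma/mi = masculine animate/inanimate, du = dual.
# Spanish has neither distinction, so they collapse to m and pl.
TAG_MAP = {
    "ma": "m",
    "mi": "m",
    "du": "pl",
}

def map_tags(tags):
    """Replace Slovenian-only tags with their Spanish counterparts."""
    return [TAG_MAP.get(tag, tag) for tag in tags]

mapped = map_tags(["n", "ma", "du"])
print(mapped)
```

Tags that have no entry in the map pass through unchanged, so the same helper can be applied to any lexical unit.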
Statistics
Dictionaries
apertium-sl-es.sl.dix: 20724 lemmata, 549817 surface forms
apertium-sl-es.sl-es.dix: (unique: 9985, total: 13032)
Coverage
Corpus | Num. Words | % Average | STDEV |
---|---|---|---|
MULTEXT-EAST (Orwell) | 104482 | 93.08% | 0.29% |
OPUS (subtitles) | 2562969 | 89.515% | 0.23% |
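For readers unfamiliar with the measure, here is a minimal sketch of a naive coverage calculation. It assumes the corpus has been run through the morphological analyser and that the output uses the Apertium stream format, where each lexical unit is written as `^surface/analysis$` and an unknown word is marked with `*` (for example `^xyz/*xyz$`). The report does not say how the average and standard deviation were obtained; the sketch assumes they are computed over per-sample coverage values. The function name and the sample strings are hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re
import statistics

# One lexical unit: everything between ^ and $.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")

def coverage(analysed_text):
    """Percentage of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)

# Hypothetical analyser output, for illustration only.
sample_a = "^hiša/hiša<n><f><sg><nom>$ ^xyz/*xyz$"
sample_b = "^pes/pes<n><ma><sg><nom>$"

values = [coverage(sample_a), coverage(sample_b)]
average = statistics.mean(values)
stdev = statistics.stdev(values)
print(values, average, stdev)
```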
Testvoc
For the testvoc the Slovenian morphology was cleaned so that it contained only lemmas which have an appropriate translation in the bilingual dictionary.
POS | Total | Clean | With @ | With # | Clean % |
---|---|---|---|---|---|
adj | 261140 | 142942 | 505 | 117693 | 54,8 |
vblex | 184455 | 180228 | 307 | 3920 | 97,7 |
n | 86656 | 76544 | 5206 | 4906 | 88,4 |
det | 6974 | 6681 | 0 | 293 | 95,8 |
prn | 3350 | 3350 | 0 | 0 | 100 |
pr | 1416 | 1272 | 144 | 0 | 89,9 |
np | 1068 | 1068 | 0 | 0 | 100 |
adv | 544 | 190 | 2 | 352 | 35 |
num | 362 | 362 | 0 | 0 | 100 |
vbser | 252 | 86 | 0 | 166 | 34,2 |
cnjcoo | 18 | 8 | 0 | 10 | 44,5 |
cnjsub | 9 | 0 | 0 | 9 | 0 |
cnjadv | 6 | 6 | 0 | 0 | 100 |
vbmod | 0 | 0 | 0 | 0 | 100 |
vbhaver | 0 | 0 | 0 | 0 | 100 |
vaux | 0 | 0 | 0 | 0 | 100 |
rel | 0 | 0 | 0 | 0 | 100 |
preadv | 0 | 0 | 0 | 0 | 100 |
ij | 0 | 0 | 0 | 0 | 100 |
guio | 0 | 0 | 0 | 0 | 100 |
cm | 0 | 0 | 0 | 0 | 100 |
abbr | 0 | 0 | 0 | 0 | 100 |
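The columns of the table above can be reproduced with a short sketch. In Apertium output, `@` conventionally marks a lemma that was not found in the bilingual dictionary and `#` marks a form that could not be generated. The sketch assumes one translated form per string; the function names and sample data are hypothetical, and a form containing both markers is counted under `@`, which is an arbitrary choice. For a category with no forms it returns 100, which matches how the empty rows are reported in the table.

```python
from collections import Counter

def classify(translation):
    """Assign a translated form to one testvoc column."""
    if "@" in translation:
        return "with_at"      # missing from the bilingual dictionary
    if "#" in translation:
        return "with_hash"    # generation failed
    return "clean"

def summarise(translations):
    """Return per-column counts and the clean percentage."""
    counts = Counter(classify(t) for t in translations)
    total = sum(counts.values())
    # Empty categories are reported as 100, as in the table.
    clean_pct = 100.0 * counts["clean"] / total if total else 100.0
    return counts, clean_pct

# Hypothetical testvoc output, for illustration only.
forms = ["casa", "@hiša", "#perro", "libro"]
counts, clean_pct = summarise(forms)
print(dict(counts), clean_pct)
```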
Rules
apertium-sl-es.sl-es.t1x: 20
apertium-sl-es.sl-es.t2x: Future work
apertium-sl-es.sl-es.t3x: Future work
More will be added
...
Future work
I look forward to continuing my work on the Slovenian-Spanish language pair. I would like to cover more lemmas in the bilingual dictionary by adding missing lemmas and alternative translations, to improve the transfer rules by moving on to t2x and t3x, and, if necessary, to improve the POS tagger.
I am aiming for a release-quality language pair.
Conclusion
I would like to express my gratitude to my mentor Jernej Vičič for encouraging me to take part in this project, for his patience, for all the meetings we had, for his willingness to help and for everything he taught me during this period; to Francis Tyers for his help on the Spanish side, for all his ideas and suggestions in the first part of the project and also for his patience; to Jimmy O'Regan for all his answers, advice and instructions about Slavic languages, for teaching me everything I know about transfer rules and for his great company late at night while I was working on my project; and to my co-mentor Gema Ramírez-Sánchez for helping us with the translations of the pending tests and with prepositions. Finally, thanks to all the other GSoC students and Apertium members who helped me on IRC, answering my questions and cheering me up in different ways.
I learned a lot from all of you and I hope I will have a chance to continue this adventure.