Swedish and Danish
Press release
From: Jacob Nordfalk <jacob.nordfalk@gm...> - 2009-10-12 13:29 http://sourceforge.net/mailarchive/message.php?msg_name=20cf28cd0910120629o572ede0i13542ee2737f2deb%40mail.gmail.com
Første Open Source maskinoversættelse mellem svensk og dansk
First open source machine translation between Swedish and Danish
Dansk (English version below)
Vi har netop frigivet version 0.5 af svensk-dansk til open source maskinoversættelsessystemet Apertium.
Det er det første frie maskinoversættelsessystem mellem svensk og dansk.
Det kan allerede nu bruges fra http://apertium.org/, men forhåbentlig vil fællesskabet omkring fri software tage det til sig og snart gøre det tilgængeligt på bl.a. alle Linux-arbejdsstationer.
Til udviklingen har vi brugt et antal frit tilgængelige kilder, bl.a. open source stavekontrollen Aspell, Den stora svenska ordlistan, http://dsso.se og den svenske og danske Wikipedia og Wiktionary.
Udviklingen er sponsoreret af Google Summer of Code (GSOC) og foretaget
af student Michael Kristensen. Mentorer på projektet er Francis Tyers
(Universitat d'Alacant og Prompsit Language Engineering) og
Jacob Nordfalk (Ingeniørhøjskolen i København).
For nærmere oplysninger om udviklingen af sprogparret, se http://wiki.apertium.org/wiki/Swedish_and_Danish
For mere information om apertium og GSOC, se http://socghop.appspot.com/org/home/google/gsoc2009/apertium.
Tekniske specifikationer
Ordbog | Ordrødder |
---|---|
Svensk morfologisk ordbog | 5.230 |
Tosproget ordbog | 6.854 |
Dansk morfologisk ordbog | 10.694 |
Dækningen på Wikipedia-tekst er p.t. 72% og korpuset Europarl 80%.
Vi anvender 1-trins "shallow transfer" med 17 transferregler.
Vi har foretaget en sammenlignende vurdering med andre tilgængelige maskinoversættelsessystemer på 65 sætninger fra Wikipedia.
Resultaterne findes nedenfor (lavest tal er bedst)
System | Edit distance | WER (Word Error Rate) |
---|---|---|
Apertium | 353 | 31 % |
Gramtrans | 308 | 26 % |
Google | 415 | 35 % |
Yderligere oplysninger kan findes i artiklen "Shallow-transfer rule-based
machine translation for Swedish to Danish" som vi vil præsentere på
First International Workshop on Free/Open-Source Rule-Based Machine
Translation (http://xixona.dlsi.ua.es/freerbmt09/).
For mere information, kontakt Jacob Nordfalk, Ingeniørhøjskolen i
København (jano@ih...), telefon 26206512.
English
A new language pair, Swedish-Danish, has been released for the free and open-source Apertium machine translation engine.
It's the first open source machine translator for Swedish and Danish.
The pair is immediately available for testing at http://apertium.org/, and we hope it will be adopted by the free-software community and soon become available on, among others, the Linux desktop.
To build the system we used a number of freely available resources, including the high-coverage spell-checker dictionaries from the Aspell project, Den stora svenska ordlistan (http://dsso.se), and the Swedish and Danish Wikipedias and Wiktionaries.
This language pair was developed as part of a Google Summer of Code (GSoC)
project by Michael Kristensen, mentored by Francis Tyers (Universitat
d'Alacant and Prompsit Language Engineering) and Jacob Nordfalk
(Ingeniørhøjskolen i København).
For more information on Apertium and GSoC, see
http://socghop.appspot.com/org/home/google/gsoc2009/apertium.
Many thanks to Thyge Larsen for his assistance with post-editing and evaluation.
For more details on development and the language pair, see http://wiki.apertium.org/wiki/Swedish_and_Danish
Technical details
Dictionary | Lemmas |
---|---|
Swedish monolingual dictionary | 5,230 |
Bilingual dictionary | 6,854 |
Danish monolingual dictionary | 10,694 |
Measured coverage is currently 72 % on Wikipedia text and 80 % on the Europarl corpus.
The system uses 1-stage shallow transfer with 17 transfer rules.
We have compared the system against other available MT systems on 65 sentences from Wikipedia. The results are given below (lower is better):
System | Edit distance | WER (Word Error Rate) |
---|---|---|
Apertium | 353 | 31 % |
Gramtrans | 308 | 26 % |
Google | 415 | 35 % |
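The WER column can be read as a word-level edit distance normalised by reference length. The following is a minimal sketch of that calculation, not the evaluation script behind the figures above: it assumes whitespace tokenisation, that each system's output is compared against a reference such as its post-edited version, and that WER is total edits divided by total reference words. The function names are illustrative only.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance counted in words (insert, delete, substitute)."""
    hyp = hypothesis.split()
    ref = reference.split()
    # previous[j] holds the distance between the hypothesis words seen so far
    # and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(
                previous[j] + 1,          # delete the hypothesis word
                current[j - 1] + 1,       # insert the reference word
                previous[j - 1] + cost,   # keep or substitute
            ))
        previous = current
    return previous[-1]


def word_error_rate(pairs):
    """pairs: list of (hypothesis, reference) sentence pairs.

    Returns total edits divided by total reference words (an assumed
    definition; the press release does not spell out the formula).
    """
    edits = sum(word_edit_distance(h, r) for h, r in pairs)
    reference_words = sum(len(r.split()) for _, r in pairs)
    return edits / reference_words
```

For instance, `word_edit_distance("Vil du køre ind bilen", "Vil du køre bilen ind")` gives 2, so that single sentence would have a WER of 2/5 = 40 %.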
Further details can be found in the article "Shallow-transfer rule-based
machine translation for Swedish to Danish" to be presented during the
First International Workshop on Free/Open-Source Rule-Based Machine
Translation (http://xixona.dlsi.ua.es/freerbmt09/).
For more information, please contact Jacob Nordfalk, Ingeniørhøjskolen i
København (jano@ih...), phone 26206512.
-- Jacob Nordfalk एस्पेरान्तो के हो? http://www.esperanto.org.np/. Memoraĵoj de KEF -. http://kef.saluton.dk/memorajoj/
Swedish and Danish
Swedish and Danish are closely related languages. They differ mainly at the morphological level: the core lexicon is very similar, with systematic differences, and so is the syntax. There are some syntactic differences, though.
Syntax
Particle order
Swedish keeps the verb together with a conjoined adverbial particle, where Danish separates them.
- (sv) Vill du köra in bilen
- (da) Vil du køre bilen ind
Swedish moves the reflexive pronoun sig along with the verb to V2 position, where Danish leaves it behind:
- (sv) I går tvättade sig Peter äntligen
- (da) I går vaskede Peter sig endelig
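The particle difference at the start of this section amounts to a small reordering rule. The sketch below is a toy illustration in Python, not an Apertium transfer rule (those are written in XML). The tag names `verb`, `particle`, `det`, `adj` and `noun` and the function name are assumptions made for the example, and for simplicity it is applied to words that have already been translated to Danish.

```python
def move_particle_after_object(tokens):
    """Rewrite verb + particle + NP as verb + NP + particle.

    tokens: list of (word, tag) pairs, using hypothetical tag names.
    """
    out = []
    i = 0
    while i < len(tokens):
        is_verb_particle = (
            i + 1 < len(tokens)
            and tokens[i][1] == "verb"
            and tokens[i + 1][1] == "particle"
        )
        if is_verb_particle:
            # The object NP is taken to be optional determiners/adjectives
            # followed by a noun, directly after the particle.
            j = i + 2
            while j < len(tokens) and tokens[j][1] in ("det", "adj"):
                j += 1
            if j < len(tokens) and tokens[j][1] == "noun":
                out.append(tokens[i])              # verb
                out.extend(tokens[i + 2 : j + 1])  # object NP
                out.append(tokens[i + 1])          # particle moved to the end
                i = j + 1
                continue
        out.append(tokens[i])
        i += 1
    return out


sentence = [("Vil", "modal"), ("du", "pron"), ("køre", "verb"),
            ("ind", "particle"), ("bilen", "noun")]
reordered = " ".join(word for word, _ in move_particle_after_object(sentence))
```

Here `reordered` is "Vil du køre bilen ind". Sequences where the particle is not followed by a noun phrase are left untouched.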
NP structure
Danish and Swedish have different NP patterns.
- (sv) Vita huset
- (da) Det hvide hus
- (nb) Det hvite huset
- (nn) Det kvite huset
In most NPs, Swedish has both the determiner den and the definite form of the noun, whereas Danish never combines the two. Here Nynorsk (nn) patterns with Swedish, while Bokmål (nb) allows both the Swedish and the Danish pattern. (Beware of possibly non-idiomatic word choices in the da and sv examples; the patterns themselves are correct.)
- (sv) Den stora utmaningen är att göra det rätta. Utmaningen är svår.
- (da) Den store udfordring er at gøre det rette. Udfordringen er vanskelig.
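The difference in the last pair (den stora utmaningen → den store udfordring, but Utmaningen → Udfordringen) can be sketched as a feature change on the noun. This is a hedged illustration only: the token representation (dicts with `lemma`, `pos` and `definite`) and the function name are assumptions, not Apertium's actual data format, and the sketch does not cover cases like Vita huset, where Danish has to add a determiner.

```python
def drop_double_definiteness(tokens):
    """Make a noun indefinite when a determiner precedes it in the same NP.

    tokens: list of dicts with 'lemma', 'pos' and, for nouns, 'definite'.
    Returns modified copies; the input is left unchanged.
    """
    result = [dict(tok) for tok in tokens]
    determiner_open = False
    for tok in result:
        if tok["pos"] == "det":
            determiner_open = True
        elif tok["pos"] == "adj" and determiner_open:
            # Adjectives between determiner and noun keep the NP open.
            continue
        elif tok["pos"] == "noun":
            if determiner_open:
                # Danish cannot combine a determiner with the definite suffix.
                tok["definite"] = False
            determiner_open = False
        else:
            determiner_open = False
    return result


with_determiner = drop_double_definiteness([
    {"lemma": "den", "pos": "det"},
    {"lemma": "stor", "pos": "adj"},
    {"lemma": "utmaning", "pos": "noun", "definite": True},
])
bare_noun = drop_double_definiteness([
    {"lemma": "utmaning", "pos": "noun", "definite": True},
])
```

In `with_determiner` the noun ends up indefinite (udfordring), while in `bare_noun` it stays definite (udfordringen).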
Existential sentences
Swedish can use "det" as an equivalent to the English "there", where Danish prefers "der":
- (sv) Det kommer en bil
- (da) Der kommer en bil
Relative clauses
In N + RC constructions, where the relativised constituent is subject, Danish uses either som or der as relativiser, whereas Swedish has som:
- (da) manden som er her (the man who is here)
- (da) manden der er her (the man who is here)
- (sv) mannen som är här
When the relativised constituent is the object, on the other hand, the relativiser must be som, also in Danish:
- (da) manden som jeg så (the man who I saw)
- (sv) mannen som jag såg (the man who I saw)
Passives
- (sv) Ytterligare prov kommer att tas under måndagen. (further tests will be taken during Monday)
- (da) Yderligere prøve vil blive taget i løbet af mandagen
Grammatical words
Modal verbs
Danish and Swedish use more or less the same set of modal verbs, but with different meanings. For example, permission ("allow") is expressed with få in Swedish and må in Danish:
- (sv) Man får inte röka här
- (da) Man må ikke ryge her
- (en) One is not allowed to smoke here
Some verbs also take a different perfect auxiliary (listed under Modal below), with Swedish ha corresponding to Danish være:
Swedish | Danish | Modal | Gloss | Example |
---|---|---|---|---|
åka | tage | ha → være | to go | |
föra | føre | ha → være | to take | Två personer har förts → To personer er ført |
komma | komme | ha → være | to come | Min fru har kommit hem → Min kone er kommet hjem |
There is a list of the 250 most frequent verbs with the modal they take here.
Morphology
Supine
Resources
- http://spraakbanken.gu.se/sal/eng/ -- GPL morph. for Swedish
- http://w3.msi.vxu.se/~nivre/research/Talbanken05.html (A 300,000-word treebank in XML. All words are tagged with PAROLE-style tags, so it should be easy to build a morphological analyser and a PoS tagger from it; the authors are likely to be happy to let us use it if we cite them.)
- http://www.isv.cbs.dk/~mbk/treebank/ (Danish treebank, 100,000 words, as above, under the GPL)
- http://www.ling.su.se/staff/sofia/suc/suc.html (Stockholm Umeå Corpus: 1,000,000 Swedish words, tagged; a licence has to be granted by the authors. It was used for apertium-sv-da.)
- http://www.woxikon.se Dictionary for Swedish<->English, German, Dutch...
- http://ordbok.nada.kth.se/ "Tvärslå is a Nordic dictionary consisting of many merged dictionaries"
- http://www.klid.dk/dansk/ordlister/samling.html - Keld's collection of resources
See also
Further reading
- LUNDIN AKESSON Katarina (2003) "Constructions with låta, LET, reflexives and passive-s: a comment on some differences, similarities and related phenomena". Working papers in Scandinavian syntax ISSN 1100-097X