Northern Sámi and Norwegian
Introduction
This is a language pair translating from Northern Sámi to Norwegian Bokmål. You can test development versions from Sámi Giellatekno.
This pair uses HFST for the Sámi morphological analysis, CG for rule-based disambiguation and lexical selection, and lttoolbox/apertium modules for the rest. Transfer is four-stage chunking. The pipeline looks like this:
hfst-proc (sme analysis) | cg-proc (disambiguation) | apertium-tagger | cg-proc (lexical selection) | pretransfer | transfer (chunking) | interchunk1 | interchunk2 | postchunk | lt-proc (nob generation)
To get an overview of transfer rules: $ grep '<rule' *.t[1-4]x
More introductory documentation in:
and more in-depth documentation here:
- Northern Sámi and Norwegian/bidix and chunking
- Northern Sámi and Norwegian/Compounds and output macros
- Northern Sámi and Norwegian/Derivations
- Northern Sámi and Norwegian/anaphora "resolution"
- Northern Sámi and Norwegian/smemorf (about the analyser)
- Northern Sámi and Norwegian/CG (about the disambiguator and lexical selection)
See also the Northern Sámi and Norwegian/release TODO's and Northern Sámi and Norwegian/Installation.
misc todo's
- Can we add all N.Prop.Fem/Masc with identical translations without problems? They should only be "names of humans" ie. no translation …
- Try and use prpers for all nob pronouns and possessive determiners, would be a lot cleaner (see clean_pron in t4x)
- then the nob det.pos need person/number tags
- Headline language: Heahpat hállat go gillá => Skam å snakke når man lider(?)
- When bidix lookup is moved out of transfer: match on tl in the chunker instead of matching on sl and having a big choose-when for each tl possibility.
- Makkár oainnu oažžu son guhte čohkke buot dieđuid du birra neahtas? has no Qst tag, how do we tell t3x that the @OBJ→ can stay before the verb? (if interpreted as a regular focused object, it gets switched with the subject)
- ja makkár áššiid don it muital. shouldn't switch Neg and -FMAINV, t3x rule has to match @OBJ→ @SUBJ→ Neg IV
- riddoguovlluid ássiin lei geatnegasvuohta lágidit gonagassii fatnasiid currently becomes "på de som bor med kysttraktene det var en plikt drive til kongen båter" -- would it be better to go for "hadde" here?
See also the /smemorf todo-list.
Definiteness in nob
Some contexts are relatively safe:
- Attributive superlatives are definite, and have indefinite nouns: det.poss adj.sup n => det.poss adj.sup.def n
- min viktigste.def oppgave.ind
- Predicative superlatives are almost always indefinite
- min oppgave.ind/oppgaven.def min er viktigst.ind
- ...unless they have a definite determiner:
- min oppgave.ind er den viktigste.def
Others we have to guess.
Features we might be able to use: subject/object, theme/focus, prepositions?
- Du dálkasis sáhtii leamaš ávki => Din(det.poss) medisin(ind) kan ha vært til nytte
- Mánná oađđá => Barnet(def) sover (but is this ambiguous?)
- Son lea čeahpes bárdni => Han er en(art) flink(ind) gutt(ind)
- Dá livččii skeaŋka din čeahpes bárdnai => Her er en gave jeg kunne ønske å gi den(art) flinke(def) sønnen(def) deres(det.poss)
Sámi collective nouns are marked Coll, but there's no collective nor mass noun marking in nob.dix, so I guess that's not much help.
- but this list is GPL :-)
Some more rules:
- indefinite:
- locative/(illative) in first position + leat
- habitive tag
- advl
- definite: locative not in first position + leat
Dual verb forms always indicate definite subjects.
Case
Case to preposition
This is for adverbial cases
Essive nouns (mánnán=>som barn) are ambiguous between sg and pl; can we just choose sg.ind all the time?
Case to object
Accusative objects are just translated. The issue here is definiteness.
Case to possessor phrase
- Gen N => N-Def til Possessor; but for Bokmål, Gen's N would be simpler and fine in most cases
- for both expressions, definiteness is more or less trivial
Postposition/number case choice
We can remove genitive case which is due to a postposition or after a number (or turn it into accusative for a pronoun).
- garra.ADJ dálkki.N.GEN geažil.PO[GEN] => på_grunn_av.PR dårlig.ADJ vær.N
- guokte.NUM biilla.N.SG.GEN => to.NUM biler.N.PL
Case to Number
This is the sme quantifier phrase: two.Sg.Nom book.Sg.Gen ==> two book.Pl.Indef. This also holds for two.Sg.Acc book.Sg.Acc and, come to think of it, for two.Sg.Gen book.Sg.Obliquecase. In the latter case, of course, the oblique case in question has to be translated into a preposition or similar.
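The quantifier-phrase rule can be sketched in Python as follows. This is only an illustration, assuming tokens are (lemma, tag set) pairs; the function name and the tag spellings are hypothetical and do not mirror the actual transfer rules.

```python
def fix_quantified_noun(tokens):
    """Turn NUM + N.Sg.Gen (or N.Sg.Acc) into NUM + N.Pl.Ind.

    tokens: list of (lemma, set_of_tags) pairs. This representation is an
    assumption made for the sketch, not the format used by the real rules.
    """
    out = []
    for i, (lemma, tags) in enumerate(tokens):
        prev_is_num = i > 0 and "Num" in tokens[i - 1][1]
        is_counted_noun = "N" in tags and "Sg" in tags and bool(tags & {"Gen", "Acc"})
        if prev_is_num and is_counted_noun:
            # Drop the sme number and case, use plural indefinite for nob.
            tags = (tags - {"Sg", "Gen", "Acc"}) | {"Pl", "Ind"}
        out.append((lemma, tags))
    return out


phrase = [("guokte", {"Num", "Sg", "Nom"}), ("biila", {"N", "Sg", "Gen"})]
print(fix_quantified_noun(phrase))
```

Oblique cases on the noun (which need a preposition in nob) are deliberately left out of this sketch.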
Agreement
Subject-verb agreement to be removed.
Subject insertion from pro-drop
Pro-drop sentences should have subjects inserted, observing the nob V2 rule:
- Topicalised sentences
- X + V => X + V + subjpron
- X + Neg + V => X + V + subjpron + ikke
- Verb-initial sentences
- V => subjpron + V
- Neg + V => subjpron + V + ikke
We could do this by changing a variable in the movement interchunk stage based on whether the pattern matches a subject or not.
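The four patterns above can be sketched in Python over a list of symbolic chunk labels. The labels ("X", "Neg", "V", "Subj", "subjpron") and the function name are assumptions for illustration; the real implementation would live in the interchunk rules.

```python
def insert_subject(chunks):
    """Insert 'subjpron' into a subjectless clause, keeping the verb second.

    chunks: list of labels drawn from 'X', 'Neg', 'V', 'Subj'
    (hypothetical labels, chosen for this sketch only).
    """
    if "Subj" in chunks or "V" not in chunks:
        return list(chunks)  # nothing to insert
    negated = "Neg" in chunks
    rest = [c for c in chunks if c != "Neg"]
    v = rest.index("V")
    if v == 0:
        # Verb-initial: the pronoun takes first position.
        out = ["subjpron", "V"]
    else:
        # Topicalised: X keeps first position, the pronoun follows the verb.
        out = rest[: v + 1] + ["subjpron"]
    if negated:
        out.append("ikke")
    return out + rest[v + 1:]


for clause in (["X", "V"], ["X", "Neg", "V"], ["V"], ["Neg", "V"]):
    print(clause, "=>", insert_subject(clause))
```

The sketch assumes at most one topicalised chunk before the verb; clauses that already have a subject are returned unchanged.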
Negation
Negation is a verb in sme, an adverbial in nob.
- Subj + Neg + ConNeg => Subj + Prs + ikke
- Subj + Neg + PrfPrtc => Subj + Prt + ikke
- Neg + Subj + ConNeg => Subj + Prs + ikke
- Neg + Subj + PrfPrtc => Subj + Prt + ikke
- X + Neg (+ Subj) + ConNeg => X + Prs + Subj + ikke
- X + Neg (+ Subj) + PrfPrtc => X + Prt + Subj + ikke
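As a rough Python sketch, the six patterns above reduce to two decisions: the main verb form picks the nob tense, and a topicalised X pushes the subject after the verb. The labels and the function name are hypothetical, and subjectless clauses without X are left to the pro-drop handling.

```python
# Main verb form in sme -> tense of the nob finite verb (from the list above).
TENSE = {"ConNeg": "Prs", "PrfPrtc": "Prt"}


def translate_negation(chunks):
    """Map a negated sme clause pattern to the nob order with 'ikke'.

    chunks: list of labels such as ['X', 'Neg', 'Subj', 'ConNeg'].
    Labels are assumptions made for this sketch.
    """
    main = next(c for c in chunks if c in TENSE)
    verb = TENSE[main]
    has_subj = "Subj" in chunks
    if chunks[0] == "X":
        # Topicalised: the verb stays second, the subject follows it.
        out = ["X", verb] + (["Subj"] if has_subj else [])
    else:
        out = (["Subj"] if has_subj else []) + [verb]
    return out + ["ikke"]


print(translate_negation(["Neg", "Subj", "PrfPrtc"]))
```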
Other verb=>adverb
- veadjit => orke, greie (gal veadja leat => det er kanskje)
- soaitit/dáidit => kanskje
If such a verb is followed by an infinitive, the verb itself is translated as kanskje, and person/number/tense are moved to the infinitive that follows it.
See also The Book
Essive SPRED => V
There is currently no way of checking whether bidix has a translation for this and that word in transfer; eg. for
Itgo boađáše munnje veahkkin?
ikke.du komme meg.ILL hjelp.N.ESS.@←SPRED
`Kommer du ikke og hjelper(V) meg?'
we might want to translate "V PRON.ILL N.ESS.@←SPRED" or something like that into "V og V PRON.ACC" but only if the first N has a corresponding verb (hjelp <=> hjelpe), otherwise we might want to stick with the more literal "til meg som N".
Could we simply do this by adding bidix entries for "PRON.ILL N.ESS.@←SPRED" into "V PRON.ACC"? Or would that overgenerate?
- No we can't, @←SPRED is not a valid sdef due to @ and ←. We _could_ put "PRON.ILL N.ESS" in bidix though, with even more chance of overgenerating…
- Also, we'd first have to turn it into one lexical unit (^mun<Pron><Pers><Sg1><Ill><@ADVL>$ ^veahkki<N><Ess><@←SPRED>$ => ^mun# veahkki<Pron><Pers><Sg1><Ill><@ADVL><N><Ess><@←SPRED>$ or something).
- Current solution: the lexical selection CG, sme-nob.lex, adds a V tag to any N.ESS.@←SPRED followed by PRON.PERS.ILL, and the bidix entry simply uses this added tag. t1x checks for the tl vblex tag, t2x does the movement and adds the conjunction.
Focus particles
These are equivalent:
* reaškkihan: reaškit+V+IV+VGen+Foc/han
* reaškki han: reaškit+V+IV+VGen han+Pcle
Fortunately, the relabel script can turn the first into
* ^reaškkihan/reaškit<V><IV><VGen>+han<Pcle>$
and cg-proc puts syntax tags on the first part of multiwords, so after pretransfer we get
* ^reaškit<V><IV><VGen><@X>$ ^han<Pcle>$
and we can handle focus particles in bidix even if they are attached to the preceding word.
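The splitting step described above can be sketched in Python. This only mimics the effect shown in the example (one joined unit becomes two lexical units); it is not the pretransfer implementation, and it assumes a "+" directly after a closing ">" always marks a join.

```python
import re


def split_joined_units(stream):
    """Split '^a<t>+b<u>$' into '^a<t>$ ^b<u>$'.

    Simplification for this sketch: no escaped characters, and a '+'
    directly after '>' is always treated as a join between units.
    """
    def split(match):
        parts = re.split(r"(?<=>)\+", match.group(1))
        return " ".join("^" + part + "$" for part in parts)

    return re.sub(r"\^([^$]*)\$", split, stream)


print(split_joined_units("^reaškit<V><IV><VGen><@X>+han<Pcle>$"))
```

Because the syntax tag already sits on the first part, the particle ends up as its own unit that bidix can look up separately.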
leat => være / ha
leat may translate into either være or ha; picking the wrong one gives very odd output.
- Mánát leat boahtán skuvlii => Barnene har kommet til skolen
- verb afterwards: har (well in this case "er" works, movement verb, but in general)
- Dat lea sihke buorre ja heittot => Det er både bra og dårlig
- Norga.no deháleamos doaibma lea ofelastit geavaheaddjiid almmolaš bálvalusaide => Norge.no's viktigste oppgave er å veivise brukere til offentlige tjenester
- å afterwards: er
- Mus lea oahpahus gaskkal guovtti ja njealji => Jeg har undervisning mellom to og fire
- "from.me is teaching between two and four"
- Mus lea biepmu => Jeg har mat
- "from.me is food"
- "Mus" is <Loc><@HAB> in both these;
- Ii mus leat bahá vuoigŋa => Jeg er ikke besatt
- "not.3SG from.me is.CONNEG angry spirit"
- counterexample to the two above...because of negation? or the adjective?
- Mun lean buorre => Jeg er god
- Son lea čeahpes bárdni => Han er en flink gutt
- "Mun", "Son" are not <Loc><@HAB> ...
- Mus lea gažaldat didjiide => Jeg har et spørsmål til dere
We handle this as a lexical selection problem in CG.
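The cues noted in the examples above can be summarised in a rough Python sketch. It is not the CG rule set: the function name and the tag spellings are assumptions, and the negated counterexample (Ii mus leat bahá vuoigŋa) is not covered.

```python
def choose_leat(clause_tags, next_tags=frozenset()):
    """Pick 'ha' or 'være' for leat from the cues in the examples.

    clause_tags: tags seen anywhere in the clause (e.g. {'Loc', '@HAB'}).
    next_tags:   tags of the word right after leat.
    Both parameters and the tag names are hypothetical.
    """
    if "@HAB" in clause_tags:
        return "ha"    # habitive owner: Mus lea biepmu => Jeg har mat
    if "PrfPrtc" in next_tags:
        return "ha"    # perfect participle follows: leat boahtán => har kommet
    return "være"      # default: copula, or an infinitive follows


print(choose_leat({"Loc", "@HAB"}))
print(choose_leat(set(), {"V", "PrfPrtc"}))
print(choose_leat({"Nom"}, {"A"}))
```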
Transfer
Chunk naming scheme
Since t4x (postchunk) cannot tell how many and what kind of lexical units there are in each chunk, we use the chunk names to signal this. Some examples of chunk names:
- "verb" -- this chunk has a single verb lexical unit
- "det_adj_nom" -- eg. "den nye saken", three lexical units
Transfer name oddities
Since we can't have strange symbols in XML id's, an "LSUBJ" is a chunk with the syn_label "@SUBJ→", an "ROPRED" has "@←OPRED" etc.
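The naming convention can be sketched as a small Python helper. The function name is hypothetical, and the handling of labels without an arrow is an assumption; only the two arrow cases come from the examples above.

```python
def xml_safe_label(syn_label):
    """Turn a syntactic label into a name usable as an XML id.

    '@SUBJ→'  -> 'LSUBJ'   (arrow after the name gives prefix L)
    '@←OPRED' -> 'ROPRED'  (arrow before the name gives prefix R)
    """
    name = syn_label.lstrip("@")
    if name.endswith("→"):
        return "L" + name[:-1]
    if name.startswith("←"):
        return "R" + name[1:]
    # Assumption: labels without an arrow just lose the '@'.
    return name


print(xml_safe_label("@SUBJ→"), xml_safe_label("@←OPRED"))
```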
What goes where
The current plan:
- t1x
- (de-)compounding,
- derivation,
- simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
- simple periphrastic verb combinations (verb, vaux pp, vaux inf)
- Insert prepositions based on case
- t2x
- relatives (SN "who" SV -> SN)
- co-ordination (SN "and" SN -> SN)
- genitive modifiers (SN SN-Gen) "[University of Reykjavik] [big old library]-GEN"
- t3x
- move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
- remove prepositions when case is governed by adpositions
- V2
- Insert dropped pronouns
- t4x
- Insert articles
- Cleanup
NP's
Definiteness is set on the noun phrase chunk in t1x, but might change during t2x or t3x. Gender and number, however, never change based on larger contexts.
Example: t1x might give ^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$. Before t4x is run, the slot holding the 5 (a placeholder) is given the value ind. Thus we can change definiteness in t2x and t3x; gender and number tags, however, are on the actual word. The same goes for adjectives (and determiners) in the NP chunk, so gender is applied to the adjective/determiner in t1x, while we keep the placeholder for definiteness until t4x.
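How a numbered placeholder picks up its value from the chunk tags can be sketched in Python. This only mimics the behaviour described above for the example chunk; the function name is hypothetical and the parsing ignores escapes.

```python
import re


def resolve_placeholders(chunk):
    """Replace numbered tags like <5> inside a chunk with the chunk's Nth tag."""
    m = re.fullmatch(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$", chunk)
    if m is None:
        raise ValueError("not a chunk: " + chunk)
    chunk_tags = re.findall(r"<([^>]+)>", m.group(2))

    def fill(tag):
        # Placeholders are 1-indexed positions in the chunk tag list.
        return "<" + chunk_tags[int(tag.group(1)) - 1] + ">"

    return re.sub(r"<(\d+)>", fill, m.group(3))


chunk = "^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$"
print(resolve_placeholders(chunk))
# A later stage may change the fifth chunk tag before the placeholder is filled:
print(resolve_placeholders(chunk.replace("<ind>{", "<def>{")))
```

Since only the chunk tag is edited in t2x/t3x, the word inside the chunk stays untouched until the placeholder is filled.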
We do some cleanup on adjectives in t4x to make sure they match the nob.dix format, eg. positive indefinites have gender tags while definites don't (this has to happen in t4x since we have to know if they end up being definite or what).
Exception: some contexts are certain, eg.:
- det.dem (adj.def) n.def
- det.qnt (adj.def) n.ind
- det.pos (adj.def) n.ind
For these, if we don't add the definiteness tag to the chunk but instead apply definiteness directly on the words, it can't be changed during t2x/t3x (<let><clip pos="N" part="art"/><lit-tag v="foo"/></let> has no effect when the "art" attribute is empty), and we don't apply anything from the chunk during t4x if the "art" attribute is empty.
See also
- Pending tests
- Regression tests
- Northern Sámi and Norwegian/bidix and chunking
- Northern Sámi and Norwegian/Compounds and output macros
- Northern Sámi and Norwegian/Derivations
- Northern Sámi and Norwegian/smemorf (about the analyser)
External links
- lookup Sámi words in giellatekno hfst online
- Samisk Grammatikk by Klaus Peter Nickel (the whole book, from Nasjonalbiblioteket)
- some translated example sentences at visl.sdu.dk
- ~5000 words sme-eng
- Nordisk møteordliste
- Termlister - Giella.org
Web sites with sme / sme-nob text
- Avvir.no Northern Sámi newspaper
- Infonuorra.no (unginfo), Norwegian + Sámi texts, news/blog-like
- Kongehuset.no lots of parallel text
- Sametingets plenum pdf's with parallel (sme-nob) gov't text, choose eg. Publikasjoner-Møtebøker-Plenum