Northern Sámi and Norwegian


Introduction

This is a language pair translating from Northern Sámi to Norwegian Bokmål. It uses HFST for the Sámi morphological analysis, CG for rule-based disambiguation and lexical selection, and lttoolbox/apertium modules for the rest. Transfer is four-stage chunking. The pipeline looks like this:

hfst-proc (sme analysis) | cg-proc (disambiguation) | apertium-tagger | cg-proc (lexical selection) | pretransfer |\
transfer (chunking) | interchunk1 | interchunk2 | postchunk | lt-proc (nob generation)

To get an overview of transfer rules: $ grep '<rule' *.t[1-4]x

See also the comments at the beginning of each tNx-file, and the README.

Misc TODOs

  • Try and use prpers for all nob pronouns and possessive determiners, would be a lot cleaner (see clean_pron in t4x)
    • then the nob det.pos need person/number tags
  • Compound epenthetics: could add them in bidix as ep-e, ep-s or ep-Ø, but might as well just keep a single tag (like "Cmp") until nob generation, and let monodix sort it out
  • Headline language: Heahpat hállat go gillá => Skam å snakke når man lider(?)
  • Clean up t1x verb rule(s), possibly make several of them (at the moment we have "if adv=>out adv, elif nom=>out nom, elif adj=>out adj, else {if progressive=>out 'i gang med å', if causative=>out 'la', if passive=>out 'bli', out verb, if reflexive=>out 'seg'}"...)
  • When bidix lookup is moved out of transfer: match on tl in the chunker instead of matching on sl and having a big choose-when for each tl possibility.
  • move set_defnes_ana2 from t3x to t2x, should be cleaner that way (but does require verb chunks to carry gender)
    • keep two ana variables, one for {m,f} and one for {m,f,nt}, since we might get "Olga.f gulai dat.nt ja son[f, not nt] oahpai" but "Dat.nt lea buot[nt!] mii mus lea"
  • switch_cases needs to be called in lots more places...


See also /smemorf.

Corpus

TODO:


Word order

sme not-V2 and nob V2

sme OV tendencies and no nob OV

nob particle verbs

Since we "just" need generation, we could do it with this method.

Definiteness in nob

Some contexts are relatively safe:

  • Attributive superlatives are definite, and have indefinite nouns: det.poss adj.sup n => det.poss adj.sup.def n
    1. min viktigste.def oppgave.ind
  • Predicative superlatives are almost always indefinite
    1. min oppgave.ind / oppgaven.def min er viktigst.ind
  • ...unless they have a definite determiner:
    1. min oppgave.ind er den viktigste.def

Others we have to guess.

Features we might be able to use: subject/object, theme/focus, prepositions?

  • Du dálkasis sáhtii leamaš ávki => Din(det.poss) medisin(ind) kan ha vært til nytte
  • Mánná oađđá => Barnet(def) sover (but is this ambiguous?)
  • Son lea čeahpes bárdni => Han er en(art) flink(ind) gutt(ind)
  • Dá livččii skeaŋka din čeahpes bárdnai => Her er en gave jeg kunne ønske å gi den(art) flinke(def) sønnen(def) deres(det.poss)

Sámi collective nouns are marked Coll, but there's no collective nor mass noun marking in nob.dix, so I guess that's not much help.

but this list is GPL :-)

Some more rules:

  • indefinite:
    • locative/(illative) in first position + leat
    • habitive tag
    • advl
  • definite: locative not in first position + leat

Case

Case to preposition

This is for adverbial cases

Essive nouns (mánnán=>som barn) are ambiguous between sg and pl; can we just choose sg.ind all the time?

Case to object

Accusative objects are just translated. The issue here is definiteness.

Case to possessor phrase

  • Gen N => N-Def til Possessor; but for Bokmål, Gen's N would be simpler and fine in most cases
    • for both expressions, definiteness is more or less trivial

Postposition/number case choice

We can remove genitive case which is due to a postposition or after a number (or turn it into accusative for a pronoun).

  • garra.ADJ dálkki.N.GEN geažil.PO[GEN] => på_grunn_av.PR dårlig.ADJ vær.N
  • guokte.NUM biilla.N.SG.GEN => to.NUM biler.N.PL

Case to Number

This is the sme quantifier phrase: two.Sg.Nom book.Sg.Gen ==> two book.Pl.Indef. This also holds for two.Sg.Acc book.Sg.Acc and, come to think of it, for two.Sg.Gen book.Sg.Obliquecase. In the latter case, of course, the oblique case in question still has to be translated into a preposition or whatever fits.
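A minimal Python sketch of the quantifier-phrase rewrite described above, not the actual transfer rule. The function name, the tag spellings and the set of oblique cases are assumptions made for illustration. The source does not say how every numeral behaves, so the sketch treats all numerals alike.

```python
# Hypothetical tag names, loosely following the sme analyses quoted above.
OBLIQUE_CASES = {"Ill", "Loc", "Com", "Ess"}


def quantifier_phrase(numeral, noun):
    """Rewrite a (numeral, singular noun) pair as numeral + plural indefinite noun.

    Both arguments are (lemma, [tags]) pairs. An oblique case on the noun is
    kept so that a later stage can turn it into a preposition; Nom, Acc and
    Gen are simply dropped.
    """
    num_lemma, num_tags = numeral
    noun_lemma, noun_tags = noun
    if "Num" not in num_tags or "N" not in noun_tags or "Sg" not in noun_tags:
        return [numeral, noun]  # not a quantifier phrase: leave untouched
    kept = [t for t in noun_tags if t in OBLIQUE_CASES]
    return [(num_lemma, ["Num"]), (noun_lemma, ["N", "Pl", "Ind"] + kept)]


print(quantifier_phrase(("two", ["Num", "Sg", "Nom"]), ("book", ["N", "Sg", "Gen"])))
```

In the real pair this would be spread over t1x (number and definiteness) and the case-to-preposition handling; the sketch only shows the intended input/output relation.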

Agreement

Subject-verb agreement to be removed. Dual verbforms always indicate definite subjects.

Subject insertion from pro-drop

Pro-drop sentences should have subjects inserted, observing the nob V2 rule:

  • Topicalised sentences
    • X + V => X + V + subjpron
    • X + Neg + V => X + V + subjpron + ikke
  • Verb-initial sentences
    • V => subjpron + V
    • Neg + V => subjpron + V + ikke


We could do this by changing a variable in the movement interchunk stage based on whether the pattern matches a subject or not.
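The four patterns above can be sketched in Python as a single reordering function. This is only an illustration over chunk labels, under the assumption that the clause has already been recognised as subjectless; the labels "X", "Neg", "V", "subjpron" and "ikke" are taken from the list, while the function name is hypothetical.

```python
def insert_subject(chunks, subjpron="subjpron"):
    """Insert a subject pronoun into a pro-drop clause, respecting nob V2.

    `chunks` is a list of chunk labels such as ["X", "Neg", "V"].
    Only the four patterns listed above are covered.
    """
    negated = "Neg" in chunks
    rest = [c for c in chunks if c != "Neg"]  # sme negation verb disappears
    v = rest.index("V")
    if v == 0:
        # Verb-initial: the subject takes the first position.
        out = [subjpron] + rest
    else:
        # Topicalised: the verb stays second, the subject follows it.
        out = rest[: v + 1] + [subjpron] + rest[v + 1 :]
    if negated:
        # "ikke" ends up after both the verb and the inserted subject.
        out.insert(v + 2, "ikke")
    return out


for pattern in (["X", "V"], ["X", "Neg", "V"], ["V"], ["Neg", "V"]):
    print(pattern, "=>", insert_subject(pattern))
```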

Negation

Negation is a verb in sme, an adverbial in nob.

  • Subj + Neg + ConNeg => Subj + Prs + ikke
  • Subj + Neg + PrfPrtc => Subj + Prt + ikke
  • Neg + Subj + ConNeg => Subj + Prs + ikke
  • Neg + Subj + PrfPrtc => Subj + Prt + ikke
  • X + Neg (+ Subj) + ConNeg => X + Prs + Subj + ikke
  • X + Neg (+ Subj) + PrfPrtc => X + Prt + Subj + ikke
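As a rough Python sketch of the six patterns above (not the interchunk rules themselves): the sme negation verb is dropped, the connegative or perfect participle becomes a finite present or preterite verb, and "ikke" is added. The function name and the restriction to exactly these patterns are assumptions.

```python
# Which nob tense the sme main-verb form maps to, per the list above.
TENSE = {"ConNeg": "Prs", "PrfPrtc": "Prt"}


def translate_negation(chunks):
    """Reorder a negated sme clause, given as chunk labels, into nob order."""
    neg = chunks.index("Neg")
    verb = next(TENSE[c] for c in chunks if c in TENSE)
    subj = ["Subj"] if "Subj" in chunks else []
    # Anything other than the subject before Neg counts as a topicalised X.
    topic = [c for c in chunks[:neg] if c != "Subj"]
    if topic:
        # V2: topic, finite verb, then subject (if any), then ikke.
        return topic + [verb] + subj + ["ikke"]
    return subj + [verb, "ikke"]


print(translate_negation(["X", "Neg", "Subj", "ConNeg"]))
```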

Other verb=>adverb

veadjit => orke, greie (gal veadja leat => det er kanskje)
soaitit/dáidit => kanskje

If this verb is followed by an infinitive, the verb itself is translated as kanskje, and person/number/tense move to the infinitive that follows it.

See also The Book

Non-finite verb forms

These are reduced clauses, to be expanded into embedded clauses.

Gerund

...

Actio locative

...

Actio essive

...


Pre- and postpositions

Postposition to preposition

Lexical selection

Essive SPRED => V

There is currently no way of checking in transfer whether bidix has a translation for a given word; eg. for

Itgo    boađáše munnje  veahkkin? 
ikke.du komme   meg.ILL hjelp.N.ESS.@←SPRED
`Kommer du ikke og hjelper(V) meg?'

we might want to translate "V PRON.ILL N.ESS.@←SPRED" or something like that into "V og V PRON.ACC" but only if the first N has a corresponding verb (hjelp <=> hjelpe), otherwise we might want to stick with the more literal "til meg som N".

Could we simply do this by adding bidix entries for "PRON.ILL N.ESS.@←SPRED" into "V PRON.ACC"? Or would that overgenerate?

No, we can't: @←SPRED is not a valid sdef because of the @ and the ←. We _could_ put "PRON.ILL N.ESS" in bidix though, with an even greater chance of overgenerating…
Also, we'd first have to turn it into one lexical unit (^mun<Pron><Pers><Sg1><Ill><@ADVL>$ ^veahkki<N><Ess><@←SPRED>$ => ^mun# veahkki<Pron><Pers><Sg1><Ill><@ADVL><N><Ess><@←SPRED>$ or something).
Current solution: letting the lex.sel CG, sme-nob.lex, add a V tag to any N.ESS.@←SPRED followed by PRON.PERS.ILL, and then just make the bidix entry use this added tag. t1x checks for the tl vblex tag, t2x does the movement and adds the conjunction.

Clause types

Yes-no questions

Verb-initial yes-no questions are directly translated, with go removal

When other constituents are added

Relative clauses

sme relative pronoun into nob "som"

"som" may be deleted when the relative refers to

Passive clauses

Morphology

Syntax

POS disambiguation

vai

..

Existential sentences

  • Insert det in the nob translation
  • TODO: Use "pers" or "impers" tag from bidix

Focus particles

These are equivalent:

* musnai  mun+Pron+Pers+Sg1+Loc+Foc/naj
* mus nai     mun+Pron+Pers+Sg1+Loc     nai+Pcle

Fortunately, the relabel script can turn the first into

* ^musnai/mun<Pron><Pers><Sg1><Loc>+naj<Pcle>$

and cg-proc puts syntax tags on the first part of multiwords, so after pretransfer we get

* ^mun<Pron><Pers><Sg1><Loc><@PCLE>$ ^naj<Pcle>$

and can handle focus particles in bidix.
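To make the last step concrete, here is a small Python sketch of what splitting a joined lexical unit on "+" looks like. It is not the real pretransfer program, which also handles cases this sketch ignores (for example "#" in multiwords and escaped characters); the function name is hypothetical, and the input is the analysis-only form of the unit after cg-proc has added the syntax tag.

```python
import re

# One lexical unit in the Apertium stream: ^...$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def split_joined_units(stream):
    """Split every lexical unit joined with '+' into separate units."""

    def split_one(match):
        parts = match.group(1).split("+")
        return " ".join("^" + part + "$" for part in parts)

    return LEXICAL_UNIT.sub(split_one, stream)


print(split_joined_units("^mun<Pron><Pers><Sg1><Loc><@PCLE>+naj<Pcle>$"))
```

After the split, "naj" is an ordinary lexical unit of its own, which is why it can be given a plain bidix entry.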

leat => være / ha

leat may translate into either være or ha; picking the wrong one gives very odd translations.

  • Mánát leat boahtán skuvlii => Barnene har kommet til skolen
    • verb afterwards: har (well in this case "er" works, movement verb, but in general)
  • Dat lea sihke buorre ja heittot => Det er både bra og dårlig
  • Norga.no deháleamos doaibma lea ofelastit geavaheaddjiid almmolaš bálvalusaide => Norge.no's viktigste oppgave er å veivise brukere til offentlige tjenester
    • å afterwards: er
  • Mus lea oahpahus gaskkal guovtti ja njealji => Jeg har undervisning mellom to og fire
    • "from.me is teaching between two and four"
  • Mus lea biepmu => Jeg har mat
    • "from.me is food"
    • "Mus" is <Loc><@HAB> in both these;
  • Ii mus leat bahá vuoigŋa => Jeg er ikke besatt
    • "not.3SG from.me is.CONNEG angry spirit"
    • counterexample to the two above...because of negation? or the adjective?
  • Mun lean buorre => Jeg er god
  • Son lea čeahpes bárdni => Han er en flink gutt
    • "Mun", "Son" are not <Loc><@HAB> ...
  • Mus lea gažaldat didjiide => Jeg har et spørsmål til dere


We handle this as a lexical selection problem in CG.
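The observations in the examples can be paraphrased as a toy decision function. This is a Python illustration only, not the CG rules in the lexical selection grammar, and it deliberately mistranslates the "Ii mus leat bahá vuoigŋa" counterexample noted above. The function name and the "PrfPrc" and "Inf" tag spellings are assumptions.

```python
def choose_leat(subject_tags, following_tags):
    """Guess whether leat should become 'ha' or 'være'.

    subject_tags:   tags of the subject-like constituent (e.g. of "Mus").
    following_tags: tags of the word right after leat.
    """
    if "PrfPrc" in following_tags:
        return "ha"  # perfect participle afterwards: auxiliary "har"
    if "Inf" in following_tags:
        return "være"  # infinitive ("å ...") afterwards: "er"
    if {"Loc", "@HAB"} <= set(subject_tags):
        return "ha"  # habitive locative: "from.me is X" => "jeg har X"
    return "være"  # default copula


print(choose_leat(["Pron", "Pers", "Sg1", "Loc", "@HAB"], ["N", "Sg", "Nom"]))
```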

Derivations: general rules and exceptions

Sámi has a lot of derivation rules. Sometimes the derived words have lexicalised translations in Bokmål, like ráhkisvuohta→kjærlighet; these we treat as exceptions which have to be specified in bidix. Other times we can use a general rule, like lohkagohten→begynte.1SG å lese.

We have two strategies for handling the rule/exception situation.

  1. For the situation where we have many exceptions, we let the analysis be eg. geavaheaddjiid/geavahit<V><TV><Der2><Actor><N><Pl> and from here there are two paths
    1. either this specific analysis is in bidix, here translating into bruker<n><m><pl>, or
    2. we have to use a transfer rule, in this case translating into de som bruker
  2. For the situation where we have few exceptions, we use dev/xfst2apertium.relabel to split the analysis into two lexical units. Two lexical units can't be specified in bidix, so here
    1. exceptions have to be added to the .lexc file as if they were lexicalised, so they remain one lexical unit
    2. while general transfer rules now match a pattern of two lexical units

More detailed: Deverbal nouns

Sámi verbs can turn into nouns. We want to be able to put this explicitly into the bidix (eg. sometimes the nob noun is not even based on the nob verb), but if it's not in bidix we want to be able to fall back on a construction using the verb, so

  • from geavaheaddjiid/geavahit<V><TV><Der2><Actor><N>
  • with fallback => de som bruker<vblex> (or something)
  • bidix specified => bruker<n><m>

With the following bidix entries we specify that we want bruker<n><m> in the above example:

    <e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
    <e><p><l>geavahit<s n="V"/><s n="TV"/><s n="Der2"/><s n="Actor"/><s n="N"/></l><r>bruker<s n="n"/><s n="m"/></r></p><par n="__n"/></e>

while if the second bidix line isn't there, we get the fallback. Transfer rules can now check

 <equal><clip side="tl" part="pos" .../><lit-tag v="n"/></equal>
 <equal><clip side="sl" part="pos" .../><lit-tag v="V"/></equal>

The same specification/fallback might be applied with other Derivations.
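The specification/fallback idea can be sketched in Python as a longest-prefix dictionary lookup followed by a check on the target-side part of speech. This is an illustration, not how lttoolbox or the t1x rules are implemented; the dictionary layout, the function names and the prefix-matching behaviour are assumptions, while the two entries mirror the bidix lines above.

```python
# Toy bidix: (sl lemma, sl tag prefix) -> (tl lemma, tl tags)
FULL_BIDIX = {
    ("geavahit", ("V", "TV")): ("bruke", ("vblex", "pers")),
    ("geavahit", ("V", "TV", "Der2", "Actor", "N")): ("bruker", ("n", "m")),
}


def lookup(bidix, lemma, tags):
    """Return the translation for the longest matching tag prefix, or None."""
    for n in range(len(tags), 0, -1):
        hit = bidix.get((lemma, tuple(tags[:n])))
        if hit is not None:
            tl_lemma, tl_tags = hit
            # Tags not covered by the entry are carried over unchanged.
            return tl_lemma, list(tl_tags) + list(tags[n:])
    return None


def translate_actor(bidix, lemma, tags):
    """Use the bidix noun if one was specified, else fall back on the verb."""
    hit = lookup(bidix, lemma, tags)
    if hit is None:
        return None
    tl_lemma, tl_tags = hit
    if tags[0] == "V" and tl_tags[0] == "n":
        return ("specified", tl_lemma)  # eg. bruker<n><m>
    return ("fallback", tl_lemma)  # eg. "de som" + bruke<vblex>


analysis = ["V", "TV", "Der2", "Actor", "N", "Pl"]
print(translate_actor(FULL_BIDIX, "geavahit", analysis))
```

Removing the second entry from the toy dictionary makes the same call return the fallback, which corresponds to the case where the second bidix line isn't there.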

Other fallbacks

  • Der/at, adj->adv, gets adj.posi.nt.sg.ind ("vid" => "vidt")
  • Der/vuohta, adj->n, gets adj...def, eg. "grunn" -> "det grunne" (bidix overrides "fattig" to "fattigdom")

Derivations

Note: We should change the tags below, since it is not possible to use / in tags in Apertium. But then we should change the CG as well...

There are also derivations of derivations:

"<geavaheaddjis>"
...
          "geavvat" V* IV* Der1 Der/h V* TV Der2 Actor N Sg Acc PxSg3

For transfer purposes it might be simplest to treat these "flatly" as if they were single derivations (ie. Der1_Der_h_V_TV_Der2).

Tag | Type | Example | In Bokmål
Der/Dimin | Diminutive | mánáš "mánná" N Der1 Der/Dimin N Sg Nom | barn→lite barn
Der/1 Der/st | Diminutive verb | attestit "addit" V TV Der1 Der/st V Inf | gi→gi litt
Der/adda | V→N/PrfPrc/Actio | bassaladdan "bassalit" V* TV Der2 Der/adda | →vaske tøy (bassat=vaske)
Der/ahtti | V→V | vajálduhttit "vajálduvvat" V* IV* Der2 Der/ahtti V TV | →overse/glemme
Der/alla | suffix | bázáhallan "bázihit" V* TV Der2 Der/alla V Actio |
Der/amoš | suffix | muitalamoš "muitalit" V TV Der3 Der/amoš N Sg Nom | fortelle→
Der/asti | suffix | muitalastit "muitalit" V TV Der2 Der/asti V Inf | fortelle→
Der/at | Adj→Adv | viidát "viiddis" A* Der2 Der/at Adv | vid→vidt
Der/d | V→V[refl] | basadit "bassat" V TV Der1 Der/d V | vaske→vaske seg
Der/eaddji | suffix | muitaleaddji "muitalit" V TV Der2 Actor N Sg Nom | fortelle→
Der/eamoš | suffix | muitaleamoš "muitalit" V* TV Der3 Der/eamoš | fortelle→
Der/eapmi | V→N | deaivvadeapmi "deaivvadit" V IV Der2 Der/eapmi N Sg Nom | møte(V) → møte(N)
Der/easti | suffix | muitaleastit "muitalit" V TV Der2 Der/easti V Inf | fortelle →
Der/geahtes | suffix | eaiggátkeahtes "eaiggát" N* Der3 Der/geahtes | eier →
Der/goahti | V→V Inchoative | boradišgohten "boradit" V TV Der3 Der/goahti V Ind Prt Sg1 | spise → jeg begynte å spise
Der/h | suffix | geavaheaddji "geavvat" V* IV* Der1 Der/h V* TV Der2 Actor; orrohit "orrot" V* IV Der1 Der/h V | heve seg→ ; bli/synes→
Der/halla | V→V[recip] | gulahallat "gullat" V* TV Der1 Der2 Der/halla | høre→forstå hverandre («høre hverandre»?)
Der/heapmi | suffix | čađaheapmi "čađđa" N* Der1 Der2 Der/heapmi A |
Der/huhtti | suffix | muosehuhttit "muoseheapme" A* Der1 Der/huhtti V* TV | urolig→
Der/huvva | suffix | čađahuvvo "čađđa" N* Der1 Der2 Der/huvva V IV Imprt Prs ConNegII |
Der/j | suffix | sáddejuvvot "sáddet" V* TV Der1 Der/j V* Der2 Der/PassL V | sende→
Der1 Der/l | V→V[subitive] | borralit "borralit" V TV Der1 Der/l V | spise→spise (i hast)
Der/l | ???? | ohcalit "ohcat" V* TV Der1 Der/l V | lete→savne/lengte etter
Der/las | suffix | lotnolas "lotnut" V* TV Der1 Der2 Der/las A | betale→
Der/laš | N→Adj | dábálaš "dáhpi" N Der1 Der/laš A Sg Nom | skikk→vanlig
Der/lágan | suffix | earálágan "eará" Pron Indef Sg Gen Der1 Der/lágan A | annen/andre→
Der/meahttun | V→Adj[Neg] | jáhkkemeahttun "jáhkkit" V TV Der1 Der/meahttun A Sg Nom | tro/anta→utrolig
Der/muš | suffix | ??? "juhkat" V TV Der3 Der/muš N Sg Nom | drikke→
Der/n | suffix | oažžun "oažžut" V* TV Der3 Der/n N | få→?
Der/st | suffix | várástit "várát" V TV Der1 Der/st V |
Der/stuvva | suffix | fuolastuvvat "fuollat" V* TV Der1 Der2 Der/stuvva V | bry seg om→
Der/supmi | suffix | čállosupmi "čállit" V* TV Der2 Der/PassL V* Der3 Der/supmi N | skrive/...→
Der/upmi | suffix | mearkkašupmi "mearkkašit" V* TV Der2 Der/PassL V* Der3 Der/upmi | merge seg→
Der/viđá | suffix | málestanviđá "málet" V TV Der1 Der/st V Der2 Der/eapmi N SgCmp Der/viđá Adv | male→
Der/vuohta | Adj→N | ráhkisvuohta "ráhkis" A Der3 Der/vuohta N Sg Nom | kjær→kjærlighet

Transfer

Chunk naming scheme

Since t4x (postchunk) cannot tell how many and what kind of lexical units there are in each chunk, we use the chunk names to signal this. Some examples of chunk names:

  1. "verb" -- this chunk has a single verb lexical unit
  2. "verb_part_verb" -- eg. "begynne å lese", three lexical units

What goes where

The current plan:

  • t1x
    • (de-)compounding,
    • derivation,
    • simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
    • simple periphrastic verb combinations (verb, vaux pp, vaux inf)
    • Insert prepositions based on case
  • t2x
    • relatives (SN "who" SV -> SN)
    • co-ordination (SN "and" SN -> SN)
    • genitive modifiers (SN SN-Gen), eg. "[University of Reykjavik] [big old library]-GEN"
  • t3x
    • move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
      • remove prepositions when case is governed by adpositions
    • V2
    • Insert dropped pronouns
  • t4x
    • Insert articles
    • Cleanup


NPs

Definiteness is set on the noun phrase chunk in t1x, but might change during t2x or t3x. Gender and number, however, never change based on larger contexts.

Example: t1x might give ^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$. The 5 is a placeholder: before t4x is run, it is given the value ind (the fifth tag on the chunk). Thus we can change definiteness in t2x and t3x by changing the chunk tag; gender and number tags, however, are on the actual word. The same goes for adjectives (and determiners) in the NP chunk: gender is applied to the adjective/determiner in t1x, while definiteness stays a placeholder until t4x.
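The placeholder mechanism can be illustrated with a short Python sketch that fills numbered tags inside a chunk from the chunk's own tag list. It only mimics the substitution on the example above and is not the postchunk implementation; the function name is hypothetical, and nested or escaped material is not handled.

```python
import re

CHUNK = re.compile(r"\^([^<{]+)((?:<[^>]+>)*)\{(.*)\}\$")


def fill_placeholders(chunk):
    """Replace <N> inside the chunk body with the chunk's Nth tag (1-based)."""
    match = CHUNK.fullmatch(chunk)
    if match is None:
        raise ValueError("not a chunk: " + chunk)
    _name, tag_string, body = match.groups()
    chunk_tags = re.findall(r"<([^>]+)>", tag_string)

    def resolve(ref):
        return "<" + chunk_tags[int(ref.group(1)) - 1] + ">"

    return re.sub(r"<(\d+)>", resolve, body)


print(fill_placeholders("^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$"))
```

Changing the chunk's fifth tag from ind to def in an earlier stage changes the noun's definiteness in the output without touching the word itself, which is the point of keeping the placeholder until t4x.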

We do some cleanup on adjectives in t4x to make sure they match the nob.dix format, eg. positive indefinites have gender tags while definites don't (this has to happen in t4x since we have to know if they end up being definite or what).

Exception: some contexts are certain, eg.:

  • det.dem (adj.def) n.def
  • det.qnt (adj.def) n.ind
  • det.pos (adj.def) n.ind

For these, if we don't add the definiteness tag to the chunk but instead apply definiteness directly to the words, it can't be changed during t2x/t3x (<let><clip pos="N" part="art"/><lit-tag v="foo"/></let> has no effect when the "art" attribute is empty), and we don't apply anything from the chunk during t4x if the "art" attribute is empty.

See also


External links

Web sites with sme / sme-nob text