Difference between revisions of "Northern Sámi and Norwegian"

From Apertium
(Undo revision 66781 by Memduh (talk))
 
(113 intermediate revisions by 7 users not shown)
Line 1: Line 1:
  +
This is a language pair translating '''from [[Northern Sámi]] to [[Norwegian Bokmål]]'''.
{{TOCD}}
 
   
==misc todo's==
 
* Script to turn sme-dis.rle into apertium-sme-nob.sme-nob.rlx
 
** <code>sed 's/\("[^"]*\)#/\1/g'</code> (ehm..more or less, there could be several #'s in there)
 
** <code>'s/>/→/', 's/</←/'</code>
 
* Remove # from lexc lemmas after dev/update-lexc.sh
 
* [[Ideas for Google Summer of Code/Morphology with HFST|HFST tokenisation]]
 
* Can HFST do LR longest match? (Otherwise, we could just remove compound analyses using CG if there's a non-compound analysis.)
 
   
  +
Giellatekno has a website, Jorgal, that offers '''[http://gtweb.uit.no/jorgal machine translation from Sámi to Norwegian]''', where you can try the latest version, with extra features such as translation of web pages and spell-checking of the translation.
   
==Corpus==
 
   
TODO:
 
   
  +
There is a paper about it, see [[Publications#2012]]: "Evaluating North Sámi to Norwegian assimilation RBMT".
   
  +
== Technical background ==
  +
  +
This pair uses [[HFST]] for the Sámi morphological analysis, [[CG]] for rule-based disambiguation and lexical selection, and [[lttoolbox]]/apertium modules for the rest. Transfer is four-stage [[chunking]]. The pipeline looks like this:
  +
  +
hfst-proc (sme analysis) | cg-proc (mor.dis) | cg-proc (syn.dis) |\
  +
pretransfer | lt-proc (lexical transfer) | cg-proc (lexical selection) |\
  +
transfer (chunking) | interchunk1 | interchunk2 | postchunk | lt-proc (nob generation)
  +
  +
If you want to see output of intermediate stages, use one of the following commands:
  +
  +
apertium -d . sme-nob-morph # until analysis
  +
apertium -d . sme-nob-tagger # until CG morphological and syntactic disambiguation
  +
apertium -d . sme-nob-disam # until disambiguation with vislcg3 --trace on (but doesn't handle compounds properly)
  +
apertium -d . sme-nob-pretransfer # until pretransfer
  +
apertium -d . sme-nob-biltrans # until lexical transfer
  +
apertium -d . sme-nob-lex # until CG lexical selection
  +
apertium -d . sme-nob-chunker # until first transfer stage: chunking, verb/case→preposition etc.
  +
apertium -d . sme-nob-interchunk1 # until second transfer stage: postposition→preposition, anaphora resolution
  +
apertium -d . sme-nob-interchunk2 # until third transfer stage: word order changes, pro-drop insertion
  +
apertium -d . sme-nob-postchunk # until fourth transfer stage: infinitive marker, determiner insertion, tag cleanup
  +
apertium -d . sme-nob-dgen # until generation, with debug info
  +
apertium -d . sme-nob # until generation
  +
  +
[[Northern Sámi and Norwegian/Commands]] has more on what they do.
  +
  +
To get an overview of transfer rules: <code>$ grep '<rule' *.t[1-4]x</code>
  +
  +
More introductory documentation in:
  +
* [[Northern Sámi and Norwegian/tNxIntros|the comments at the beginning of each tNx-file]] [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/]
  +
* the [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/README README]
  +
and more in-depth documentation here:
  +
* [[Northern Sámi and Norwegian/bidix]] and chunking
  +
* [[Northern Sámi and Norwegian/Compounds]] and output macros
  +
* [[Northern Sámi and Norwegian/Derivations]]
  +
* [[Northern Sámi and Norwegian/anaphora]] "resolution"
  +
* [[Northern Sámi and Norwegian/smemorf]] (about the HFST/lexc analyser, from giellatekno)
  +
* [[Northern Sámi and Norwegian/CG]] (about the disambiguator and lexical selection)
  +
  +
See also the [[Northern Sámi and Norwegian/release]] TODO's and [[Northern Sámi and Norwegian/Installation]].
  +
  +
  +
{{TOCD}}
  +
  +
==misc todo's==
  +
* Try and use prpers for all nob pronouns and possessive determiners, would be a lot cleaner (see clean_pron in t4x)
  +
** then the nob det.pos need person/number tags
  +
* Headline language: Heahpat hállat go gillá => Skam å snakke når '''man''' lider(?)
  +
* When bidix lookup is moved out of transfer: match on tl in chunker instead of matching on sl and having a big choose-when for each tl possibility.
  +
* ''Makkár oainnu oažžu son guhte čohkke buot dieđuid du birra neahtas?'' has no Qst tag, how do we tell t3x that the @OBJ→ can stay before the verb? (if interpreted as a regular focused object, it gets switched with the subject)
  +
* ''ja makkár áššiid don it muital.'' shouldn't switch Neg and -FMAINV, t3x rule has to match @OBJ→ @SUBJ→ Neg IV
   
==Word order==

===sme not-V2 and nob V2===

===sme OV tendencies and no nob OV===

===nob particle verbs===

Since we "just" need generation, we could do it with [[Multiwords#The_Nynorsk_hack|this method]].

  +
* ''riddoguovlluid ássiin lei geatnegasvuohta lágidit gonagassii fatnasiid'' currently becomes "på de som bor med kysttraktene det var en plikt drive til kongen båter" -- would it be better to go for "hadde" here?
  +
See also the [[/smemorf]] todo-list.
 
   
 
==Definiteness in nob==
 
Line 46: Line 86:
   
 
Sámi collective nouns are marked Coll, but there's no collective nor mass noun marking in nob.dix, so I guess that's not much help.
 
  +
: but [http://svn.emmtee.net/tags/topp/parc/pargram/norwegian/bokmal/bokmal-nkllex.lfg this list] is GPL :-)
   
 
Some more rules:
 
Line 53: Line 94:
 
** advl
 
 
* definite: lokativ not in first position + leat
 
  +
  +
  +
Dual verbforms always indicate definite subjects.
   
 
==Case==
 
Line 76: Line 120:
 
* garra.ADJ dálkki.N.GEN geažil.PO[GEN] => på_grunn_av.PR dårlig.ADJ vær.N
 
 
* guokte.NUM biilla.N.SG.GEN => to.NUM biler.N.PL
 
  +
  +
=== Case to Number ===
  +
  +
This is the sme quantifier phrase: ''two.Sg.Nom book.Sg.Gen ==> two book.Pl.Indef''. This also holds for ''two.Sg.Acc book.Sg.Acc'' and, come to think of it, ''two.Sg.Gen book.Sg.Obliquecase''. In the latter case, of course, the oblique case in question will have to be translated to a preposition or similar.
  +
  +
== Preposition choice ==
  +
  +
Inserted prepositions are mainly based on the case of the head noun, but it can be overridden in various ways.
  +
  +
For example, for locatives, the macro set_caseprep in t1x will default to "på", except for proper noun toponyms, which default to "i" (unless they're in the list loc-på), and common nouns in the list loc-i which also get "i". Locatives preceded by nouns in the list loc-fra-head, however, get "fra", while reflexive pronouns get no caseprep in locative.
  +
  +
If a verb is in the list loc-fra-verbs, it'll get the tag &lt;loc-fra&gt;. Down the pipeline, when t2x sees this verb, it'll set the caseprep_verb variable to this tag. This variable is emptied at clause boundaries, but if t2x sees a PR followed by SN before that, the PR SN rule will use caseprep_verb to change the PR (from e.g. "på") to <code><nowiki>^caseprep<PR><loc>{^frå<pr>$}$</nowiki></code>.
   
 
==Agreement==
 
Line 113: Line 169:
   
 
See also [http://openlibrary.org/b/OL2611005M/Modalverb_og_infinitiv_innen_verbalet The Book]
 
 
==Infinite verbforms==
 
 
These are clause reducts, to be expanded to embedded sentences
 
 
===Gerund===
 
...
 
===Actio locative===
 
...
 
===Actio essive===
 
...
 
 
 
==Pre- and post positions==
 
 
===Postposition to preposition===
 
 
 
==Lexical selection==
 
 
   
   
Line 149: Line 185:
 
:: Current solution: letting the lex.sel CG, sme-nob.lex, add a V tag to any N.ESS.@←SPRED followed by PRON.PERS.ILL, and then just make the bidix entry use this added tag. t1x checks for the tl vblex tag, t2x does the movement and adds the conjunction.
 
   
==Clause types==

===Yes-no questions===

Verb-initial yes-no questions are directly translated, with ''go'' removal

When other constituents are added

===Relative clauses===

====sme relative pronoun into nob "som"====

"som" may be deleted when the relative refers to

===Passive clauses===

==Morphology==

==Syntax==

===POS disambiguation===

====vai====

..

==Existential sentences==

* Insert ''det'' in the nob translation

* TODO: Use "pers" or "impers" tag from bidix

  +
==Focus particles==
  +
These are equivalent:
  +
* reaškkihan reaškit+V+IV+VGen+Foc/han
  +
* reaškki han reaškit+V+IV+VGen han+Pcle
  +
Fortunately, the relabel script can turn the first into
  +
* <code>^reaškkihan/reaškit<V><IV><VGen>+han<Pcle>$</code>
  +
and cg-proc puts syntax tags on the first part of multiwords, so after pretransfer we get
  +
* <code>^reaškit<V><IV><VGen><@X>$ ^han<Pcle>$</code>
  +
and we can handle focus particles in bidix even if they are attached to the preceding word.
 
==leat => være / ha==
 
Line 201: Line 231:
 
We handle this as a lexical selection problem in CG.
 
   
==Derivations: general rules and exceptions==
 
Sámi has a lot of derivation rules. Sometimes the derived words have lexicalised translations in Bokmål, like ''ráhkisvuohta→kjærlighet''; these we treat as '''exceptions''', which have to be specified in bidix. Other times we can use a '''general rule''', like ''lohkagohten→begynte.1SG å lese''.
 
 
We have two strategies for handling the rule/exception situation.
 
 
# For the situation where we have many exceptions, we let the analysis be eg. <code>geavaheaddjiid/geavahit<V><TV><Der2><Actor><N><Pl></code> and from here there are two paths
 
## either this specific analysis is in bidix, here translating into <code>bruker<n><m><pl></code>, or
 
## we have to use a transfer rule, in this case translating into <code>de som bruker</code>
 
# For the situation where we have few exceptions, we use <code>dev/xfst2apertium.relabel</code> to split the analysis into two lexical units. Two lexical units can't be specified in bidix, so here
 
## exceptions have to be added to the .lexc file as if they were lexicalised, so they remain one lexical unit
 
## while general transfer rules now match a pattern of two lexical units
 
 
===More detailed: Deverbal nouns===
 
Sámi verbs can turn into nouns. We want to be able to put this explicitly into the bidix (eg. sometimes the nob noun is not even based on the nob verb), but if it's not in bidix we want to be able to fall back on a construction using the verb, so
 
 
* from <code>geavaheaddjiid/geavahit<V><TV><Der2><Actor><N></code>
 
* with fallback <code>=> de som bruker<vblex></code> (or something)
 
* bidix specified <code>=> bruker<n><m></code>
 
 
With the following bidix entries we specify that we want <code>bruker<n><m></code> in the above example:
 
<pre>
 
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
 
<e><p><l>geavahit<s n="V"/><s n="TV"/><s n="Der2"/><s n="Actor"/><s n="N"/></l><r>bruker<s n="n"/><s n="m"/></r></p><par n="__n"/></e>
 
</pre>
 
 
while if the second bidix line isn't there, we get the fallback. Transfer rules can now check
 
<pre>
 
<equal><clip side="tl" part="pos" ...><lit-tag v="N"/></equal>
 
<equal><clip side="sl" part="pos" ...><lit-tag v="V"/></equal>
 
</pre>
 
 
The same specification/fallback might be applied with other Derivations.
 
 
==Derivations==
 
 
Note: We should change the tags below, because it is not possible to use <code>/</code> in tags in apertium. But then we should change the CG too...
 
 
There are also derivations of derivations:
 
<pre>
 
"<geavaheaddjis>"
 
...
 
"geavvat" V* IV* Der1 Der/h V* TV Der2 Actor N Sg Acc PxSg3
 
</pre>
 
For transfer purposes it might be simplest to treat these "flatly" as if they were single derivations (ie. Der1_Der_h_V_TV_Der2).
 
 
{|class=wikitable
 
! Tag !! Type !! Example !! in Bokmål
 
|-
 
|<code>Der/Dimin</code> || Diminutive || mánáš "mánná" N Der1 Der/Dimin N Sg Nom || barn→lite barn
 
|-
 
|<code>Der1 Der/st</code> || Diminutive verb || attestit "addit" V TV Der1 Der/st V Inf || gi→gi litt
 
|-
 
|<code>Der/adda</code> || <code>V→N/PrfPrc/Actio</code> || bassaladdan "bassalit" V* TV Der2 Der/adda || →vaske tøy (bassat=vaske)
 
|-
 
|<code>Der/ahtti</code> || <code>V→V</code>|| vajálduhttit "vajálduvvat" V* IV* Der2 Der/ahtti V TV || →overse/glemme
 
|-
 
|<code>Der/alla</code> || suffix || bázáhallan "bázihit" V* TV Der2 Der/alla V Actio || →
 
|-
 
|<code>Der/amoš</code> || suffix || muitalamoš "muitalit" V TV Der3 Der/amoš N Sg Nom || fortelle→
 
|-
 
|<code>Der/asti</code> || suffix || muitalastit "muitalit" V TV Der2 Der/asti V Inf || fortelle→
 
|-
 
|<code>Der/at</code> || <code>Adj→Adv</code>|| viidát "viiddis" A* Der2 Der/at Adv || vid→vidt
 
|-
 
|<code>Der/d</code> || <code>V→V[refl]</code>|| basadit "bassat" V TV Der1 Der/d V || vaske→vaske seg
 
|-
 
|<code>Der/eaddji</code> || suffix || muitaleaddji "muitalit" V TV Der2 Actor N Sg Nom || fortelle→
 
|-
 
|<code>Der/eamoš</code> || suffix || muitaleamoš "muitalit" V* TV Der3 Der/eamoš || fortelle→
 
|-
 
|<code>Der/eapmi</code> || <code>V→N</code> || deaivvadeapmi "deaivvadit" V IV Der2 Der/eapmi N Sg Nom || møte(V) → møte(N)
 
|-
 
|<code>Der/easti</code> || suffix || muitaleastit "muitalit" V TV Der2 Der/easti V Inf || fortelle →
 
|-
 
|<code>Der/geahtes</code> || suffix || eaiggátkeahtes "eaiggát" N* Der3 Der/geahtes || eier →
 
|-
 
|<code>Der/goahti</code> || <code>V→V</code> [http://en.wikipedia.org/wiki/Inchoative_verb Inchoative] || boradišgohten "boradit" V TV Der3 Der/goahti V Ind Prt Sg1 || spise → jeg begynte å spise
 
|-
 
|<code>Der/h</code> || suffix || geavaheaddji "geavvat" V* IV* Der1 Der/h V* TV Der2 Actor; orrohit "orrot" V* IV Der1 Der/h V || heve seg→ ; bli/synes→
 
|-
 
|<code>Der/halla</code> || <code>V→V[recip]</code> || gulahallat "gullat" V* TV Der1 Der2 Der/halla || høre→forstå hverandre («høre hverandre»?)
 
|-
 
|<code>Der/heapmi</code> || suffix || čađaheapmi "čađđa" N* Der1 Der2 Der/heapmi A ||→
 
|-
 
|<code>Der/huhtti</code> || suffix || muosehuhttit "muoseheapme" A* Der1 Der/huhtti V* TV || urolig→
 
|-
 
|<code>Der/huvva</code> || suffix || čađahuvvo "čađđa" N* Der1 Der2 Der/huvva V IV Imprt Prs ConNegII || →
 
|-
 
|<code>Der/j</code> || suffix || sáddejuvvot "sáddet" V* TV Der1 Der/j V* Der2 Der/PassL V || sende→
 
|-
 
|<code>Der1 Der/l</code> || <code>V→V[subitive]</code> || borralit "borralit" V TV Der1 Der/l V || spise→spise (i hast)
 
|-
 
|<code>Der/l</code> || ???? || ohcalit "ohcat" V* TV Der1 Der/l V || lete→savne/lengte etter
 
|-
 
|<code>Der/las</code> || suffix || lotnolas "lotnut" V* TV Der1 Der2 Der/las A || betale→
 
|-
 
|<code>Der/laš</code> || <code>N→Adj</code> || dábálaš "dáhpi" N Der1 Der/laš A Sg Nom || skikk→vanlig
 
|-
 
|<code>Der/lágan</code> || suffix || earálágan "eará" Pron Indef Sg Gen Der1 Der/lágan A || annen/andre→
 
|-
 
|<code>Der/meahttun</code> || <code>V→Adj[Neg]</code> || jáhkkemeahttun "jáhkkit" V TV Der1 Der/meahttun A Sg Nom || tro/anta→utrolig
 
|-
 
|<code>Der/muš</code> || suffix || ??? "juhkat" V TV Der3 Der/muš N Sg Nom || drikke→
 
|-
 
|<code>Der/n</code> || suffix || oažžun "oažžut" V* TV Der3 Der/n N || få→?
 
|-
 
|<code>Der/st</code> || suffix || várástit "várát" V TV Der1 Der/st V || →
 
|-
 
|<code>Der/stuvva</code> || suffix || fuolastuvvat "fuollat" V* TV Der1 Der2 Der/stuvva V || bry seg om→
 
|-
 
|<code>Der/supmi</code> || suffix || čállosupmi "čállit" V* TV Der2 Der/PassL V* Der3 Der/supmi N || skrive/...→
 
|-
 
|<code>Der/upmi</code> || suffix || mearkkašupmi "mearkkašit" V* TV Der2 Der/PassL V* Der3 Der/upmi || merge seg→
 
|-
 
|<code>Der/viđá</code> || suffix || málestanviđá "málet" V TV Der1 Der/st V Der2 Der/eapmi N SgCmp Der/viđá Adv || male→
 
|-
 
|<code>Der/vuohta</code> || <code>Adj→N</code> || ráhkisvuohta "ráhkis" A Der3 Der/vuohta N Sg Nom || kjær→kjærlighet
 
|-
 
|}
 
   
 
==Transfer==
 
Line 326: Line 237:
   
 
# "verb" -- this chunk has a single verb lexical unit
 
# "verb_part_verb" -- eg. "begynne å lese", three lexical units
+
# "det_adj_nom" -- e.g. "den nye saken", three lexical units
  +
  +
===Transfer name oddities===
  +
Since we can't have strange symbols in XML id's, an "LSUBJ" is a chunk with the syn_label "@SUBJ→", an "ROPRED" has "@←OPRED" etc.
   
 
===What goes where===
 
Line 337: Line 251:
 
** simple periphrastic verb combinations (verb, vaux pp, vaux inf)
 
 
** Insert prepositions based on case
 
  +
** Tag verb chunks with preferred preposition per case
 
* t2x
 
 
** relatives (SN "who" SV -> SN)
 
 
** co-ordination (SN "and" SN -> SN)
 
 
** genitive modifiers (SN SN-Gen " [University of Reykjavik] [big old library]-GEN"
 
  +
** use verb preferred-preposition-tag to alter inserted case-prepositions
 
* t3x
 
 
** move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
 
Line 368: Line 284:
 
* [[/Pending tests|Pending tests]]
 
 
* [[/Regression tests|Regression tests]]
 
  +
* [[Northern Sámi and Norwegian/bidix]] and chunking
  +
* [[Northern Sámi and Norwegian/Compounds]] and output macros
  +
* [[Northern Sámi and Norwegian/Derivations]]
  +
* [[Northern Sámi and Norwegian/anaphora]] "resolution"
  +
* [[Northern Sámi and Norwegian/smemorf]] (about the analyser)
  +
* [[Apertium-sme-nob/stats]]
   
 
==External links==
 
* [http://www.samediggi.no/fil.asp?FilkategoriId=37&back=1&MId1=2383&MId2=2439 Sametingets plenum] pdf's with parallel (sme-nob) gov't text, choose eg. Publikasjoner-Møtebøker-Plenum
 
 
* [http://sami-cgi-bin.uit.no/cgi-bin/smi/smi.cgi?text=m%C3%A1nn%C3%A1%0D%0Am%C3%A1n%C3%A1%0D%0Am%C3%A1nn%C3%A1i%0D%0Am%C3%A1n%C3%A1s%0D%0Am%C3%A1n%C3%A1in%0D%0Am%C3%A1nn%C3%A1n&action=analyze&translate=nob&lang=sme&plang=eng&charset=utf-8 lookup Sámi words in giellatekno hfst online]
 
 
* [http://urn.nb.no/URN:NBN:no-nb_digibok_2007032812002 Samisk Grammatikk] by Klaus Peter Nickel (the whole book, from the Norwegian National Library, Nasjonalbiblioteket)
 
  +
* [http://visl.sdu.dk/visl/smi/parsing/nonautomatic/treebank.php?autoCorp=panola-vsl-nickel some translated example sentences at visl.sdu.dk]
* [http://www.avvir.no/ Avvir.no] Northern Sámi newspaper, for testing
 
  +
* [http://www.uta.fi/~km56049/same/svocab.html ~5000 words sme-eng]
  +
* [http://www.sprakrad.no/Sprakhjelp/Rettskriving_Ordboeker/Soek/Nordisk_moeteordliste/ Nordisk møteordliste]
  +
* [http://www.giella.org/artikkel.aspx?AId=2370&back=1&MId1=1248 Termlister - Giella.org]
  +
* [http://www.translatesolution.no/contents/no/Samisk-Norsk.pdf medical questionnaire form in both sme and nob]
  +
* [http://gtweb.uit.no/jorgal translate Sámi to Norwegian] with Giellatekno's translators (running roughly the latest SVN)
  +
  +
===Web sites with sme / sme-nob text===
  +
* [http://www.avvir.no/ Avvir.no] Northern Sámi newspaper
  +
* [http://infonuorra.no/ Infonuorra.no] (unginfo), Norwegian + Sámi texts, news/blog-like
  +
* [http://kongehuset.no/ Kongehuset.no] '''lots''' of parallel text
  +
* [http://www.samediggi.no/fil.asp?FilkategoriId=37&back=1&MId1=2383&MId2=2439 Sametingets plenum] pdf's with parallel (sme-nob) gov't text, choose eg. Publikasjoner-Møtebøker-Plenum
   
 
[[Category:Northern Sámi and Norwegian|*]]
 
  +
[[Category:Language pairs]]

Latest revision as of 19:42, 17 April 2018

This is a language pair translating from Northern Sámi to Norwegian Bokmål.


Giellatekno has a website, Jorgal, that offers machine translation from Sámi to Norwegian, where you can try the latest version, with extra features such as translation of web pages and spell-checking of the translation.


There is a paper about it, see Publications#2012: "Evaluating North Sámi to Norwegian assimilation RBMT".

Technical background

This pair uses HFST for the Sámi morphological analysis, CG for rule-based disambiguation and lexical selection, and lttoolbox/apertium modules for the rest. Transfer is four-stage chunking. The pipeline looks like this:

hfst-proc (sme analysis) | cg-proc (mor.dis) | cg-proc (syn.dis) |\
pretransfer | lt-proc (lexical transfer) | cg-proc (lexical selection) |\
transfer (chunking) | interchunk1 | interchunk2 | postchunk | lt-proc (nob generation)

If you want to see output of intermediate stages, use one of the following commands:

apertium -d . sme-nob-morph          # until analysis
apertium -d . sme-nob-tagger         # until CG morphological and syntactic disambiguation
apertium -d . sme-nob-disam          # until disambiguation with vislcg3 --trace on (but doesn't handle compounds properly)
apertium -d . sme-nob-pretransfer    # until pretransfer
apertium -d . sme-nob-biltrans       # until lexical transfer
apertium -d . sme-nob-lex            # until CG lexical selection
apertium -d . sme-nob-chunker        # until first transfer stage: chunking, verb/case→preposition etc.
apertium -d . sme-nob-interchunk1    # until second transfer stage: postposition→preposition, anaphora resolution
apertium -d . sme-nob-interchunk2    # until third transfer stage: word order changes, pro-drop insertion
apertium -d . sme-nob-postchunk      # until fourth transfer stage: infinitive marker, determiner insertion, tag cleanup
apertium -d . sme-nob-dgen           # until generation, with debug info
apertium -d . sme-nob                # until generation

Northern Sámi and Norwegian/Commands has more on what they do.
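
The debug modes above can also be stepped through from a script. The following is only a minimal sketch, assuming Python 3.7+, an installed apertium-sme-nob and `apertium` on the PATH; the helper names `mode_command` and `run_all_stages` are made up for illustration, and the `-disam` and `-dgen` modes are left out to keep the list linear.

```python
import subprocess

# Mode names copied from the list above, in pipeline order.
MODES = [
    "sme-nob-morph",
    "sme-nob-tagger",
    "sme-nob-pretransfer",
    "sme-nob-biltrans",
    "sme-nob-lex",
    "sme-nob-chunker",
    "sme-nob-interchunk1",
    "sme-nob-interchunk2",
    "sme-nob-postchunk",
    "sme-nob",
]

def mode_command(mode, pair_dir="."):
    """Build the argv for one mode; pair_dir is the language pair directory."""
    return ["apertium", "-d", pair_dir, mode]

def run_all_stages(text, pair_dir="."):
    """Yield (mode, output) per stage; needs a compiled apertium-sme-nob."""
    for mode in MODES:
        result = subprocess.run(mode_command(mode, pair_dir), input=text,
                                capture_output=True, text=True, check=True)
        yield mode, result.stdout
```

Printing each pair from `run_all_stages("Mánná oađđá")` would show where in the pipeline a given translation goes wrong.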

To get an overview of transfer rules: $ grep '<rule' *.t[1-4]x

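More introductory documentation in:

  • the comments at the beginning of each tNx-file
  • the README

and more in-depth documentation here:

  • Northern Sámi and Norwegian/bidix and chunking
  • Northern Sámi and Norwegian/Compounds and output macros
  • Northern Sámi and Norwegian/Derivations
  • Northern Sámi and Norwegian/anaphora "resolution"
  • Northern Sámi and Norwegian/smemorf (about the HFST/lexc analyser, from giellatekno)
  • Northern Sámi and Norwegian/CG (about the disambiguator and lexical selection)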

See also the Northern Sámi and Norwegian/release TODO's and Northern Sámi and Norwegian/Installation.


misc todo's

  • Try and use prpers for all nob pronouns and possessive determiners, would be a lot cleaner (see clean_pron in t4x)
    • then the nob det.pos need person/number tags
  • Headline language: Heahpat hállat go gillá => Skam å snakke når man lider(?)
  • When bidix lookup is moved out of transfer: match on tl in chunker instead of matching on sl and having a big choose-when for each tl possibility.
  • Makkár oainnu oažžu son guhte čohkke buot dieđuid du birra neahtas? has no Qst tag, how do we tell t3x that the @OBJ→ can stay before the verb? (if interpreted as a regular focused object, it gets switched with the subject)
  • ja makkár áššiid don it muital. shouldn't switch Neg and -FMAINV, t3x rule has to match @OBJ→ @SUBJ→ Neg IV


  • riddoguovlluid ássiin lei geatnegasvuohta lágidit gonagassii fatnasiid currently becomes "på de som bor med kysttraktene det var en plikt drive til kongen båter" -- would it be better to go for "hadde" here?


See also the /smemorf todo-list.

Definiteness in nob

Some contexts are relatively safe:

  • Attributive superlatives are definite, and have indefinite nouns: det.poss adj.sup n => det.poss adj.sup.def n
    1. min viktigste.def oppgave.ind
  • Predicative superlatives are almost always indefinite
      1. min oppgave.ind/oppgaven.def min er viktigst.ind
  • ...unless they have a definite determiner:
    1. min oppgave.ind er den viktigste.def

Others we have to guess.

Features we might be able to use: subject/object, theme/focus, prepositions?

  • Du dálkasis sáhtii leamaš ávki => Din(det.poss) medisin(ind) kan ha vært til nytte
  • Mánná oađđá => Barnet(def) sover (but is this ambiguous?)
  • Son lea čeahpes bárdni => Han er en(art) flink(ind) gutt(ind)
  • Dá livččii skeaŋka din čeahpes bárdnai => Her er en gave jeg kunne ønske å gi den(art) flinke(def) sønnen(def) deres(det.poss)

Sámi collective nouns are marked Coll, but there's no collective nor mass noun marking in nob.dix, so I guess that's not much help.

but this list is GPL :-)

Some more rules:

  • indefinite:
    • lokativ/(illativ) in first position + leat
    • habitiv tag
    • advl
  • definite: lokativ not in first position + leat


Dual verbforms always indicate definite subjects.

Case

Case to preposition

This is for adverbial cases

Essive nouns (mánnán=>som barn) are ambiguous between sg and pl; can we just choose sg.ind all the time?

Case to object

Accusative objects are just translated. The issue here is definiteness.

Case to possessor phrase

  • Gen N => N-Def til Possessor; but for bokmål, Gen's N would be simpler and fine in most cases
    • for both expressions, definiteness is more or less trivial

Postposition/number case choice

We can remove genitive case which is due to a postposition or after a number (or turn it into accusative for a pronoun).

  • garra.ADJ dálkki.N.GEN geažil.PO[GEN] => på_grunn_av.PR dårlig.ADJ vær.N
  • guokte.NUM biilla.N.SG.GEN => to.NUM biler.N.PL

Case to Number

This is the sme quantifier phrase: two.Sg.Nom book.Sg.Gen ==> two book.Pl.Indef. This also holds for two.Sg.Acc book.Sg.Acc and, come to think of it, two.Sg.Gen book.Sg.Obliquecase. In the latter case, of course, the oblique case in question will have to be translated to a preposition or similar.
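
As a rough illustration of the quantifier rule, here is a minimal sketch in Python. The tag lists are a simplified, hypothetical representation (not the actual transfer format), the function name is made up, and the oblique-case variant that needs a preposition is deliberately left untouched.

```python
def quantifier_to_plural(num_tags, noun_tags):
    """Return target-side noun tags for a sme numeral + noun pair.

    Tags are plain strings in a simplified scheme with the case last,
    e.g. ["Sg", "Nom"] for the numeral and ["Sg", "Gen"] for the noun.
    """
    # (numeral case, noun case) pairs where the singular is due to the numeral
    governed = {("Nom", "Gen"), ("Acc", "Acc")}
    num_case = num_tags[-1]
    noun_case = noun_tags[-1]
    if "Sg" in noun_tags and (num_case, noun_case) in governed:
        # nob wants a plural indefinite noun after the numeral
        return ["Pl", "Indef"]
    # anything else (e.g. oblique cases) is left for other rules
    return list(noun_tags)
```

For example, `quantifier_to_plural(["Sg", "Nom"], ["Sg", "Gen"])` gives `["Pl", "Indef"]`, matching guokte biilla => to biler.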

Preposition choice

Inserted prepositions are mainly based on the case of the head noun, but it can be overridden in various ways.

For example, for locatives, the macro set_caseprep in t1x will default to "på", except for proper noun toponyms, which default to "i" (unless they're in the list loc-på), and common nouns in the list loc-i which also get "i". Locatives preceded by nouns in the list loc-fra-head, however, get "fra", while reflexive pronouns get no caseprep in locative.
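
The decision order described for locatives can be sketched roughly as below. This is Python pseudo-logic, not the t1x macro itself: the list contents are placeholders, the function name and arguments are hypothetical, and the precedence between the checks is an assumption, since the text does not spell it out.

```python
# Placeholder stand-ins for the t1x lists; the real entries live in the t1x file.
LOC_PAA = {"toponym-taking-på"}
LOC_I = {"noun-taking-i"}
LOC_FRA_HEAD = {"head-noun-taking-fra"}

def locative_caseprep(lemma, is_toponym=False, is_reflexive=False, prev_noun=None):
    """Pick the preposition for a locative noun phrase, or None for no caseprep."""
    if is_reflexive:
        return None                      # reflexive pronouns get no caseprep
    if prev_noun in LOC_FRA_HEAD:
        return "fra"                     # preceding noun forces "fra"
    if is_toponym:
        # proper noun toponyms default to "i" unless listed in loc-på
        return "på" if lemma in LOC_PAA else "i"
    if lemma in LOC_I:
        return "i"                       # listed common nouns take "i"
    return "på"                          # overall default
```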

If a verb is in the list loc-fra-verbs, it'll get the tag <loc-fra>. Down the pipeline, when t2x sees this verb, it'll set the caseprep_verb variable to this tag. This variable is emptied at clause boundaries, but if t2x sees a PR followed by SN before that, the PR SN rule will use caseprep_verb to change the PR (from e.g. "på") to ^caseprep<PR><loc>{^frå<pr>$}$.

Agreement

Subject-verb agreement to be removed.

Subject insertion from pro-drop

Pro-drop sentences should have subjects inserted, observing the nob V2 rule:

  • Topicalised sentences
    • X + V => X + V + subjpron
    • X + Neg + V => X + V + subjpron + ikke
  • Verb-initial sentences
    • V => subjpron + V
    • Neg + V => subjpron + V + ikke


We could do this by changing a variable in the movement interchunk stage based on whether the pattern matches a subject or not.
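
The four patterns above can be written out as a small function. This is only a sketch in Python over abstract chunk labels ("X", "Neg", "V"), not an interchunk rule; the function name is made up, and it assumes the input has no overt subject.

```python
def insert_subject(chunks):
    """Reorder a subjectless sme clause into nob order with a subject pronoun.

    chunks is a list of labels from {"X", "Neg", "V"} in sme order.
    """
    negated = "Neg" in chunks
    topicalised = chunks[0] == "X"
    if topicalised:
        # V2: the finite verb stays in second position, the subject follows it
        out = ["X", "V", "subjpron"]
    else:
        out = ["subjpron", "V"]
    if negated:
        # the sme negation verb becomes the nob adverb "ikke"
        out.append("ikke")
    return out
```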

Negation

Negation is a verb in sme, an adverbial in nob.

  • Subj + Neg + ConNeg => Subj + Prs + ikke
  • Subj + Neg + PrfPrtc => Subj + Prt + ikke
  • Neg + Subj + ConNeg => Subj + Prs + ikke
  • Neg + Subj + PrfPrtc => Subj + Prt + ikke
  • X + Neg (+ Subj) + ConNeg => X + Prs + Subj + ikke
  • X + Neg (+ Subj) + PrfPrtc => X + Prt + Subj + ikke
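
The table above follows one pattern: the main verb form decides the nob tense, and a topicalised X triggers V2. A minimal Python sketch under those assumptions follows; the labels are abstract, the function name is hypothetical, and inserting a dropped subject is assumed to be a separate step.

```python
# sme main verb form -> nob tense of the finite verb
TENSE = {"ConNeg": "Prs", "PrfPrtc": "Prt"}

def transfer_negation(chunks):
    """Map a negated sme clause (labels in sme order) to nob order."""
    main = next(c for c in chunks if c in TENSE)
    finite = TENSE[main]
    subj = ["Subj"] if "Subj" in chunks else []
    if chunks[0] == "X":
        # topicalised: X, finite verb, then subject, then "ikke"
        return ["X", finite] + subj + ["ikke"]
    # otherwise the subject comes first regardless of the sme Neg/Subj order
    return subj + [finite, "ikke"]
```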

Other verb=>adverb

veadjit => orke, greie (gal veadja leat => det er kanskje)
soaitit/dáidit => kanskje

If this verb is followed by an infinitive, the word itself is translated as "kanskje", and person/number/tense go to the infinitive that follows it.

See also The Book


Essive SPRED => V

There is currently no way of checking whether bidix has a translation for this and that word in transfer; eg. for

Itgo    boađáše munnje  veahkkin? 
ikke.du komme   meg.ILL hjelp.N.ESS.@←SPRED
`Kommer du ikke og hjelper(V) meg?'

we might want to translate "V PRON.ILL N.ESS.@←SPRED" or something like that into "V og V PRON.ACC" but only if the first N has a corresponding verb (hjelp <=> hjelpe), otherwise we might want to stick with the more literal "til meg som N".

Could we simply do this by adding bidix entries for "PRON.ILL N.ESS.@←SPRED" into "V PRON.ACC"? Or would that overgenerate?

No we can't, @←SPRED is not a valid sdef due to @ and ←. We _could_ put "PRON.ILL N.ESS" in bidix though, with even more chance of overgenerating…
Also, we'd first have to turn it into one lexical unit (^mun<Pron><Pers><Sg1><Ill><@ADVL>$ ^veahkki<N><Ess><@←SPRED>$ => ^mun# veahkki<Pron><Pers><Sg1><Ill><@ADVL><N><Ess><@←SPRED>$ or something).
Current solution: letting the lex.sel CG, sme-nob.lex, add a V tag to any N.ESS.@←SPRED followed by PRON.PERS.ILL, and then just make the bidix entry use this added tag. t1x checks for the tl vblex tag, t2x does the movement and adds the conjunction.


Focus particles

These are equivalent:

* reaškkihan      reaškit+V+IV+VGen+Foc/han
* reaškki han     reaškit+V+IV+VGen     han+Pcle


Fortunately, the relabel script can turn the first into

* ^reaškkihan/reaškit<V><IV><VGen>+han<Pcle>$

and cg-proc puts syntax tags on the first part of multiwords, so after pretransfer we get

* ^reaškit<V><IV><VGen><@X>$ ^han<Pcle>$

and we can handle focus particles in bidix even if they are attached to the preceding word.
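
To make the splitting step concrete, here is a small Python sketch of what happens to a joined unit once it has its syntax tag. It only imitates the effect described above on this one kind of input; the real pretransfer handles more cases, and the function name is made up.

```python
import re

def split_joined_unit(lu):
    """Split '^lemma<tags>+lemma<tags>$' into separate '^...$' units."""
    body = lu[1:-1]  # strip the ^ and $ delimiters
    # split only on a '+' that directly follows a closing tag bracket
    parts = re.split(r"(?<=>)\+", body)
    return ["^" + part + "$" for part in parts]

units = split_joined_unit("^reaškit<V><IV><VGen><@X>+han<Pcle>$")
# units == ["^reaškit<V><IV><VGen><@X>$", "^han<Pcle>$"]
```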

leat => være / ha

leat may translate into either være or ha, and a wrong choice gives a very odd translation.

  • Mánát leat boahtán skuvlii => Barnene har kommet til skolen
    • verb afterwards: har (well in this case "er" works, movement verb, but in general)
  • Dat lea sihke buorre ja heittot => Det er både bra og dårlig
  • Norga.no deháleamos doaibma lea ofelastit geavaheaddjiid almmolaš bálvalusaide => Norge.no's viktigste oppgave er å veivise brukere til offentlige tjenester
    • å afterwards: er
  • Mus lea oahpahus gaskkal guovtti ja njealji => Jeg har undervisning mellom to og fire
    • "from.me is teaching between two and four"
  • Mus lea biepmu => Jeg har mat
    • "from.me is food"
    • "Mus" is <Loc><@HAB> in both these;
  • Ii mus leat bahá vuoigŋa => Jeg er ikke besatt
    • "not.3SG from.me is.CONNEG angry spirit"
    • counterexample to the two above...because of negation? or the adjective?
  • Mun lean buorre => Jeg er god
  • Son lea čeahpes bárdni => Han er en flink gutt
    • "Mun", "Son" are not <Loc><@HAB> ...
  • Mus lea gažaldat didjiide => Jeg har et spørsmål til dere


We handle this as a lexical selection problem in CG.


Transfer

Chunk naming scheme

Since t4x (postchunk) cannot tell how many and what kind of lexical units there are in each chunk, we use the chunk names to signal this. Some examples of chunk names:

  1. "verb" -- this chunk has a single verb lexical unit
  2. "det_adj_nom" -- e.g. "den nye saken", three lexical units

Transfer name oddities

Since we can't have strange symbols in XML id's, an "LSUBJ" is a chunk with the syn_label "@SUBJ→", an "ROPRED" has "@←OPRED" etc.
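
The naming convention can be expressed as a tiny mapping. The Python sketch below covers the two examples given; the function name is hypothetical, and returning arrowless labels with only the "@" stripped is an assumption.

```python
def chunk_id(syn_label):
    """Turn a CG syntax label into an XML-safe name, e.g. '@SUBJ→' -> 'LSUBJ'."""
    name = syn_label.lstrip("@")
    if name.endswith("→"):
        # head is to the right, so this constituent stands to the left: L prefix
        return "L" + name[:-1]
    if name.startswith("←"):
        # head is to the left, so this constituent stands to the right: R prefix
        return "R" + name[1:]
    return name
```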

What goes where

The current plan:

  • t1x
    • (de-)compounding,
    • derivation,
    • simple noun phrases (heads and their simple modifiers/specifiers: adj nom, adj adj nom, det adj adj nom, num adj nom),
    • simple periphrastic verb combinations (verb, vaux pp, vaux inf)
    • Insert prepositions based on case
    • Tag verb chunks with preferred preposition per case
  • t2x
    • relatives (SN "who" SV -> SN)
    • co-ordination (SN "and" SN -> SN)
    • genitive modifiers (SN SN-Gen) "[University of Reykjavik] [big old library]-GEN"
    • use verb preferred-preposition-tag to alter inserted case-prepositions
  • t3x
    • move postpositions (SN ADPOS -> ADPOS SN) "[1 big house which is on the hill] [2 in]"
      • remove prepositions when case is governed by adpositions
    • V2
    • Insert dropped pronouns
  • t4x
    • Insert articles
    • Cleanup


NP's

Definiteness is set on the noun phrase chunk in t1x, but might change during t2x or t3x. Gender and number, however, never change based on larger contexts.

Example: t1x might give ^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$, before t4x is run, the place with the 5 (placeholder) is given the value ind. Thus we can change definiteness in t2x and t3x; gender and number tags however are on the actual word. The same goes for adjectives (and determiners) in the NP chunk, so gender is applied to the adjective/determiner in t1x, while we have the placeholder for definiteness until t4x.
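
The placeholder mechanism in the example can be imitated in a few lines of Python. This sketch only shows the substitution of numbered placeholders by the chunk's own tags; it does not unwrap the chunk as the real postchunk stage does, and the function name is made up.

```python
import re

def resolve_placeholders(chunk):
    """Replace <N> inside the braces with the N-th tag of the chunk itself."""
    head, _, rest = chunk.partition("{")
    chunk_tags = re.findall(r"<([^<>]+)>", head)

    def fill(match):
        index = int(match.group(1))          # placeholders count from 1
        return "<" + chunk_tags[index - 1] + ">"

    return head + "{" + re.sub(r"<(\d+)>", fill, rest)

chunk = "^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><5>$}$"
resolved = resolve_placeholders(chunk)
# resolved == "^nom<SN><@ADVL><m><pl><ind>{^stein<n><m><pl><ind>$}$"
```

Changing the fifth chunk tag to def before this step would give the word def instead, which is the point of keeping the placeholder until t4x.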

We do some cleanup on adjectives in t4x to make sure they match the nob.dix format, eg. positive indefinites have gender tags while definites don't (this has to happen in t4x since we have to know if they end up being definite or what).

Exception: some contexts are certain, eg.:

  • det.dem (adj.def) n.def
  • det.qnt (adj.def) n.ind
  • det.pos (adj.def) n.ind

For these, if we don't add the definiteness tag to the chunk but instead apply definiteness directly to the words, it can't be changed during t2x/t3x (<let><clip pos="N" part="art"/><lit-tag v="foo"/></let> has no effect when the "art" attribute is empty), and we don't apply anything from the chunk during t4x if the "art" attribute is empty.

See also

External links

Web sites with sme / sme-nob text