North Saami and Finnish

This page is for discussing the Northern Sámi and Finnish translator (apertium-fin-sme). Some pending things to think about:

General todo list

For old list items see completed tasks
  • How are compounds dealt with in Omorfi and in the GTSVN analysers? Do they always split in the same places? If not, we probably have to add those that don't as lexicalised entries in the transducers.
    • Compounds in sme and fin are similar, and we should aim to translate dynamic compounds.
  • Another possible source for paired words and sentences: Open-tran. Contains translation strings for linux software with GUIs, allows searching in any language pair, and contains Finnish and Northern Sámi. Ryan can contact them if it seems like their data would be of use.
  • Lex choice build xsl script: Add colon number on the Finnish side.
  • Find frequent multiwords, perhaps taking advantage of mwetoolkit. Are there any existing multiword resources for Finnish?

Comparisons of Northern Sámi and Finnish

Noun phrases

Both Northern Sámi and Finnish order noun suffixes in this way:

NOUN-Pl-Case-Possessive-CliticParticles

Possessive markers are much less common in Northern Sámi, but the morphological analyzers will handle them.

Constituent order within noun phrases is similar:

Det Num Adj+ Noun

Where Det can be either a demonstrative pronoun or a pronoun denoting possession (i.e., a personal pronoun in the genitive).

Cases

Northern Sámi has 7 cases: nominative, accusative, genitive, locative, illative, comitative, essive.

  • Accusative and Genitive are often syncretic, except for some numerals and some pronouns.
  • Essive has the same form in singular and plural. Comitative singular coincides with locative plural (e.g. beatnagiin).

Finnish has 15 cases (and several additional case-like suffixes only applied to adverbials). This is a lot, so to avoid a string of opaque Latinate terms, here are the significant facts:

  • Structural cases: 4. nominative, partitive, accusative, genitive
  • Locative cases: 6. An internal and external set (3 cases each) that show goal, location, and source.
  • Stative cases: 2. state, goal state; rarely a third - source state
  • Additional: 2 instructive/instrumental cases (with, without), 1 comitative case (plural only)

Where Finnish distinguishes internality and externality with locative and stative cases, there is no such distinction in Northern Sámi. Northern Sámi uses locative for source and location, and illative for goal. Thus, cases can roughly be transferred this way:

  • (fin) Internal Source, Internal Location, External Source, External Location → Locative
  • (fin) Internal Goal, External Goal → Illative
  • (fin) Partitive, Accusative, Genitive → AccGen

Of course, the last set, which ends in AccGen, will have to be split into accusative and genitive for certain numerals and pronouns.
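
The rough mapping above can be sketched as a lookup table. This is only an illustration: the tag names used here (Ine, Ela, Ade, Abl, All, Par, Tra and the merged AccGen label) are assumptions made for the example and may not match the tags actually produced by the Finnish and Sámi analysers.

```python
# Rough Finnish -> Northern Sámi case mapping, following the list above.
# Tag names are illustrative assumptions, not the analysers' real tagsets.
FIN_TO_SME_CASE = {
    "Ela": "Loc",     # internal source
    "Ine": "Loc",     # internal location
    "Abl": "Loc",     # external source
    "Ade": "Loc",     # external location
    "Ill": "Ill",     # internal goal
    "All": "Ill",     # external goal
    "Par": "AccGen",
    "Acc": "AccGen",
    "Gen": "AccGen",
}


def map_case(fin_case):
    """Return the rough sme case for a fin case, or None if the list gives no default."""
    return FIN_TO_SME_CASE.get(fin_case)


print(map_case("Ade"))  # Loc
print(map_case("Tra"))  # None (not covered by the list above)
```

Cases that return None, and the AccGen group for numerals and pronouns, would still need context-dependent handling elsewhere.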

Case agreement

Most adjectives just have a predicative and attributive form, but some do agree in number with the subject.

Váralaš > váralaččat

  • (fin) Muovipussit ovat vaarallisia. → Plásttetseahkat leat váralaččat.

earálágán > earálágánat

  • (fin) Ihmiset ovat erilaisia. → Olbmot leat earálágánat.

Pronouns and demonstratives within DPs also agree with their head nouns, although there is some syncretism when they are attributes. In the plural, however, the illative and locative forms are not syncretic and agree with a plural head noun.

Case  Independent  As Attribute   Head Noun           Attr Pl            Head Noun Pl
Nom   mii          mii (Nom)      beana (Nom)         mat (Nom Pl)       beatnagat (Nom Pl)
Gen   man          man (Gen/Acc)  beatnaga (Gen/Acc)  maid (Gen/Acc Pl)  beatnagiid (Gen/Acc Pl)
Acc   man          man (Gen/Acc)  beatnaga (Gen/Acc)  maid (Gen/Acc Pl)  beatnagiid (Gen/Acc Pl)
Ill   masa         man (Gen/Acc)  beatnagii (Ill)     maidda (Ill Pl)    beatnagiidda (Ill Pl)
Loc   mas          man (Gen/Acc)  beatnagis (Loc)     main (Loc Pl)      beatnagiin (Loc Pl)
Com   mainna       mainna (Com)   beatnagiin (Com)    maiguin (Com Pl)   beatnagiiguin (Com Pl)
Ess   manin        manin (Ess)    beanan (Ess)        manin (Ess)        beanan (Ess)

This pattern holds for other demonstratives and for numerals, except that numerals do not have the same Gen/Acc syncretism: a numeral may show separate marking for genitive and accusative even though the head noun shows syncretic Gen/Acc forms.

Case  Independent  As Attribute       Head Noun           Attr Pl               Head Noun Pl
Nom   guokte       guokte (Nom)       gápmaga (Gen/Acc)   guovttit (Nom Pl)     gápmagat (Nom Pl)
Gen   man          guovtti (Gen)      gápmaga (Gen/Acc)   guvttiid (Gen/Acc)    gápmagiid (Gen/Acc)
Acc   man          guokte (Acc)       gápmaga (Gen/Acc)   guvttiid (Gen/Acc)    gápmagiid (Gen/Acc)
Ill   masa         guovtti (Gen/Acc)  gápmagii (Ill)      guvttiide (Ill Pl)    gápmagiidda (Ill Pl)
Loc   mas          guovtti (Gen/Acc)  gápmagis (Loc)      guvttiin (Loc Pl)     gápmagiin (Loc Pl)
Com   mainna       guvttiin (Com)     gápmagiin (Com)     guvttiiguin (Com Pl)  gápmagiiguin (Com Pl)
Ess   manin        guoktin (Ess)      gáman (Ess)         guoktin (Ess)         gáman (Ess)

This pattern is not exactly the same as with the numeral 'one', which is syncretic in Gen/Acc.
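
The "As Attribute" and "Attr Pl" columns of the two tables above can be read as a small function from the case of the phrase to the case label on the attribute. The sketch below only transcribes those columns; the function name and its parameters are hypothetical, and it says nothing about the case of the head noun itself (which, for example, is Gen/Acc after a nominative numeral).

```python
# Case label on an attribute with a singular head noun, read off the tables above.
# Keys: case of the whole phrase. Values: label shown in the "As Attribute" column.
PRONOUN_ATTR_SG = {  # e.g. mii + beana
    "Nom": "Nom", "Gen": "Gen/Acc", "Acc": "Gen/Acc",
    "Ill": "Gen/Acc", "Loc": "Gen/Acc", "Com": "Com", "Ess": "Ess",
}
NUMERAL_ATTR_SG = {  # e.g. guokte + gápmaga
    "Nom": "Nom", "Gen": "Gen", "Acc": "Acc",
    "Ill": "Gen/Acc", "Loc": "Gen/Acc", "Com": "Com", "Ess": "Ess",
}


def attribute_case(phrase_case, plural=False, numeral=False):
    """Return the case label expected on the attribute (hypothetical helper)."""
    if phrase_case == "Ess":
        return "Ess"  # the Ess rows are identical in both number columns
    if plural:
        # In the plural the attribute agrees with the head noun.
        if phrase_case in ("Gen", "Acc"):
            return "Gen/Acc Pl"
        return phrase_case + " Pl"
    table = NUMERAL_ATTR_SG if numeral else PRONOUN_ATTR_SG
    return table[phrase_case]


print(attribute_case("Loc"))               # Gen/Acc
print(attribute_case("Loc", plural=True))  # Loc Pl
print(attribute_case("Acc", numeral=True)) # Acc
```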

Some other attributes such as goappašat and guktot 'both (pl)' have separate patterns: syncretic gen/acc/ill (sometimes ill agreement is okay), agreement with locative, but optional agreement with comitative.

Case    Goappašat case          Noun case
Nom Pl  AGR                     AGR
Gen Pl  AGR                     AGR
Acc Pl  AGR                     AGR
Ill Pl  Gen/Acc Pl OR Ill Pl    Ill Pl
Loc Pl  AGR                     AGR
Com Pl  Gen/Acc Pl OR Com Pl    Com Pl

Case    Guktot case             Noun case
Nom Pl  AGR                     AGR
Gen Pl  AGR                     AGR
Acc Pl  AGR                     AGR
Ill Pl  Gen/Acc Pl              Ill Pl
Loc Pl  AGR                     AGR
Com Pl  Gen/Acc Pl              Com Pl
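
As a sketch of how the two tables above might be encoded, the following lists only the rows that deviate from plain agreement and treats every AGR row as "same case as the noun". The dictionary layout and function name are hypothetical.

```python
# Rows from the tables above where the attribute does not simply agree (AGR).
# Each value lists the case labels allowed on the attribute.
NON_AGREEING = {
    "goappašat": {
        "Ill Pl": ["Gen/Acc Pl", "Ill Pl"],
        "Com Pl": ["Gen/Acc Pl", "Com Pl"],
    },
    "guktot": {
        "Ill Pl": ["Gen/Acc Pl"],
        "Com Pl": ["Gen/Acc Pl"],
    },
}


def allowed_attribute_cases(word, noun_case):
    """Return the case labels the attribute may carry for a given noun case."""
    # AGR rows: the attribute carries the same case as the noun.
    return NON_AGREEING[word].get(noun_case, [noun_case])


print(allowed_attribute_cases("goappašat", "Ill Pl"))  # ['Gen/Acc Pl', 'Ill Pl']
print(allowed_attribute_cases("guktot", "Loc Pl"))     # ['Loc Pl']
```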

Adjectives

Adjectives in Northern Sámi can have two separate forms depending on whether they are attributive or predicative. The attributive adjectives mostly do not agree in number with the head noun, but predicative adjectives do. Attributive adjectives do not agree in case with the head noun.

In Finnish, adjectives always agree in number and case with the head noun, and agree in number when they occur in predicates (although there is some variation as to whether or not the predicative adjective is partitive plural or nominative plural).

Derivation

Tag          Type      Example        Analysis                         In North Sámi           Gloss
Der/inen     N→Adj     "muovinen"     muovi+N+Der/inen+Pos+Sg+Nom      plastihkas ráhkaduvvon  plastihkka+n.loc build+v.pass.pp
Der/ja       V→N       "kirjoja"      kirjoa+V+Der/ja+Sg+Nom                                   kirjoa-ja = write-er (writer)?
Der/lainen   N→Adj     "saamelainen"  saame+N+Der/lainen+Pos+Sg+Nom    sápmelaš                -laš
Der/llinen   N→Adj     "kirjallinen"  kirja+N+Der/llinen+Sg+Nom                                kirja-llinen = book-ish (literary)?
Der/minen                                                                                      marks deverbal nouns?
Der/oi
Der/sti      Adj→Adv                                                                           derives an adverb from an adjective? -ly
Der/tar
Der/ton      N→Adj     "rahaton"      raha+N+Der/ton+Sg+Nom            ruđaheapme              ruht + -heapme
Der/tse
Der/ttain
Der/u
Der/vs

There are some cases where both a derived and a lexicalised entry might be in one analyser, but only one or the other in the other analyser. For example:

saamelainen	[LEMMA='saamelainen'][POS=ADJECTIVE][KTN=38][CMP=POS][NUM=SG][CASE=NOM]
saamelainen	[LEMMA='saame'][POS=NOUN][KTN=8][GUESS=DERIVE][DRV=LAINEN][CMP=POS][NUM=SG][CASE=NOM]
saamelainen	[LEMMA='saame'][POS=NOUN][KTN=8][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='lainen'][POS=NOUN][KTN=38][NUM=SG][CASE=NOM]

saamelainen	saamelainen+A+Pos+Sg+Nom
saamelainen	saame+N+Der/lainen+Pos+Sg+Nom
saamelainen	saame+N+Sg+Nom#lainen+N+Sg+Nom

versus:

sápmelaš	sápmelaš+A+Sg+Nom
sápmelaš	sápmelaš+A+Attr

How to deal with this will be one of the main challenges. For example, do we add more entries, or do we remove entries? Is there a way to do either of those automatically?

The sme analyser gives only the lexicalised analysis because a postprocessor, the Perl script lookup2cg, selects the lexicalised one. When run through the same script, the fin output is compatible:

$ echo saamelainen | ufin | lookup2cg
"<saamelainen>"
	 "saamelainen" A Pos Sg Nom

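The idea of preferring the lexicalised reading can be sketched as follows. This is not how lookup2cg is implemented; it is a minimal illustration that assumes "lexicalised" can be approximated as "fewest Der/ tags and compound boundaries (#)" in the analysis string. Both function names are hypothetical.

```python
def complexity(analysis):
    """Count Der/ tags and compound boundaries (#) in an analysis string."""
    tags = analysis.split("+")
    return sum(t.startswith("Der/") for t in tags) + analysis.count("#")


def prefer_lexicalised(analyses):
    """Keep only the analyses with the fewest derivations and compound parts."""
    best = min(complexity(a) for a in analyses)
    return [a for a in analyses if complexity(a) == best]


fin = [
    "saamelainen+A+Pos+Sg+Nom",
    "saame+N+Der/lainen+Pos+Sg+Nom",
    "saame+N+Sg+Nom#lainen+N+Sg+Nom",
]
print(prefer_lexicalised(fin))  # ['saamelainen+A+Pos+Sg+Nom']
```

A filter of this kind only helps when the lexicalised entry exists in the analyser. Where it does not, the derived or compound analysis is all that remains.
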
Files

Source files

File                          Description                              Notes
apertium-sme-fin.sme-fin.dix  Transfer lexicon / Bilingual dictionary
apertium-sme-fin.sme.twol     Morphophonology for Sámi                 This file is copied as is from Giellatekno SVN. No changes should be made to the local version.
apertium-sme-fin.fin-sme.rlx  Constraint Grammar for Finnish
apertium-sme-fin.fin-sme.t1x  Chunker file for Finnish→Northern Sámi

Compiled and binary files

File                 Description                              Notes
fin-sme.prob         Tagger HMM probability file              This file needs to be trained when the CG is fully converted.
fin-sme.rlx.bin      Compiled Constraint Grammar for Finnish
fin-sme.autobil.bin  Compiled transfer lexicon
fin-sme.t1x.bin      Compiled transfer rules                  These are first-stage transfer rules, mostly for chunking and local reordering.

Transfer

Some thoughts on when to just adjust tagsets, transfer lexically (in the bilingual dictionary) or with transfer rules.

  1. If there is a 1:1 correspondence between tags that mean the same thing in all contexts, then the tagset should be changed. For example [NEG=CON], previously <NegCon>, should be just relabelled as <ConNeg> (the tag as it appears in the Sámi analysers).
  2. If there is a need to insert or remove tags based on other tags, because of a difference in what is tagged then this can be done in the bilingual dictionary. For example in Finnish, adjectives in the positive comparison are tagged with <A><Pos>, where in Sámi the positive is unmarked <A>. In this case, a paradigm in the bilingual dictionary can be used to remove all <Pos> tags from adjectives in the target language.
  3. If the tag changes depend on syntactic or morphological context of more than one word, then it should be done in transfer rules.

Note: The boundaries between 1 and 2, and between 2 and 3, will often be unclear. If in doubt, just do it the first way that comes to mind; it can always be changed later.
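
The first two options above can be illustrated on plain tag lists. In the actual pair these changes would live in the tagset and in a bilingual-dictionary paradigm respectively; the Python below is only a sketch of their effect, and the function names are hypothetical.

```python
# Option 1: a tag that means the same thing in all contexts is simply renamed.
RENAME = {"NegCon": "ConNeg"}


def relabel(tags):
    """Rename tags that have a 1:1 equivalent in the Sámi tagset."""
    return [RENAME.get(t, t) for t in tags]


# Option 2: a tag is removed depending on other tags of the same word.
def strip_positive(tags):
    """Drop Pos from adjectives, since the positive is unmarked in Sámi."""
    if "A" in tags:
        return [t for t in tags if t != "Pos"]
    return tags


print(relabel(["V", "NegCon"]))                   # ['V', 'ConNeg']
print(strip_positive(["A", "Pos", "Sg", "Nom"]))  # ['A', 'Sg', 'Nom']
```

Anything that needs to look at a neighbouring word falls outside both functions, which is the point at which option 3 (transfer rules) applies.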

Tagging inconsistencies

kritisoinnin - kritiserema
$ echo kritiserema | osme
191480 0
kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Gen
kritiserema	kritiseret+V+TV+Der3+Der/n+N+Sg+Acc
kritiserema	kritiseret+V+TV+Actio+Gen
kritiserema	kritiseret+V+TV+Actio+Acc

$ echo kritisoinnin | ofin
kritisoinnin	kritisointi+N+Sg+Gen

Can we add a lexicalised entry for this derived noun in the sme lex?
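
One way to spot this kind of mismatch automatically is to compare only the tags from the last part-of-speech tag onwards, so that a derived sme noun and a lexicalised fin noun end up with the same signature. The sketch below assumes a small, incomplete set of part-of-speech tags and a hypothetical helper name; it is meant as an illustration rather than a tested diagnostic.

```python
POS_TAGS = {"N", "V", "A", "Adv"}  # assumed, incomplete set of part-of-speech tags


def surface_signature(analysis):
    """Return the tags from the last part-of-speech tag onwards."""
    tags = analysis.split("+")[1:]  # drop the lemma
    last_pos = max((i for i, t in enumerate(tags) if t in POS_TAGS), default=0)
    return tags[last_pos:]


sme = "kritiseret+V+TV+Der3+Der/n+N+Sg+Gen"
fin = "kritisointi+N+Sg+Gen"
print(surface_signature(sme))                            # ['N', 'Sg', 'Gen']
print(surface_signature(sme) == surface_signature(fin))  # True
```

The Actio readings of kritiserema keep a verbal signature under this comparison, so they would not be matched with the Finnish noun.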
