Difference between revisions of "North Saami and Finnish"
Revision as of 15:12, 10 May 2010
This page is for discussing the Northern Sámi and Finnish translator (apertium-fin-sme). Some pending things to think about:
- How are compounds dealt with in Omorfi and in the GTSVN analysers ? Do they always split in the same places ? If not, we probably have to add those that don't as lexicalised entries in the transducers.
- Adding subcategories (Dem, Itg, etc.) to pronouns in Omorfi (done)
- Fred Karlsson's constraint grammar for Finnish has been GPL'd, and is available and undergoing conversion to CG3 here: https://victorio.uit.no/langtech/trunk/kt/fin/src
  - This should be converted in an Apertium-compatible manner from the start! No using reserved symbols (e.g. <, > and /)
- How can we restrict generation of alternative forms in the Sámi generator ? In lttoolbox this is done with LR (only analyse)/RL (only generate) markings.
- Can we get access to the Álgu database ?
- hfst-lookup or something similar to _generate_ analyses that come in with ^ and $
- Can we rig up SVN to pull in the twol file from GT svn directly ?
- Some tags do not get replaced by the relabel script: olleet olla+V[GEN=ACT]+Pcp1+Pos+Pl+Nom
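The ^ and $ mentioned in the list above are the delimiters of lexical units in the Apertium stream format. As a minimal sketch (not part of any existing tool), the following Python pulls the bare analysis strings out of such a stream so that they could be handed to a generator. The function name and the sample tags are hypothetical, and only backslash escaping is handled.

```python
import re
from typing import List

# One lexical unit: "^", then escaped characters or anything
# other than "$" and "\", then the closing "$".
LEXICAL_UNIT = re.compile(r"\^((?:\\.|[^$\\])*)\$")


def lexical_units(stream: str) -> List[str]:
    """Return the contents of every ^...$ unit, with the delimiters removed."""
    return [match.group(1) for match in LEXICAL_UNIT.finditer(stream)]


if __name__ == "__main__":
    # Sample input with made-up tags, for illustration only.
    print(lexical_units("^olla<vblex><pl>$ ^talo<n><sg>$"))
```

Text outside the delimiters (spaces, punctuation, formatting) is simply skipped here; a real pipeline would need to pass it through unchanged.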
Comparisons of Northern Sámi and Finnish
Noun phrases
Both Northern Sámi and Finnish order noun suffixes in this way:
NOUN-Pl-Case-Possessive-CliticParticles
Possessive markers are much less common in Northern Sámi, but morphological analyzers will handle them.
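The fixed slot order can be illustrated with a small Python sketch. It only concatenates suffix strings in the order given above and ignores all morphophonology (consonant gradation, vowel harmony and so on, which the twol rules handle in the real analysers). The slot names and the function are hypothetical.

```python
# Slot order shared by both languages, as given above.
SLOT_ORDER = ("Pl", "Case", "Possessive", "Clitic")


def build_noun(stem: str, suffixes: dict) -> str:
    """Concatenate stem and suffixes in the fixed slot order, skipping empty slots."""
    unknown = set(suffixes) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    return stem + "".join(suffixes.get(slot, "") for slot in SLOT_ORDER)


if __name__ == "__main__":
    # Finnish talo 'house': talo-i-ssa-ni-kin 'also in my houses'.
    # The suffixes are passed out of order on purpose.
    print(build_noun("talo", {"Clitic": "kin", "Case": "ssa", "Pl": "i", "Possessive": "ni"}))
```

Whatever order the suffixes are supplied in, the output follows NOUN-Pl-Case-Possessive-CliticParticles.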
Constituent order within noun phrases is similar:
Det Num Adj+ Noun
where Det can be either a demonstrative or a possessive pronoun.
Cases
Northern Sámi has 7 cases: nominative, accusative, genitive, locative, illative, comitative, essive.
- Accusative and Genitive are often syncretic, except in some numbers and some pronouns.
- Comitative and Essive are the same in singular and plural
Finnish has 15 cases (and several additional case-like suffixes only applied to adverbials). That is a lot, so here are the significant facts, grouped by function to avoid a string of opaque Latinate terms:
- Structural cases: 4. nominative, partitive, accusative, genitive
- Locative cases: 6. An internal and external set (3 cases each) that show goal, location, and source.
- Stative cases: 2. state, goal state; rarely a third - source state
- Additional: 2 instructive/instrumental cases (with, without), 1 comitative case (plural only)
Where Finnish distinguishes internality and externality with locative and stative cases, there is no such distinction in Northern Sámi. Northern Sámi uses locative for source and location, and illative for goal. Thus, cases can roughly be transferred this way:
- (fin) Internal Source, Internal Location, External Source, External Location → Locative
- (fin) Internal Goal, External Goal → Illative
- (fin) Partitive, Accusative, Genitive → AccGen
Of course, the last set ending in AccGen will have to be distinguished with certain numbers and pronouns.
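The rough case mapping above can be written down as a lookup table. This is a sketch only: the Finnish tag names (Ela = elative / internal source, Ine = inessive / internal location, Abl = ablative / external source, Ade = adessive / external location, Ill = illative / internal goal, All = allative / external goal) and the target tag names are assumptions for illustration, not necessarily the tags used by Omorfi or the Giellatekno analysers.

```python
from typing import Optional

# Finnish case tag -> Northern Sámi case tag (tag names are assumed).
FIN_TO_SME_CASE = {
    # source and location, internal and external -> locative
    "Ela": "Loc", "Ine": "Loc", "Abl": "Loc", "Ade": "Loc",
    # goal, internal and external -> illative
    "Ill": "Ill", "All": "Ill",
    # structural cases -> syncretic accusative/genitive
    "Par": "AccGen", "Acc": "AccGen", "Gen": "AccGen",
}


def transfer_case(fin_case: str) -> Optional[str]:
    """Return the Sámi case for a Finnish case, or None if the table has no entry."""
    return FIN_TO_SME_CASE.get(fin_case)


if __name__ == "__main__":
    for case in ("Ine", "All", "Par", "Tra"):
        print(case, "->", transfer_case(case))
```

Cases not covered by the mapping come back as None, so a caller can treat them separately. Splitting AccGen into accusative and genitive for the numbers and pronouns that distinguish them would need an extra step on top of this table.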
Adjectives
Adjectives in Northern Sámi can have two separate forms depending on whether they are attributive or predicative. The attributive adjectives mostly do not agree in number with the head noun, but predicative adjectives do. Attributive adjectives do not agree in case with the head noun.
In Finnish, attributive adjectives always agree in number and case with the head noun, and adjectives agree in number when they occur in predicates (although there is some variation as to whether the predicative adjective is partitive plural or nominative plural).
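For transfer, the Northern Sámi side of this contrast could be sketched as below. The tag strings are modelled on the sápmelaš analyses shown elsewhere on this page (A+Attr and A+Sg+Nom). The function is hypothetical, it assumes nominative for predicative adjectives, and it ignores the attributive adjectives that do show agreement.

```python
def sme_adjective_tags(position: str, number: str) -> str:
    """Pick Sámi adjective tags from syntactic position and the number of the noun."""
    if position == "attributive":
        # No number or case agreement: a single attributive form.
        return "A+Attr"
    if position == "predicative":
        # Agrees in number; nominative is assumed here.
        return f"A+{number}+Nom"
    raise ValueError(f"unknown position: {position!r}")


if __name__ == "__main__":
    print(sme_adjective_tags("attributive", "Pl"))
    print(sme_adjective_tags("predicative", "Pl"))
```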
Derivation
Tag | Type | Example | Analysis | in North Sámi | Gloss |
---|---|---|---|---|---|
Der/inen | N→Adj | "muovinen" | muovi+N+Der/inen+Pos+Sg+Nom | plastihkas ráhkaduvvon | plastihkka+n.loc build+v.pass.pp |
Der/ja | V→N | "kirjoja" | kirjoa+V+Der/ja+Sg+Nom | kirjoa-ja = write-er (writer) ? | |
Der/lainen | N→Adj | "saamelainen" | saame+N+Der/lainen+Pos+Sg+Nom | sápmelaš | -laš |
Der/llinen | N→Adj | "kirjallinen" | kirja+N+Der/llinen+Sg+Nom | kirja-llinen = book-ish (literary)? | |
Der/minen | marks deverbal nouns ? | | | | |
Der/oi | | | | | |
Der/sti | Adj→Adv | derives an adverb from an adjective ? -ly | | | |
Der/tar | | | | | |
Der/ton | N→Adj | "rahaton" | raha+N+Der/ton+Sg+Nom | ruđaheapme | ruht + -heapme |
Der/tse | | | | | |
Der/ttain | | | | | |
Der/u | | | | | |
Der/vs | | | | | |
There are some cases where both a derived and a lexicalised entry might be in one analyser, but only one or the other in the other analyser. For example:
saamelainen [LEMMA='saamelainen'][POS=ADJECTIVE][KTN=38][CMP=POS][NUM=SG][CASE=NOM]
saamelainen [LEMMA='saame'][POS=NOUN][KTN=8][GUESS=DERIVE][DRV=LAINEN][CMP=POS][NUM=SG][CASE=NOM]
saamelainen [LEMMA='saame'][POS=NOUN][KTN=8][NUM=SG][CASE=NOM][BOUNDARY=COMPOUND][GUESS=COMPOUND][LEMMA='lainen'][POS=NOUN][KTN=38][NUM=SG][CASE=NOM]
saamelainen saamelainen+A+Pos+Sg+Nom
saamelainen saame+N+Der/lainen+Pos+Sg+Nom
saamelainen saame+N+Sg+Nom#lainen+N+Sg+Nom
versus:
sápmelaš sápmelaš+A+Sg+Nom
sápmelaš sápmelaš+A+Attr
How to deal with this will be one of the main challenges. E.g. do we add more entries, or do we remove entries ? Is there a way to do either of those automatically ?
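One possible starting point for doing this automatically is to flag surface forms whose analyses mix lexicalised, derived and compound readings. The sketch below works on the relabelled analysis strings shown above and assumes that "#" marks a compound boundary and "+Der/" a derivation, as in those examples. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def classify(analysis: str) -> str:
    """Label one analysis string as compound, derived or lexicalised."""
    if "#" in analysis:
        return "compound"
    if "+Der/" in analysis:
        return "derived"
    return "lexicalised"


def group_by_kind(analyses: Iterable[str]) -> Dict[str, List[str]]:
    """Group the analyses of one surface form by the kind of reading."""
    groups = defaultdict(list)
    for analysis in analyses:
        groups[classify(analysis)].append(analysis)
    return dict(groups)


def is_mixed(analyses: Iterable[str]) -> bool:
    """True if a surface form has more than one kind of reading."""
    return len(group_by_kind(analyses)) > 1


if __name__ == "__main__":
    fin = ["saamelainen+A+Pos+Sg+Nom",
           "saame+N+Der/lainen+Pos+Sg+Nom",
           "saame+N+Sg+Nom#lainen+N+Sg+Nom"]
    sme = ["sápmelaš+A+Sg+Nom", "sápmelaš+A+Attr"]
    print(is_mixed(fin), is_mixed(sme))
```

Forms flagged on one side but not on the other are candidates for either adding a lexicalised entry or removing a derived or compound reading.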
Files
Source files

File | Description | Notes |
---|---|---|
apertium-sme-fin.sme-fin.dix | Transfer lexicon / Bilingual dictionary | |
apertium-sme-fin.sme.twol | Morphophonology for Sámi | This file is copied as is from Giellatekno SVN. No changes should be made to the local version. |
apertium-sme-fin.fin-sme.rlx | Constraint Grammar for Finnish | |
apertium-sme-fin.fin-sme.t1x | Chunker file for Finnish→Northern Sámi | |

Compiled and binary files

File | Description | Notes |
---|---|---|
fin-sme.prob | Tagger HMM probability file | This file needs to be trained when the CG is fully converted. |
fin-sme.rlx.bin | Compiled Constraint Grammar for Finnish | |
fin-sme.autobil.bin | Compiled transfer lexicon | |
fin-sme.t1x.bin | Compiled transfer rules | These are first-stage transfer rules, mostly for chunking and local reordering. |