Northern Sámi and Norwegian/smemorf
Documentation and TODOs for the sme morphological analyser used in apertium-sme-nob, which comes from Giellatekno.
Description
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
Trimming
Trimming uses the same HFST method as the Turkic pairs and others. This method does not handle compounds correctly.
Tagset changes
The tagset in apertium-sme-nob is closer to Apertium's tagset, and thus differs a bit from the one in Giellatekno.
- We remove Err/Orth/Usage tags (+Use/Sub, etc.)
- We remove any derivational analyses that aren't yet handled by transfer/bidix
- We reorder some tags
- We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
- We change certain tags
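The format change and the tag removal in the list above can be sketched in a few lines of Python. This is only an illustration, not the conversion the pair actually uses: the function name `convert_tags`, the `DROPPED_PREFIXES` tuple and the sample lemmas are assumptions, and the sketch ignores reordering, tag renaming and the filtering of unhandled derivations.

```python
# Hypothetical list of tag prefixes to drop (Err/Orth, Use/Sub, ...).
# The real list of removed tags is not given on this page.
DROPPED_PREFIXES = ("Err/", "Use/")


def convert_tags(analysis: str) -> str:
    """Turn a Giellatekno-style 'lemma+Tag+Tag' string into 'lemma<tag><tag>'.

    Assumes the lemma itself contains no '+' character.
    """
    lemma, *tags = analysis.split("+")
    converted = [lemma]
    for tag in tags:
        if tag.startswith(DROPPED_PREFIXES):
            continue  # remove Err/Orth and usage tags
        # +N -> <n>, +Der/PassL -> <der_passl>
        converted.append("<" + tag.lower().replace("/", "_") + ">")
    return "".join(converted)


print(convert_tags("viessu+N+Sg+Nom"))
print(convert_tags("lohkat+V+Der/PassL+V+Inf"))
print(convert_tags("viessu+N+Use/Sub+Sg+Nom"))
```

Doing the conversion on strings like this is only useful for checking what the output tags should look like; in the analyser itself the same mapping would have to be applied to the transducer's symbols.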
TODO
Misc
- add entries from bidix that are missing from the analyser
- regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
- regex for acronyms like "GsoC:as" (tokenisation-dependent...)
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));
-- "lávet" should be analysed as both TV and IV, and then disambiguated
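As a starting point for the URL and acronym items above, here is a rough Python sketch of what such patterns might look like. The patterns and the names `URL_RE`, `ACRONYM_RE` and `classify` are assumptions made for illustration; a real solution would live in the analyser or tokeniser, and these expressions are deliberately crude (the URL pattern, for instance, would also match two lowercase words joined by a full stop).

```python
import re

# Optional scheme, one or more dot-separated labels, a lowercase TLD,
# and an optional path. A crude approximation, not a full URL grammar.
URL_RE = re.compile(r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$")

# Letters containing at least one capital, a colon, then a suffix of
# letters, as in "GsoC:as".
ACRONYM_RE = re.compile(r"^[^\W\d_]*[A-Z][^\W\d_]*:[^\W\d_]+$")


def classify(token: str):
    """Return 'url', 'acronym' or None for a single token."""
    if URL_RE.match(token):
        return "url"
    if ACRONYM_RE.match(token):
        return "acronym"
    return None


for tok in ["telefonkatalogen.no", "GsoC:as", "ođđa"]:
    print(tok, classify(tok))
```

Tokens classified as URLs could then be passed through untranslated, which would avoid output like *telefonkatalogen.nå.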
Typos
I've seen "odda" in many places (for "ođđa"); can we just add such forms to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)
- a list of high-frequency typos where the correction has an analysis
Dashes
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it cause problems with lemma-matching in CG?).
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
- Buorre beaivi => God dag
- leat guollebivddus => å fiske
- maid ban dainna => hva i all verden
- jagis jahkái => fra år til år
- oaidnaleapmái => 'see you'
- ovdamearkka => for eksempel
- Mo manná? => Hvordan går det?
- ja nu ain => og så videre
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
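The multiword and fixed-expression entries in this section could be generated for a separate lexc file along these lines. This is a sketch under several assumptions: the helper name `mwe_entry` is made up, the `+Adv` tag for "dasa lassin" is a guess, the entries end the word directly with `#`, and only spaces are escaped (with lexc's `%` escape), so lemmas containing other special lexc characters would need more care.

```python
def mwe_entry(lemma: str, tag: str, contlex: str = "#") -> str:
    """Build one lexc line for a fixed expression.

    Spaces are escaped as '% ' so that lexc treats them as literal
    characters. The tag and continuation lexicon are supplied by the
    caller; the defaults here are illustrative only.
    """
    escaped = lemma.replace(" ", "% ")
    return f"{escaped}+{tag}:{escaped} {contlex} ;"


# Hypothetical tag choice for the adverbial expression; Po is the
# analysis suggested above for oktavuođas.
print(mwe_entry("dasa lassin", "Adv"))
print(mwe_entry("oktavuođas", "Po"))
```

Lines like these could be kept in a pair-specific file and appended by update-lexc.sh, as suggested above for Apertium-specific MWEs.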