Northern Sámi and Norwegian/smemorf
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
Misc
- add entries from bidix that are missing from the analyser
- Missing nouns, adverbs, adjectives
- regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
- regex for acronyms like "GsoC:as" (tokenisation dependent...)
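A rough sketch of what the two regex items could look like, written in Python only to pin down the patterns; the real fix would live in the analyser or the tokeniser. The patterns, the names `URL_RE`, `ACRONYM_RE` and `classify`, and the decision to treat a bare domain as a URL are all assumptions, not anything taken from smemorf.

```python
import re

# Assumption: a bare domain such as "telefonkatalogen.no" counts as a URL,
# with an optional scheme in front and an optional path behind.
URL_RE = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$",
    re.IGNORECASE,
)

# Assumption: an acronym starts with an ASCII capital and contains at least
# one more capital, optionally followed by ":" and a case suffix ("GsoC:as").
ACRONYM_RE = re.compile(
    r"^(?P<acro>[A-Z][A-Za-z]*[A-Z][A-Za-z]*)(?::(?P<suffix>\w+))?$"
)


def classify(token):
    """Return 'url', 'acronym' or 'word' for a single token."""
    if URL_RE.match(token):
        return "url"
    if ACRONYM_RE.match(token):
        return "acronym"
    return "word"
```

Tokens classified as `url` would be passed through untranslated. The acronym pattern only helps if the tokeniser keeps "GsoC:as" together as one token, which is the dependency noted above.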
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));
-- "lávet" should be analysed as both TV and IV, and then disambiguated
Typos
I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)
- a list of high-frequency typos where the correction has an analysis
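One cheap way to get a first candidate list, sketched in Python: strip the diacritics from word forms the analyser already knows, and keep the stripped form as a typo entry when it is not itself a known form. This is an assumption about how such a list might be bootstrapped, not the frequency-based list the item asks for; `known_forms` is a hypothetical set of analysed word forms, and the function names are made up for illustration.

```python
import unicodedata

# đ, ŧ and ŋ have no Unicode decomposition, so they are folded by hand.
MANUAL_FOLD = str.maketrans(
    {"đ": "d", "Đ": "D", "ŧ": "t", "Ŧ": "T", "ŋ": "n", "Ŋ": "N"}
)


def ascii_fold(word):
    """Strip diacritics: 'ođđa' -> 'odda', 'áigi' -> 'aigi'."""
    decomposed = unicodedata.normalize("NFD", word.translate(MANUAL_FOLD))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_typo_table(known_forms):
    """Map diacritic-less spellings to the known forms they may stand for.

    A folded spelling is skipped when it is itself a known form, since
    "correcting" it would overwrite a valid analysis.
    """
    table = {}
    for form in known_forms:
        folded = ascii_fold(form)
        if folded != form and folded not in known_forms:
            table.setdefault(folded, set()).add(form)
    return table
```

Entries that map to more than one known form are ambiguous and would need either manual review or the frequency information mentioned above.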
Compounding
ensure compounding is only tried if there is no other solution
Most general solution: use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now…
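The two strategies can be compared in a small Python sketch. It only models the behaviour described above on plain strings; it is not how hfst-proc is implemented, and the function names and the `border_weight` parameter are hypothetical.

```python
def drop_compounds_if_possible(analyses):
    """Filter as described above: if any analysis lacks a '+',
    discard every analysis that contains one."""
    simple = [a for a in analyses if "+" not in a]
    return simple if simple else list(analyses)


def lowest_weight(analyses, border_weight=1.0):
    """Weighted variant: each compound border ('+') adds a non-zero
    weight, and only the analyses with the lowest total are kept."""
    weights = {a: a.count("+") * border_weight for a in analyses}
    best = min(weights.values())
    return [a for a in analyses if weights[a] == best]
```

Both give the same result whenever a non-compound analysis exists. They differ when every analysis is a compound: the filter keeps them all, while the weighted variant prefers the reading with the fewest borders.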
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
- Buorre beaivi => God dag
- leat guollebivddus => å fiske
- maid ban dainna => hva i all verden
- jagis jahkái => fra år til år
- oaidnaleapmái => 'see you'
- ovdamearkka => for eksempel
- Mo manná? => Hvordan går det?
- ja nu ain => og så videre
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
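For the separate file, the entries would need their spaces quoted, since lexc treats an unquoted space as a separator and uses `%` as the quote character. A minimal Python sketch of generating such lines follows; the tag `+Adv`, the continuation `#` and the helper names are assumptions for illustration, and the real tags and continuation lexicons would have to match what smemorf uses.

```python
def lexc_escape(text):
    """Quote the characters this sketch cares about for lexc:
    a literal '%' becomes '%%' and a space becomes '% '."""
    return text.replace("%", "%%").replace(" ", "% ")


def mwe_entry(expression, tag, continuation):
    """Build one lexc line of the form 'upper:lower CONTINUATION ;'.

    `tag` and `continuation` are hypothetical; nothing here checks
    that they are declared in the target lexc file.
    """
    escaped = lexc_escape(expression)
    return f"{escaped}{tag}:{escaped} {continuation} ;"
```

For example, `mwe_entry("dasa lassin", "+Adv", "#")` yields `dasa% lassin+Adv:dasa% lassin # ;`.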
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.