Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 13:45, 27 October 2011

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Misc

add entries from bidix that are missing from the analyser
- Missing nouns, adverbs, adjectives

regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)

regex for acronyms like "GsoC:as" (tokenisation dependent...)

8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
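A rough sketch of the two regex items above, in Python. It shows the kind of pre-classification that would keep URLs and suffixed acronyms away from the ordinary analysis path. The patterns, the short TLD list and the classify helper are illustrative assumptions, not existing analyser code.

```python
import re

# Assumption: a small, hand-picked TLD list; a real rule would need a fuller one.
URL_RE = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)+(?:no|se|fi|com|org|net)(?:/\S*)?$",
    re.IGNORECASE,
)

# Assumption: an acronym starts and ends with a capital letter and may carry
# a case suffix after a colon, as in "GsoC:as".
ACRONYM_RE = re.compile(
    r"^(?P<stem>[A-ZÁČĐŊŠŦŽ][A-Za-z]*[A-Z])(?::(?P<suffix>[a-záčđŋšŧž]+))?$"
)


def classify(token):
    """Return 'url', 'acronym' or 'word' for a single token."""
    if URL_RE.match(token):
        return "url"
    if ACRONYM_RE.match(token):
        return "acronym"
    return "word"


print(classify("telefonkatalogen.no"))
print(classify("GsoC:as"))
```

Whether the colon suffix reaches such a pattern intact depends on tokenisation, as noted above.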

Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)

a list of high-frequency typos where the correction has an analysis
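One way such a list could be drafted, sketched in Python: strip diacritics from the forms the analyser knows, then look for frequent corpus tokens that are unknown themselves but match exactly one stripped form. The function names, the min_count threshold and the input shapes (a set of known forms, a token-to-frequency dict) are assumptions for illustration. This only catches missing-diacritic typos like "odda".

```python
import unicodedata
from collections import defaultdict

# đ, ŧ and ŋ have no Unicode decomposition, so they need an explicit mapping.
NO_DECOMP = str.maketrans(
    {"đ": "d", "Đ": "D", "ŧ": "t", "Ŧ": "T", "ŋ": "n", "Ŋ": "N"}
)


def strip_diacritics(word):
    """Map e.g. 'ođđa' to 'odda' by dropping diacritics."""
    decomposed = unicodedata.normalize("NFD", word.translate(NO_DECOMP))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def typo_candidates(known_forms, corpus_counts, min_count=2):
    """Return {typo: correction} for frequent unknown tokens that have
    exactly one known form with the same diacritic-free spelling."""
    by_plain = defaultdict(set)
    for form in known_forms:
        by_plain[strip_diacritics(form)].add(form)

    result = {}
    for token, count in corpus_counts.items():
        if count < min_count or token in known_forms:
            continue
        corrections = by_plain.get(strip_diacritics(token))
        if corrections and len(corrections) == 1:
            result[token] = next(iter(corrections))
    return result


print(typo_candidates({"ođđa"}, {"odda": 5, "ođđa": 9}))
```

Ambiguous cases (several known forms behind one stripped spelling) are skipped on purpose; those would need real diacritic restoration.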

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in dev/xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag <dash>), but perhaps we could add a tag there …

Compounding

ensure compounding is only tried if there is no other solution

Most general solution: Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.

However, we have a CG rule that removes compounds if there are other readings, so we're OK for now.
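The effect of both approaches can be illustrated with a small Python sketch: each compound border costs a non-zero weight, and only the lowest-weight analyses survive, so compound readings are kept only when nothing else is available. The "+" border marker, the weight value and the toy analysis strings are assumptions for illustration, not actual analyser output.

```python
BORDER = "+"          # assumed marker for a dynamic compound border
BORDER_WEIGHT = 1.0   # any non-zero value gives the same ranking here


def weight(analysis):
    """Total weight of an analysis: one penalty per compound border."""
    return BORDER_WEIGHT * analysis.count(BORDER)


def best_analyses(analyses):
    """Keep only the analyses with the lowest weight."""
    if not analyses:
        return []
    lowest = min(weight(a) for a in analyses)
    return [a for a in analyses if weight(a) == lowest]


readings = ["guolli<N>+bivdu<N><Sg><Loc>", "guollebivdu<N><Sg><Loc>"]
print(best_analyses(readings))
```

With a truly weighted transducer the ranking would come from the transducer itself; this filter only mimics the outcome on a list of readings.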

Multiwords

Add simple multiwords and fixed expressions to the analyser.

dasa lassin => i tillegg (til det)
dán áigge => for tiden
mun ieš => meg selv
bures boahtin => velkommen
Buorre beaivi => God dag
leat guollebivddus => å fiske
maid ban dainna => hva i all verden
jagis jahkái => fra år til år
oaidnaleapmái => 'see you'
ovdamearkka => for eksempel
Mo manná? => Hvordan går det?
ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
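To make the intended effect concrete, here is a minimal Python sketch of treating listed expressions as single units, using a longest-match lookup over a token sequence. This is not how lexc entries work internally. The MWES set holds only a few of the expressions above, and group_multiwords is a hypothetical helper.

```python
# A few of the fixed expressions listed above, as lower-cased token tuples.
MWES = {
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("ja", "nu", "ain"),
}
MAX_LEN = max(len(mwe) for mwe in MWES)


def group_multiwords(tokens):
    """Join listed multiword expressions into single units,
    preferring the longest match at each position."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWES:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


print(group_multiwords(["ja", "nu", "ain", "dasa", "lassin", "boahtá"]))
```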

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.

@@ Line 26: / Line 26: @@
 Most general solution: Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
-However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now…
+However, we have a CG rule that removes compounds if there are other readings, so we're OK for now.
 ==Multiwords==
