Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 14:19, 14 July 2010

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Add entries from bidix that are missing from the analyser
- Missing nouns, adverbs, adjectives

Proper casing support in the sme lexicon. (Mánát vs. mánát)
- In the Xerox software, this is done by a separate FST: m (->) M || .#. _ ;
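
Outside xfst, the effect of such an optional initial-capital rule can be approximated at lookup time. This is only a minimal Python sketch: `case_variants`, `analyse` and the toy lexicon are hypothetical names for illustration, not part of the Giellatekno or Apertium tools.

```python
def case_variants(token):
    """Return lookup candidates: the token as written and, if it starts
    with an uppercase letter, the same token with that letter lowercased."""
    candidates = [token]
    if token and token[0].isupper():
        candidates.append(token[0].lower() + token[1:])
    return candidates


# Toy lexicon standing in for the real analyser (assumption; the
# analysis string is a placeholder, not real analyser output).
TOY_LEXICON = {"mánát": ["(analysis of mánát)"]}


def analyse(token):
    """Collect the analyses of every case variant found in the toy lexicon."""
    analyses = []
    for candidate in case_variants(token):
        analyses.extend(TOY_LEXICON.get(candidate, []))
    return analyses
```

With this, `analyse("Mánát")` and `analyse("mánát")` return the same entries, while a lexicon entry that is itself capitalised would still only match capitalised input.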

8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated

geafivuohta gets one Der/vuohta analysis without an A tag and one with an A tag; fix:
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
- http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=852
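
To find other words with the same problem, one could group analyses that become identical once the A tag is ignored. The following is a rough Python sketch under that assumption; `differs_only_by_tag` is a hypothetical helper, and the analyses are assumed to be '+'-separated strings as in the list above.

```python
from collections import defaultdict


def differs_only_by_tag(analyses, tag="A"):
    """Return groups of analyses that are identical apart from `tag`."""
    groups = defaultdict(list)
    for analysis in analyses:
        # Key on the tag sequence with every occurrence of `tag` removed.
        key = tuple(part for part in analysis.split("+") if part != tag)
        groups[key].append(analysis)
    return [group for group in groups.values() if len(group) > 1]


analyses = [
    "geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom",
    "geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom",
    "geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom",
]
suspicious = differs_only_by_tag(analyses)
```

Here `suspicious` holds a single group containing the first two analyses; the lexicalised third analysis is not reported.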

Typos: I've seen "odda" (for "ođđa") in many places; can we just add these to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)
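
Until real diacritic restoration is available, a brute-force stopgap could generate diacritic variants of an unknown word and keep those the analyser knows. This is a naive Python sketch, not how charlifter works; `VARIANTS`, `restore_candidates` and the `known_words` set are assumptions for illustration, only lowercase letters are handled, and the number of candidates grows exponentially with word length.

```python
from itertools import product

# Plain letter -> its Northern Sámi counterpart with a diacritic.
VARIANTS = {"a": "á", "c": "č", "d": "đ", "n": "ŋ", "s": "š", "t": "ŧ", "z": "ž"}


def restore_candidates(word, known_words):
    """Return known words reachable from `word` by adding diacritics."""
    # For each character: keep it, or swap in its diacritic variant.
    options = [(ch, VARIANTS[ch]) if ch in VARIANTS else (ch,) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return [c for c in candidates if c != word and c in known_words]
```

For example, `restore_candidates("odda", {"ođđa"})` proposes "ođđa", provided that form is in the set of known words.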

oktan seems to be used as a preposition, can we have that?
- Juo cuoŋománu 10. beaivvi ija vuostá mátkkoštii Ruvdnaprinseassa Märtha badjel ráji Ruŧŧii oktan Ruvdnaprinsabára golmmain mánáin.
- áhčči oktan mánáinis
- http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853

Numbers with case shouldn't get the ':' in between tags: ^19.00:s/19.00<Num>:<Sg><Loc>$
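
Until this is fixed in the analyser itself, the stray colon could be stripped in a post-processing step. This is a minimal Python sketch that assumes the colon only ever appears directly between two tags; `drop_colon_between_tags` is a hypothetical helper, not an existing Apertium script.

```python
import re


def drop_colon_between_tags(lexical_unit):
    """Remove a ':' that sits directly between two <tags>.

    Colons elsewhere, such as in the surface form, are left alone.
    """
    return re.sub(r">:<", "><", lexical_unit)


fixed = drop_colon_between_tags("^19.00:s/19.00<Num>:<Sg><Loc>$")
# fixed == "^19.00:s/19.00<Num><Sg><Loc>$"
```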

allaskuvla, Riikaráđi and 1100-logu get no compound tags in the compound analysis. The first can be fixed by adding +Cmpnd in LEXICON ATTR, at the RReal border; similarly for the second in LEXICON ACCRA-NE, at both RReal borders.

-prográmma gives ^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either as an unconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).

- A combination of the two errors above: Ánde-máhka gives ^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$; we should have something like ^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$

Most general solution: Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.

However, hfst-proc achieves the same effect by simply removing any analysis with a + in it whenever there is one without a +, so we're OK for now…
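
The heuristic described here can be sketched in a few lines of Python. This only mirrors the behaviour as described above and is not hfst-proc's actual code; `prefer_non_compounds` and the example strings are made up for illustration.

```python
def prefer_non_compounds(analyses):
    """Drop analyses containing '+' if at least one analysis has none.

    If every analysis contains a '+', all of them are kept.
    """
    simple = [a for a in analyses if "+" not in a]
    return simple if simple else list(analyses)


# Placeholder analyses (assumption): one lexicalised, one dynamic compound.
readings = ["ab<N><Sg><Nom>", "a<N><Sg><Nom>+b<N><Sg><Nom>"]
kept = prefer_non_compounds(readings)
```

Here `kept` contains only the lexicalised reading. A weighted transducer would generalise this by ranking analyses by total weight instead of a hard yes/no test on the `+`.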

Add simple multiwords and fixed expressions to the analyser.

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob; it'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.

Change in this revision (Unhammer, section Misc, line 19; previous revision 14:02, 14 July 2010):
- Old: Make sure bidix has +Actor or +G3 tag for the +N's that require it
- New: The +G3 tag: handle like +Actor tag for the +N's that require it