Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 13:20, 27 October 2011

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Misc

add entries from bidix that are missing from the analyser
- Missing nouns, adverbs, adjectives

regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)

regex for acronyms like "GsoC:as" (tokenisation dependent...)

8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated

oktan seems to be used as a preposition, can we have that?
- Juo cuoŋománu 10. beaivvi ija vuostá mátkkoštii Ruvdnaprinseassa Märtha badjel ráji Ruŧŧii oktan Ruvdnaprinsabára golmmain mánáin.
- áhčči oktan mánáinis
- http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853
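The two regex items above (URLs and acronyms with a case suffix) could be prototyped outside the analyser before deciding how to encode them. The following is a minimal sketch only: the patterns, the `classify` helper and the definition of "acronym" (at least two capitals before a colon) are assumptions for illustration, not the tokenisation Giellatekno or Apertium actually uses.

```python
import re

# Assumption: a token is URL-like if it has an optional http(s) scheme,
# one or more dot-separated labels, and an alphabetic top-level domain.
URL_RE = re.compile(r"^(?:https?://)?(?:[\w-]+\.)+[^\W\d_]{2,}(?:/\S*)?$")

# Assumption: an acronym contains at least two capital letters (lowercase
# letters may be mixed in) and takes its case suffix after a colon.
CAP = "A-ZÁČĐŊŠŦŽ"
ACRONYM_RE = re.compile(
    rf"^(?P<stem>[^\W\d_]*[{CAP}][^\W\d_]*[{CAP}][^\W\d_]*)"
    r":(?P<suffix>[^\W\d_]+)$"
)


def classify(token):
    """Return 'url', 'acronym' or None for a single token (hypothetical helper)."""
    if URL_RE.match(token):
        return "url"
    if ACRONYM_RE.match(token):
        return "acronym"
    return None


if __name__ == "__main__":
    for tok in ["telefonkatalogen.no", "GsoC:as", "ođđa", "19.00"]:
        print(tok, classify(tok))
```

Tokens classified as URLs would be passed through untranslated; for acronyms, the `stem` and `suffix` groups separate the part to keep from the part that needs a case analysis.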

Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)

a list of high-frequency typos where the correction has an analysis
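One way such a list could be produced is sketched below, under the assumption that most of these typos are missing diacritics (as in "odda" for "ođđa"). The function names, the `has_analysis` callback (standing in for a lookup in the analyser) and the frequency threshold are all hypothetical, and only lowercase letters are handled.

```python
from itertools import product

# Assumption: each ASCII letter may stand for itself or its Northern Sámi
# diacritic counterpart. Uppercase letters are not handled in this sketch.
DIACRITIC_VARIANTS = {
    "a": "aá", "c": "cč", "d": "dđ", "n": "nŋ",
    "s": "sš", "t": "tŧ", "z": "zž",
}


def candidates(word):
    """Yield every diacritic-restored variant of word except word itself."""
    options = [DIACRITIC_VARIANTS.get(ch, ch) for ch in word]
    for combo in product(*options):
        cand = "".join(combo)
        if cand != word:
            yield cand


def typo_list(unknown_counts, has_analysis, min_count=2):
    """Return (typo, correction, count) rows, most frequent first.

    unknown_counts: dict mapping unanalysed word forms to corpus frequency.
    has_analysis:   callable returning True if a form gets an analysis.
    Only typos with exactly one analysable correction are kept.
    """
    rows = []
    for word, count in unknown_counts.items():
        if count < min_count:
            continue
        fixes = [c for c in candidates(word) if has_analysis(c)]
        if len(fixes) == 1:
            rows.append((word, fixes[0], count))
    return sorted(rows, key=lambda row: -row[2])


if __name__ == "__main__":
    known = {"ođđa"}  # toy stand-in for the analyser
    print(typo_list({"odda": 5, "xyz": 3}, known.__contains__))
```

Keeping only unambiguous corrections avoids adding a typo entry that could map to two different words.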

Compounding

ensure compounds are tagged as such

http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=851

allaskuvla, Riikaráđi and 1100-logu get no compound tags in the compound analysis. The first can be fixed by adding +Cmpnd at the RReal border in LEXICON ATTR; the second similarly in LEXICON ACCRA-NE, at both RReal borders.

-prográmma gives ^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either as an unconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).

- A combination of the two errors above: Ánde-máhka gives ^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$, but we should have something like ^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$

ensure compounding is only tried if there is no other solution

Most general solution: Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.

However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now…
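The two strategies above can be compared on plain analysis strings. This is a sketch only: it imitates the described behaviour in Python rather than calling hfst-proc or a real weighted transducer, the function names and `border_weight` parameter are hypothetical, and the example analyses are made-up strings in which each "+" marks a compound border.

```python
def prefer_non_compounds(analyses):
    """Mimic the filter described above: if any analysis has no '+',
    drop every analysis that has one; otherwise keep them all."""
    simple = [a for a in analyses if "+" not in a]
    return simple if simple else list(analyses)


def lightest(analyses, border_weight=1.0):
    """Emulate a weighted transducer in which each compound border costs
    border_weight: keep only the analyses with the lowest total weight."""
    if not analyses:
        return []
    weights = [a.count("+") * border_weight for a in analyses]
    best = min(weights)
    return [a for a, w in zip(analyses, weights) if w == best]


if __name__ == "__main__":
    # Illustrative analyses, not actual analyser output.
    readings = [
        "allaskuvla<N><Sg><Nom>",
        "alla<A>+skuvla<N><Sg><Nom>",
    ]
    print(prefer_non_compounds(readings))
    print(lightest(readings))
```

The two functions agree whenever a non-compound reading exists. They differ when every reading is a compound: the filter keeps all of them, while the weighted version keeps only those with the fewest borders.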

Multiwords

Add simple multiwords and fixed expressions to the analyser.

dasa lassin => i tillegg (til det)
dán áigge => for tiden
mun ieš => meg selv
bures boahtin => velkommen
Buorre beaivi => God dag
leat guollebivddus => å fiske
maid ban dainna => hva i all verden
jagis jahkái => fra år til år
oaidnaleapmái => 'see you'
ovdamearkka => for eksempel
Mo manná? => Hvordan går det?
ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

oktavuođas => i forbindelse med

It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.

@@ Line 13: / Line 13: @@
 * <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
-* Typos: I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
@@ Line 23: / Line 20: @@
 ** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853
-* Numbers with case shouldn't get the ':' in between tags: <code>^19.00:s/19.00<Num>:<Sg><Loc>$</code>
+==Typos==
+I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
+* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
 ==Compounding==
