Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Line 13: Line 13:
  * <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
− * Typos: I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
Line 23: Line 20:
  ** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853
+ * Numbers with case shouldn't get the ':' in between tags: <code>^19.00:s/19.00<Num>:<Sg><Loc>$</code>
+
+ ==Typos==
+ I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
+
+ * a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
  ==Compounding==

Revision as of 13:20, 27 October 2011

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Misc


  • regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)


  • regex for acronyms like "GsoC:as" (tokenisation dependent...)


  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated


  • oktan seems to be used as a preposition, can we have that?
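The two regex items above could be prototyped roughly as follows. This is only a sketch: the patterns, the names URL_RE, ACRONYM_RE and classify, and the idea of classifying tokens before analysis are assumptions for illustration, not anything that exists in the analyser, and a real URL grammar would need to be much stricter.

```python
import re

# Hypothetical pattern: optional scheme, dot-separated labels, alphabetic TLD.
URL_RE = re.compile(r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$", re.IGNORECASE)

# Hypothetical pattern: a letter string containing at least one capital,
# then a colon, then a lower-case suffix (e.g. "GsoC:as").
ACRONYM_RE = re.compile(r"^(?P<acro>[A-Za-z]*[A-Z][A-Za-z]*):(?P<suffix>[a-záčđŋšŧž]+)$")


def classify(token):
    """Return 'url', 'acronym' or 'word' for a single token."""
    if URL_RE.match(token):
        return "url"
    if ACRONYM_RE.match(token):
        return "acronym"
    return "word"
```

With these patterns, telefonkatalogen.no is classified as a URL (and so could be passed through untranslated), while "GsoC:as" is split into an acronym and a suffix. Whether the colon form ever reaches such a check depends on tokenisation, as noted in the list.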


Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)
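To find candidates for such entries, one could generate diacritic variants of an unanalysed form and keep those the analyser knows. The sketch below is a naive stand-in, not charlifter: the DIACRITIC table, the restorations function and the lexicon set (standing in for "has an analysis") are all hypothetical, only lower-case letters are handled, and the search is exponential in the number of replaceable letters.

```python
from itertools import product

# ASCII letter -> Northern Sámi letter that is often typed without its diacritic.
DIACRITIC = {"a": "á", "c": "č", "d": "đ", "n": "ŋ", "s": "š", "t": "ŧ", "z": "ž"}


def restorations(word, lexicon):
    """Return lexicon words reachable from `word` by adding diacritics."""
    # For each character, allow either the letter itself or its diacritic form.
    options = [(ch, DIACRITIC[ch]) if ch in DIACRITIC else (ch,) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    # Keep only real corrections: different from the input and known to the lexicon.
    return sorted(c for c in candidates if c != word and c in lexicon)
```

For example, with a toy lexicon containing "ođđa", the input "odda" yields that single correction; pairs found this way could then be reviewed by hand before being added to the analyser.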

Compounding

ensure compounds are tagged as such

http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=851

  • allaskuvla, Riikaráđi and 1100-logu get no compound tags in the compound analysis. The first can be fixed by adding +Cmpnd in LEXICON ATTR (the RReal border); similarly for the second in LEXICON ACCRA-NE (both RReal borders).
  • -prográmma gives ^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either as an unconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
    • A combination of the two errors above: Ánde-máhka gives ^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$, but we should have something like ^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$

ensure compounding is only tried if there is no other solution

Most general solution: use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.

However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now…
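The filtering behaviour described above can be sketched in a few lines. This only illustrates the rule as stated here (drop analyses containing a + when one without exists); it is not hfst-proc's actual code, and the function name and the analysis strings in the usage note are made up.

```python
def prefer_non_compounds(analyses):
    """Drop analyses containing '+' when at least one analysis has none."""
    simple = [a for a in analyses if "+" not in a]
    # Fall back to the full list when every analysis is a compound.
    return simple if simple else list(analyses)
```

Given the placeholder analyses "abc<N><Sg><Nom>" and "ab<N>+c<N><Sg><Nom>", only the first is kept; if every analysis contains a +, all of them are kept. A weighted transducer would generalise this by ranking analyses instead of filtering on a single character.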

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
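As an illustration of what treating such expressions as single units means, here is a minimal longest-match grouping sketch over a token list. In practice the entries would live in the analyser's lexicon rather than in a separate pass; the MWES set (a few entries taken from the list above), MAX_LEN and group_mwes are hypothetical names used only for this example.

```python
# A few of the multiword entries listed above, as lower-cased token tuples.
MWES = {
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("bures", "boahtin"),
    ("jagis", "jahkái"),
}
MAX_LEN = max(len(entry) for entry in MWES)


def group_mwes(tokens):
    """Join known multiword expressions into single units, longest match first."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MWES:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword entry starts here; keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out
```

Matching is case-insensitive so that sentence-initial "Bures boahtin" is still grouped, while the original casing is preserved in the output.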

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.