Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 12:56, 7 June 2010

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Misc

  • HFST tokenisation
    • Also, split up lexicon into standard and inconditional (punctuation) for hfst-proc
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • Proper casing support in the sme lexicon. (Mánát vs. mánát)
    • In the Xerox software, this is done by a separate FST: m (->) M || .#. _ ;
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
  • Ánde-máhka gives ^Ánde<N><Prop><Mal><Sg><Nom><-+máhka><N><Sg><Nom>$, but we should have something like ^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$
  • Suddenly some nouns have the +Actor tag after +N... remove?
  • geafivuohta gets one Der/vuohta analysis without an A tag and one with an A tag; fix:
    • geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
    • geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
    • geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
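The acronym item in the list above could be prototyped outside the analyser before deciding how tokenisation should handle it. The following is only a rough sketch: the pattern, the name ACRONYM_RE and the helper split_acronym are hypothetical, and it assumes an acronym stem contains at least two uppercase letters and is followed by a colon and a lowercase suffix, as in "GsoC:as".

```python
import re

# Hypothetical pattern (an assumption, not taken from the sme lexicon):
# a stem of letters containing at least two uppercase letters,
# then a colon, then a lowercase inflectional suffix.
ACRONYM_RE = re.compile(
    r"^(?P<stem>(?:[^\W\d_]*[A-ZÁČĐŊŠŦŽ]){2}[^\W\d_]*)"
    r":(?P<suffix>[a-záčđŋšŧž]+)$"
)


def split_acronym(token):
    """Return (stem, suffix) for an inflected acronym, or None if no match."""
    match = ACRONYM_RE.match(token)
    if match is None:
        return None
    return match.group("stem"), match.group("suffix")


if __name__ == "__main__":
    print(split_acronym("GsoC:as"))
    print(split_acronym("mánát"))
```

Whether such tokens reach the analyser as a single unit still depends on tokenisation, as noted in the list.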

Compounding

Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
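The effect of weighting the compound border can be illustrated with a small sketch. This is not HFST code: the weight value, the function names and the toy analyses are all assumptions made up for illustration. Each analysis is modelled as a list of parts, every border between parts costs a fixed positive weight, and only the lowest-weight analyses are kept, so a compound reading survives only when no non-compound reading exists.

```python
# Assumed penalty per dynamic compound border; any positive value
# gives the same ranking in this sketch.
COMPOUND_BORDER_WEIGHT = 10.0


def weight(analysis_parts):
    """Total weight of one analysis: one penalty per compound border."""
    return COMPOUND_BORDER_WEIGHT * (len(analysis_parts) - 1)


def best_analyses(candidates):
    """Keep only the candidates with the lowest total weight."""
    if not candidates:
        return []
    lowest = min(weight(parts) for parts in candidates)
    return [parts for parts in candidates if weight(parts) == lowest]


if __name__ == "__main__":
    # Invented toy analyses, not output from the sme analyser.
    lexicalised = ["ab+N+Sg+Nom"]
    compound = ["a+N+SgNomCmp", "b+N+Sg+Nom"]
    print(best_analyses([compound, lexicalised]))  # only the lexicalised reading
    print(best_analyses([compound]))               # the compound is the fallback
```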

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • lea go => ^leat<V><IV><Ind><Prs><Sg3><Qst> (just like "leago")
  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
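To make the idea of treating the multiwords in this section as single units concrete, here is a minimal sketch of greedy longest-match merging over an already split token list. It only illustrates the behaviour one would want from the tokeniser and is not how hfst-proc or the lexc files work; the names MWES and merge_mwes are hypothetical, and the table contains just the surface forms listed above.

```python
# Surface forms taken from the multiword list in this section,
# stored as lowercase token tuples.
MWES = {
    ("lea", "go"),
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("mun", "ieš"),
    ("bures", "boahtin"),
}
MAX_LEN = max(len(mwe) for mwe in MWES)


def merge_mwes(tokens):
    """Merge known multiwords into single tokens, longest match first."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWES:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(merge_mwes(["Bures", "boahtin", "!"]))
```

The merged token would then be looked up as one entry, which is what adding the expressions to the analyser is meant to achieve.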