Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 08:21, 8 June 2010

apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.

Misc

  • HFST tokenisation
    • Split up lexicon into standard and inconditional (punctuation) for hfst-proc
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • Proper casing support in the sme lexicon. (Mánát vs. mánát)
    • In the Xerox software, this is done by a separate FST: m (->) M || .#. _ ;
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
  • Ánde-máhka gives ^Ánde<N><Prop><Mal><Sg><Nom><-+máhka><N><Sg><Nom>$... we should have something like ^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$
  • Suddenly some nouns have the +Actor tag after +N... remove?
  • geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
    • geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
    • geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
    • geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
  • -prográmma gives ^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either an inconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
  • apertium-destxt adds an extra period if we have an empty line below... is there a way we could make it add a headline tag instead?
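
The casing item above can be illustrated outside the transducer. The following is only a sketch in Python, not how the analyser does it: it assumes we generate lookup candidates before analysis, and the function name `casing_candidates` is hypothetical. Like the rule `m (->) M || .#. _`, it only touches the first character of the word.

```python
def casing_candidates(token: str) -> list[str]:
    """Return the surface forms to try for one token.

    Only the first character may differ in case (word-initial position),
    the rest of the word is left untouched.
    """
    candidates = [token]
    if token and token[0].isupper():
        # Sentence-initial "Mánát" should also be looked up as "mánát".
        candidates.append(token[0].lower() + token[1:])
    return candidates
```

With this sketch, "Mánát" is tried both as "Mánát" and as "mánát", while "mánát" is only tried as itself.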

Compounding

Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
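
The effect of such a weight can be sketched in Python. This is an illustration under assumptions, not the transducer itself: the weight value, the `Analysis` type and the placeholder analyses are all hypothetical. Each compound border adds a fixed positive weight, and only the lowest-weight analyses are kept, so a compound reading survives only when no lexicalised reading exists.

```python
from typing import NamedTuple

# Hypothetical value; any weight greater than zero gives the same ranking.
COMPOUND_BORDER_WEIGHT = 1.0


class Analysis(NamedTuple):
    parts: tuple[str, ...]  # one entry per compound member


def weight(analysis: Analysis) -> float:
    """Each compound border between two parts costs a fixed weight."""
    return COMPOUND_BORDER_WEIGHT * (len(analysis.parts) - 1)


def best_analyses(analyses: list[Analysis]) -> list[Analysis]:
    """Keep only the analyses with the lowest total weight."""
    if not analyses:
        return []
    lowest = min(weight(a) for a in analyses)
    return [a for a in analyses if weight(a) == lowest]


# Placeholder analyses, not real analyser output.
lexicalised = Analysis(("wordAB+N+Sg+Nom",))
compound = Analysis(("wordA+N", "wordB+N+Sg+Nom"))
```

Here `best_analyses([compound, lexicalised])` keeps only the lexicalised reading, while `best_analyses([compound])` still returns the compound, since nothing cheaper is available.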

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • lea go => ^leat<V><IV><Ind><Prs><Sg3><Qst> (just like "leago")
  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
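
How fixed expressions could be grouped into single lexical units can be sketched in Python. This is a rough illustration only: the real entries would live in the lexc source, and the names `MULTIWORDS` and `group_multiwords` are hypothetical. It assumes whitespace-tokenised input and uses greedy longest match over the multiwords listed above.

```python
# Multiwords taken from the list above, stored as lowercase token tuples.
MULTIWORDS = {
    ("lea", "go"),
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("mun", "ieš"),
    ("bures", "boahtin"),
}


def group_multiwords(tokens, multiwords=MULTIWORDS):
    """Join known multiword sequences into one unit, longest match first."""
    longest = max((len(m) for m in multiwords), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in multiwords:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

For example, the tokens `["mun", "ieš", "dasa", "lassin"]` come out as the two units "mun ieš" and "dasa lassin", each of which could then receive a single analysis.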