Difference between revisions of "Northern Sámi and Norwegian/smemorf"
Revision as of 07:55, 15 June 2010
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
Misc
- HFST tokenisation
- Split up lexicon into standard and inconditional (punctuation) for hfst-proc
- regex for acronyms like "GsoC:as" (tokenisation dependent...)
- Proper casing support in the sme lexicon. (Mánát vs. mánát)
- In the Xerox software, this is done by a separate FST: m (->) M || .#. _ ;
- 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
- allaskuvla, Riikaráđi and 1100-logu get no compound tags in the compound analysis. The first can be fixed by adding +Cmpnd in LEXICON ATTR, at the RReal border; similarly for the second in LEXICON ACCRA-NE, at both RReal borders.
- -prográmma gives
^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$
We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either as an inconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
- A combination of the two errors above: Ánde-máhka gives
^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$
but we should have something like
^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$
- Suddenly some nouns have the +Actor tag after +N... remove?
- geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
- Typos: I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)
- oktan seems to be used as a preposition, can we have that?
- Juo cuoŋománu 10. beaivvi ija vuostá mátkkoštii Ruvdnaprinseassa Märtha badjel ráji Ruŧŧii oktan Ruvdnaprinsabára golmmain mánáin.
- áhčči oktan mánáinis
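The casing item above (Mánát vs. mánát) can be sketched outside the transducer as a lookup fallback: try the surface form as written, and if it starts with a capital letter, also try it with the first letter lowercased. This mirrors the effect of the optional m (->) M rule at the start of a word, but it is only an illustration; the toy lexicon, the function names and the tag string are assumptions, not part of the Giellatekno sources.

```python
def casing_candidates(surface):
    """Return the forms to look up for a token, original form first."""
    candidates = [surface]
    if surface and surface[0].isupper():
        # Also try the form with only the first letter lowercased.
        candidates.append(surface[0].lower() + surface[1:])
    return candidates


# Hypothetical one-entry lexicon, for illustration only.
TOY_LEXICON = {"mánát": ["mánná<N><Pl><Nom>"]}


def analyse(surface, lexicon=TOY_LEXICON):
    """Collect the analyses of every casing candidate of the token."""
    analyses = []
    for form in casing_candidates(surface):
        analyses.extend(lexicon.get(form, []))
    return analyses
```

With this fallback, sentence-initial "Mánát" receives the same analyses as "mánát", while a lowercase token is never matched against capitalised-only entries.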
Compounding
Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
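The effect of weighting the compound border can be illustrated with a small filter over analysis strings. This is a sketch under assumptions: the "#" border marker, the weight value and the function name are hypothetical, and in a real setup the weights would live on the transducer arcs rather than be computed afterwards.

```python
def best_analyses(analyses, border="#", border_weight=1.0):
    """Keep only the analyses with the lowest total weight.

    Each occurrence of the (hypothetical) dynamic compound border
    marker adds border_weight; everything else costs nothing.
    """
    if not analyses:
        return []
    weights = [a.count(border) * border_weight for a in analyses]
    lowest = min(weights)
    return [a for a, w in zip(analyses, weights) if w == lowest]
```

A lexicalised analysis with no dynamic border has weight zero and therefore wins whenever it exists; a dynamic compound analysis survives only when nothing cheaper is available.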
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- lea go => ^leat<V><IV><Ind><Prs><Sg3><Qst> (just like "leago")
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
- Buorre beaivi => God dag
- leat guollebivddus => å fiske
- maid ban dainna => hva i all verden
- jagis jahkái => fra år til år
- oaidnaleapmái => 'see you'
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
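As a rough illustration of what adding the multiwords buys us, the sketch below groups known fixed expressions into single units with a greedy longest match over tokens. In the analyser itself the entries would be lexc entries; the table and the function name here are hypothetical and use only a few of the expressions listed above.

```python
# Hypothetical table with a few of the fixed expressions listed above.
TOY_MWES = {
    ("dasa", "lassin"),
    ("bures", "boahtin"),
    ("jagis", "jahkái"),
}


def group_multiwords(tokens, mwes=TOY_MWES):
    """Join token runs that match a known multiword, longest match first."""
    longest = max((len(entry) for entry in mwes), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible run first; runs of length 1 are not MWEs.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in mwes:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Matching is done on lowercased tokens so that a sentence-initial "Bures boahtin" is still grouped, while the original casing is kept in the output.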