Difference between revisions of "Northern Sámi and Norwegian/smemorf"
Revision as of 12:56, 7 June 2010
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
Misc
- HFST tokenisation
- Also, split up lexicon into standard and inconditional (punctuation) for hfst-proc
- regex for acronyms like "GsoC:as" (tokenisation dependent...)
- Proper casing support in the sme lexicon. (Mánát vs. mánát)
- In the xerox software, this is done by a separate fst m (->) M || .#. _ ;
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));
-- this should be analysed as both, and disambiguated
- Ánde-máhka gives
^Ánde<N><Prop><Mal><Sg><Nom><-+máhka><N><Sg><Nom>$
... we should have something like
^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$
- Suddenly some nouns have the +Actor tag after +N... remove?
- geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
- geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
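The casing item in the list above (Mánát vs. mánát) can be illustrated outside the FST. The following is only a rough Python sketch of what the optional first-letter rule achieves at lookup time, not how the Xerox or HFST tools implement it. The toy lexicon, its tag string and the function names are hypothetical.

```python
from typing import Dict, List


def case_variants(token: str) -> List[str]:
    """Return the surface forms to look up for one token, original first."""
    variants = [token]
    # Only the first letter is optionally lowercased, mirroring a rule
    # that is restricted to the word-initial position.
    if token and token[0].isupper():
        lowered = token[0].lower() + token[1:]
        if lowered != token:
            variants.append(lowered)
    return variants


def analyse(token: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Collect the analyses of every case variant of the token."""
    analyses: List[str] = []
    for form in case_variants(token):
        analyses.extend(lexicon.get(form, []))
    return analyses


# Hypothetical toy lexicon; the tag string is illustrative only.
TOY_LEXICON = {"mánát": ["mánná<N><Pl><Nom>"]}

print(analyse("Mánát", TOY_LEXICON))
print(analyse("mánát", TOY_LEXICON))
```

A sentence-initial "Mánát" then receives the same analyses as "mánát", while a form that is only listed capitalised (a proper noun, say) is still found through the unchanged first variant.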
Compounding
Ensure compounding is tried only when there is no other analysis. Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
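The effect of weighting the compound border can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HFST implementation: the "#" border marker, the weight value and the example analysis strings are all hypothetical. Each dynamic compound border adds a fixed penalty, and only the analyses with the lowest total weight are kept.

```python
from typing import List

# Assumed penalty per dynamic compound border; any non-zero value
# makes compound analyses lose against non-compound ones.
COMPOUND_BORDER_WEIGHT = 1.0


def weight(analysis: str, border: str = "#") -> float:
    """Total weight of an analysis: one penalty per compound border."""
    return COMPOUND_BORDER_WEIGHT * analysis.count(border)


def best_analyses(analyses: List[str]) -> List[str]:
    """Keep only the analyses that share the lowest weight."""
    if not analyses:
        return []
    weighted = [(weight(a), a) for a in analyses]
    lowest = min(w for w, _ in weighted)
    return [a for w, a in weighted if w == lowest]


# Hypothetical analyses of one word: lexicalised vs. dynamic compound.
candidates = ["geafivuohta+N+Sg+Nom", "geafi+N#vuohta+N+Sg+Nom"]
print(best_analyses(candidates))
```

When a word has no lexicalised analysis, every candidate carries at least one border, so the compound analyses with the fewest borders survive. That is the "only if there is no other solution" behaviour described above.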
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- lea go => ^leat<V><IV><Ind><Prs><Sg3><Qst> (just like "leago")
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
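To show what treating these expressions as single units means for tokenisation, here is a minimal Python sketch of greedy longest-match grouping over whitespace-separated words. It is an illustration only and not how HFST or hfst-proc tokenise. The MWE entries are the ones listed above, while the function name and the lowercase matching are assumptions.

```python
from typing import List, Set, Tuple

# Multiword entries taken from the list above, stored as word tuples.
MWES: Set[Tuple[str, ...]] = {
    ("lea", "go"),
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("mun", "ieš"),
    ("bures", "boahtin"),
}


def group_multiwords(words: List[str], mwes: Set[Tuple[str, ...]] = MWES) -> List[str]:
    """Greedily join the longest known multiword starting at each position."""
    max_len = max((len(m) for m in mwes), default=1)
    units: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in mwes:
                units.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here: emit the single word.
            units.append(words[i])
            i += 1
    return units


print(group_multiwords("Bures boahtin".split()))
print(group_multiwords("dán áigge lea go".split()))
```

Each joined unit can then be looked up as one entry, which is why "lea go" can receive the same analysis as "leago".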
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.