Difference between revisions of "Northern Sámi and Norwegian/smemorf"
Revision as of 09:16, 10 May 2010
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
Misc
- HFST tokenisation
- find all Foc/foo tags and put them in dev/xfst2apertium.relabel
- regex for acronyms like "GsoC:as" (tokenisation dependent...)
- Proper casing support in the sme lexicon. (Mánát vs. mánát)
  - In the Xerox software, this is done by a separate FST: m (->) M || .#. _ ;
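The rule above optionally maps a lower-case m to M when it is the first symbol of the word, so a sentence-initial "Mánát" can still be matched against the lexicon entry "mánát". As a rough illustration of that effect on the analysis side (not of how the FST is compiled), here is a minimal Python sketch. The function name `casing_candidates` is hypothetical, and generalising from m/M to any initial letter is an assumption.

```python
def casing_candidates(surface: str) -> list[str]:
    """Return the forms to try in the lexicon for one surface token.

    Mirrors the effect of an optional word-initial case rule such as
    m (->) M || .#. _ ; the surface form is always tried as-is, and a
    form with a lower-cased first letter is added when it differs.
    """
    candidates = [surface]
    if surface and surface[0].isupper():
        # Only the first character is changed; the rest is kept verbatim.
        lowered = surface[0].lower() + surface[1:]
        if lowered != surface:
            candidates.append(lowered)
    return candidates
```

With this sketch, "Mánát" yields both "Mánát" and "mánát" as lookup candidates, while "mánát" yields only itself.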
Compounding
Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (i.e. the dynamic compounding border, the R lexicon) a non-zero weight.
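The intended effect of the weighting can be sketched in Python: every compound border adds a fixed cost, and only the cheapest analyses are kept, so a compound reading survives only when no non-compound reading exists. This is an illustration of the selection logic, not of the transducer itself. The border marker `#`, the weight value and the placeholder analysis strings are all assumptions made for the example.

```python
COMPOUND_BORDER = "#"   # hypothetical marker for a dynamic compound border
COMPOUND_WEIGHT = 1.0   # assumed non-zero cost per border


def weight(analysis: str) -> float:
    """Total weight of an analysis: one fixed cost per compound border."""
    return COMPOUND_WEIGHT * analysis.count(COMPOUND_BORDER)


def best_analyses(analyses: list[str]) -> list[str]:
    """Keep only the analyses with the lowest total weight."""
    if not analyses:
        return []
    lowest = min(weight(a) for a in analyses)
    return [a for a in analyses if weight(a) == lowest]
```

For example, given the placeholder analyses `"foobar+N"` and `"foo+N#bar+N"`, only `"foobar+N"` is kept; if `"foo+N#bar+N"` is the only analysis, it is returned unchanged.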
Multiwords
Add simple multiwords and fixed expressions to the analyser.
MWEs won't be recognised until we get proper HFST tokenisation; e.g. "ovdal go" (før) is already in the analyser.
- lea go => ^leat<V><IV><Ind><Prs><Sg3><Qst> (just like "leago")
- dasa lassin => i tillegg (til det)
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
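Why multiwords depend on tokenisation can be shown with a small Python sketch of greedy longest-match tokenisation: the tokeniser has to try the two-word span before falling back to single words, otherwise "ovdal go" never reaches the analyser as one unit. This is not how HFST tokenisation works internally. The function name `tokenise` and the hard-coded `MULTIWORDS` set (filled with the expressions mentioned on this page) are assumptions for illustration, and matching is case-sensitive.

```python
# Multiword expressions mentioned on this page, stored as word tuples.
MULTIWORDS = {("ovdal", "go"), ("lea", "go"), ("dasa", "lassin")}
MAX_LEN = max(len(m) for m in MULTIWORDS)


def tokenise(text: str) -> list[str]:
    """Split on whitespace, joining the longest known multiword at each position."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            # A single word is always accepted as a fallback.
            if n == 1 or chunk in MULTIWORDS:
                tokens.append(" ".join(chunk))
                i += n
                break
    return tokens
```

Called on "dasa lassin lea go", the sketch returns two tokens, "dasa lassin" and "lea go", which can then be looked up as single entries.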