Difference between revisions of "Northern Sámi and Norwegian/smemorf"
| (49 intermediate revisions by 2 users not shown) | |||
| Line 1: | Line 1: | ||
| apertium-sme-nob | Documentation and TODO's for apertium-sme-nob's sme morphological analyser from Giellatekno. | ||
| == | ==Description== | ||
| * [[Ideas for Google Summer of Code/Morphology with HFST|HFST tokenisation]] | |||
| ** Split up lexicon into standard and inconditional (punctuation) for hfst-proc | |||
| The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno. | |||
| ===Trimming=== | |||
| Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]]. | |||
| ===Tagset changes=== | |||
| * Proper casing support in the sme lexicon.  (Mánát vs. mánát) | |||
| ** In the xerox software, this is done by a separate fst m (->) M || .#. _ ; | |||
| The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno. | |||
| * We remove Err/Orth/Usage tags (+Use/Sub, etc) –  | |||
| * We remove any derivational analyses that aren't yet handled by transfer/bidix | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex | |||
| ** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]] | |||
| * We reorder some tags | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex | |||
| * We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc. | |||
| * We change certain tags | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel | |||
| ** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex | |||
| ==TODO== | |||
| ===Misc=== | |||
| * add entries from bidix that are missing from the analyser | |||
| * regex for URL's (don't want telefonkatalogen.no => *telefonkatalogen.nå) | |||
| * <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated | * <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated | ||
| ===Typos=== | |||
| * a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis | |||
| * Suddenly some nouns have the +Actor tag after +N... remove? | |||
| * geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix: | |||
| ** geafivuohta     geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom | |||
| ** geafivuohta     geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom | |||
| ** geafivuohta     geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom | |||
| ** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=852 | |||
| * oktan seems to be used as a preposition, can we have that? | |||
| ** Juo cuoŋománu 10. beaivvi ija vuostá mátkkoštii Ruvdnaprinseassa Märtha badjel ráji Ruŧŧii oktan Ruvdnaprinsabára golmmain mánáin. | |||
| ** áhčči oktan mánáinis | |||
| ** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853 | |||
| ==Compounding== | |||
| ===ensure compounds are tagged as such=== | |||
| http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=851 | |||
| * ''allaskuvla'', ''Riikaráđi'' and ''1100-logu'' get no compound tags in the compound analysis, the first can be fixed by adding +Cmpnd in LEXICON ATTR, the RReal border, similarly for the second in LEXICON ACCRA-NE, both RReal borders. | |||
| * -prográmma gives <code>^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$</code>. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either an inconditional lexical unit, like punctuation, or without any analysis, like unknown symbols). | |||
| ** A combination of the two errors above: Ánde-máhka gives <code>^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$</code>... we should have something like <code>^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$</code> | |||
| ===ensure compounding is only tried if there is no other solution=== | |||
| Most general solution: Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight. | |||
| ===Dashes=== | |||
| However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now… | |||
| lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). | |||
| ==Multiwords== | ===Multiwords=== | ||
| Add simple multiwords and fixed expressions to the analyser. | Add simple multiwords and fixed expressions to the analyser. | ||
| * lea go => <code>^leat<V><IV><Ind><Prs><Sg3><Qst></code> (just like "leago") | |||
| * dasa lassin => i tillegg (til det) | * dasa lassin => i tillegg (til det) | ||
| * dán áigge => for tiden | * dán áigge => for tiden | ||
| Line 64: | Line 55: | ||
| * ovdamearkka => for eksempel | * ovdamearkka => for eksempel | ||
| * Mo manná? => Hvordan går det? | * Mo manná? => Hvordan går det? | ||
| * ja nu ain => og så videre | |||
| (Some of these MWE's might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.) | (Some of these MWE's might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.) | ||
Latest revision as of 09:46, 15 April 2015
Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.
Description
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
Trimming
Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.
Tagset changes
The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
- We remove Err/Orth/Usage tags (+Use/Sub, etc.)
- We remove any derivational analyses that aren't yet handled by transfer/bidix
- We reorder some tags
- We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
- We change certain tags
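The format change in the list above can be illustrated with a minimal Python sketch. It is not the conversion actually used in the build (that is done with the regex and relabel files); it only reproduces the two examples given here, and it assumes that the lemma contains no "+" and that every tag is simply lowercased with "/" turned into "_". The function name `giella_to_apertium` is made up for this illustration.

```python
def giella_to_apertium(analysis: str) -> str:
    """Rewrite 'lemma+Tag1+Tag2' as 'lemma<tag1><tag2>'.

    Assumption for this sketch: the first '+'-separated field is the lemma
    and every later field is a tag.
    """
    lemma, *tags = analysis.split("+")
    # Lowercase each tag and replace '/' with '_', e.g. Der/PassL -> der_passl
    return lemma + "".join("<" + tag.lower().replace("/", "_") + ">" for tag in tags)


if __name__ == "__main__":
    print(giella_to_apertium("geafivuohta+N+Sg+Nom"))
```

Tag removal, reordering and relabelling are separate steps and are not covered by this sketch.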
TODO
Misc
- add entries from bidix that are missing from the analyser
- regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
- regex for acronyms like "GSoC:as" (tokenisation dependent...)
- 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- lávet should be analysed as both TV and IV, and then disambiguated
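As a rough idea of what the URL item could match, here is a Python sketch. The pattern is hypothetical and deliberately simple (optional http/https scheme, dot-separated labels, a final label of at least two letters, optional path); the real fix would have to be written in the analyser's own regex formalism, and the name `is_url_like` is invented for this example.

```python
import re

# Hypothetical pattern, an assumption of this sketch rather than the analyser's regex:
# optional scheme, one or more "label." parts, a final label of 2+ letters, optional path.
URL_LIKE = re.compile(r"^(?:https?://)?(?:[\w-]+\.)+[^\W\d_]{2,}(?:/\S*)?$")


def is_url_like(token: str) -> bool:
    """Return True if the whole token looks like a URL or bare domain name."""
    return URL_LIKE.match(token) is not None


if __name__ == "__main__":
    for token in ["telefonkatalogen.no", "mánát", "beaivi."]:
        print(token, is_url_like(token))
```

A token that matches could then be passed through untranslated, so that the final ".no" is never looked up as a word. A sentence-final word followed by a full stop does not match, because nothing follows the dot.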
Typos
I've seen "odda" in many places (for "ođđa"); can we just add such forms to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)
- a list of high-frequency typos where the correction has an analysis
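One way to find candidates for such a list is to strip diacritics from words that already have an analysis and keep the stripped forms that are not themselves words. The Python sketch below assumes the lexicon is available as a plain set of word forms and that typists replace đ, ŧ and ŋ with d, t and n; both are assumptions of this example, and the function names are made up.

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping
# (the d/t/n replacements are an assumption about how people type).
NO_DECOMPOSITION = str.maketrans(
    {"đ": "d", "Đ": "D", "ŧ": "t", "Ŧ": "T", "ŋ": "n", "Ŋ": "N"}
)


def strip_diacritics(word: str) -> str:
    """Return the word with diacritics removed, e.g. 'ođđa' -> 'odda'."""
    decomposed = unicodedata.normalize("NFD", word.translate(NO_DECOMPOSITION))
    # Drop combining marks left behind by the decomposition (á -> a, č -> c, ...)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def typo_table(lexicon: set) -> dict:
    """Map each stripped form that is not itself a word to its possible corrections."""
    table = {}
    for word in lexicon:
        folded = strip_diacritics(word)
        if folded != word and folded not in lexicon:
            table.setdefault(folded, set()).add(word)
    return table


if __name__ == "__main__":
    print(typo_table({"ođđa", "manná"}))
```

Stripped forms that collide with a real word are skipped, since adding them would create new ambiguity rather than fix a typo.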
Dashes
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, wouldn't it cause issues with lemma-matching in CG?).
Multiwords
Add simple multiwords and fixed expressions to the analyser.
- dasa lassin => i tillegg (til det)
- dán áigge => for tiden
- mun ieš => meg selv
- bures boahtin => velkommen
- Buorre beaivi => God dag
- leat guollebivddus => å fiske
- maid ban dainna => hva i all verden
- jagis jahkái => fra år til år
- oaidnaleapmái => 'see you'
- ovdamearkka => for eksempel
- Mo manná? => Hvordan går det?
- ja nu ain => og så videre
(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:
- oktavuođas => i forbindelse med
It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.

