Difference between revisions of "Northern Sámi and Norwegian/smemorf"

From Apertium
 
(65 intermediate revisions by 3 users not shown)
Line 1: Line 1:
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.


==Misc==
==Description==
* [[Ideas for Google Summer of Code/Morphology with HFST|HFST tokenisation]]
** Split up lexicon into standard and inconditional (punctuation) for hfst-proc


The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
* regex for acronyms like "GsoC:as" (tokenisation dependent...)


===Trimming===
* Proper casing support in the sme lexicon. (Mánát vs. mánát)
** In the xerox software, this is done by a separate fst m (->) M || .#. _ ;


Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated


===Tagset changes===
* Ánde-máhka gives <code>^Ánde<N><Prop><Mal><Sg><Nom><-+máhka><N><Sg><Nom>$</code>... we should have something like <code>^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$</code>


The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
* Suddenly some nouns have the +Actor tag after +N... remove?


* geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
** geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
** geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
** geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom

* We remove Err/Orth/Usage tags (+Use/Sub, etc.)
* We remove any derivational analyses that aren't yet handled by transfer/bidix
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
* We reorder some tags
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
* We change the format of tags, so +N becomes &lt;n&gt;, +Der/PassL becomes &lt;der_passl&gt;, etc.
* We change certain tags
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
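
The tag format change can be illustrated with a minimal Python sketch. It only mirrors the two examples given above (+N becomes &lt;n&gt;, +Der/PassL becomes &lt;der_passl&gt;); the function names are hypothetical, and the real conversion is done by the relabel and regex files linked above, not by this code. It also assumes that the lemma itself contains no "+".

```python
def convert_tag(tag: str) -> str:
    """Convert one Giellatekno-style tag (e.g. '+Der/PassL') to Apertium style."""
    # Assumption: lowercasing and replacing '/' with '_' covers the format change;
    # tags that are renamed or removed by the relabel files are not handled here.
    return "<" + tag.lstrip("+").lower().replace("/", "_") + ">"


def convert_analysis(analysis: str) -> str:
    """Convert 'lemma+Tag1+Tag2' to 'lemma<tag1><tag2>' (lemma must not contain '+')."""
    lemma, *tags = analysis.split("+")
    return lemma + "".join(convert_tag(tag) for tag in tags)


if __name__ == "__main__":
    print(convert_tag("+N"))
    print(convert_tag("+Der/PassL"))
    print(convert_analysis("geafivuohta+N+Sg+Nom"))
```

This covers only the mechanical format change; reordering and renaming of individual tags would still need the separate steps listed above.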

==TODO==
===Misc===
* add entries from bidix that are missing from the analyser
* regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated


===Typos===
* -prográmma gives <code>^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$</code>. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either an inconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
I've seen "odda" in many places (for "ođđa"); can we just add such typos to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)


* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
* allaskuvla gets no compound tag in the compound analysis


==Compounding==
===Dashes===
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
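
The effect of a weighted compound border can be sketched in Python as follows. This is not how HFST is invoked; it only shows the selection logic, assuming each analysis comes with a weight and that every dynamic compound border adds a positive penalty. The analysis strings and the weight values are placeholders.

```python
from typing import List, Tuple


def best_analyses(analyses: List[Tuple[str, float]]) -> List[str]:
    """Keep only the analyses with the lowest weight.

    If lexicalised analyses have weight 0 and each compound border adds a
    positive weight, compound analyses survive only when nothing else exists.
    """
    if not analyses:
        return []
    lowest = min(weight for _, weight in analyses)
    return [analysis for analysis, weight in analyses if weight == lowest]


if __name__ == "__main__":
    # Placeholder analyses: one lexicalised reading, one dynamic compound reading.
    candidates = [("lexicalised-analysis", 0.0), ("compound-analysis", 1.0)]
    print(best_analyses(candidates))
```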


==Multiwords==
===Multiwords===
Add simple multiwords and fixed expressions to the analyser.


* lea go => <code>^leat<V><IV><Ind><Prs><Sg3><Qst></code> (just like "leago")
* dasa lassin => i tillegg (til det)
* dán áigge => for tiden
Line 40: Line 52:
* maid ban dainna => hva i all verden
* jagis jahkái => fra år til år
* oaidnaleapmái => 'see you'
* ovdamearkka => for eksempel
* Mo manná? => Hvordan går det?
* ja nu ain => og så videre


(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Latest revision as of 09:46, 15 April 2015

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.

TODO

Misc

  • add entries from bidix that are missing from the analyser
  • regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
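
As a rough illustration of the URL and acronym items in this list, here is a Python sketch of token-level patterns. The patterns and the `classify` helper are assumptions made for illustration, not the regexes used in the analyser, and they would still depend on how tokenisation splits the input.

```python
import re
from typing import Optional

# Assumption: a URL-like token is an optional scheme, dot-separated labels,
# an alphabetic top-level domain and an optional path.
URL_LIKE = re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE)

# Assumption: an inflected acronym contains at least one ASCII capital,
# followed by a colon and a suffix, as in "GsoC:as".
ACRONYM_WITH_SUFFIX = re.compile(r"\w*[A-Z]\w*:\w+")


def classify(token: str) -> Optional[str]:
    """Return 'url', 'acronym' or None for a single token."""
    if URL_LIKE.fullmatch(token):
        return "url"
    if ACRONYM_WITH_SUFFIX.fullmatch(token):
        return "acronym"
    return None


if __name__ == "__main__":
    for token in ["telefonkatalogen.no", "GsoC:as", "mánát"]:
        print(token, classify(token))
```

A token classified as a URL could then be passed through untranslated, which would avoid outputs like *telefonkatalogen.nå.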

Typos

I've seen "odda" in many places (for "ođđa"); can we just add such typos to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)
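
Until proper diacritic restoration is available, one way to find candidate typo entries is to generate the ASCII-only spelling of known lemmas. The Python sketch below assumes that the typos in question are plain diacritic omissions; the character table and function names are illustrative and not part of the analyser.

```python
from typing import Dict, Iterable

# Assumption: typists replace the Northern Sámi letters with their closest ASCII letter.
ASCII_FALLBACK = str.maketrans({
    "á": "a", "č": "c", "đ": "d", "ŋ": "n", "š": "s", "ŧ": "t", "ž": "z",
    "Á": "A", "Č": "C", "Đ": "D", "Ŋ": "N", "Š": "S", "Ŧ": "T", "Ž": "Z",
})


def ascii_variant(word: str) -> str:
    """Return the spelling with all Sámi-specific letters replaced by ASCII letters."""
    return word.translate(ASCII_FALLBACK)


def typo_entries(lemmas: Iterable[str]) -> Dict[str, str]:
    """Map each ASCII-only misspelling to the correctly spelled lemma."""
    return {ascii_variant(lemma): lemma
            for lemma in lemmas
            if ascii_variant(lemma) != lemma}


if __name__ == "__main__":
    print(typo_entries(["ođđa", "leat"]))
```

Misspellings that collide with an existing word form would need to be filtered out before being added to the analyser.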

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
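
A separate multiword file could be generated with something like the following Python sketch. It assumes that lexc requires spaces inside an entry to be escaped with "%", and it uses a hypothetical tag and continuation class; the actual tags, continuation lexicons and the behaviour of update-lexc.sh are not shown here.

```python
def lexc_escape(text: str) -> str:
    """Escape literal '%' and spaces for use inside a lexc entry."""
    return text.replace("%", "%%").replace(" ", "% ")


def mwe_lexc_line(surface: str, tag: str, continuation: str = "#") -> str:
    """Build one lexc line 'lemma+Tag:surface CONT ;' for a fixed expression.

    The tag and the continuation class are placeholders chosen by the caller.
    """
    escaped = lexc_escape(surface)
    return f"{escaped}+{tag}:{escaped} {continuation} ;"


if __name__ == "__main__":
    # "Adv" is an assumed tag for illustration only.
    print(mwe_lexc_line("dasa lassin", "Adv"))
```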

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.