Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Latest revision as of 09:46, 15 April 2015

Documentation and TODOs for apertium-sme-nob's sme morphological analyser, which comes from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, and thus differs a bit from the one in Giellatekno.

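We remove Err/Orth/Usage tags (+Use/Sub, etc.)
We remove any derivational analyses that aren't yet handled by transfer/bidix
- https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
- https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
- More on derivations in sme-nob: see the wiki page "Northern Sámi and Norwegian/Derivations"
We reorder some tags
- https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
We change certain tags
- https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
- https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
- https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex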
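
As a rough illustration of the tag-format change and the removal of usage tags, here is a minimal Python sketch. It is not the actual pipeline, which uses the regex and relabel files in the Giellatekno repository. The lowercase-and-underscore rule is inferred only from the two examples above (+N, +Der/PassL), the dropped prefixes "Err/" and "Use/" are an assumption, and the function name and sample analyses are made up for illustration.

```python
# Assumption: tags with these prefixes count as error/usage tags to drop.
DROPPED_PREFIXES = ("Err/", "Use/")


def to_apertium(analysis: str) -> str:
    """Turn 'lemma+Tag+Tag' into 'lemma<tag><tag>' (illustrative only)."""
    lemma, *tags = analysis.split("+")
    # Drop error/usage tags before reformatting the rest.
    kept = [t for t in tags if not t.startswith(DROPPED_PREFIXES)]
    # Inferred rule: lowercase the tag and turn '/' into '_'.
    return lemma + "".join("<" + t.lower().replace("/", "_") + ">" for t in kept)


if __name__ == "__main__":
    print(to_apertium("viessu+N+Use/Sub+Sg+Nom"))
```

The sketch assumes the lemma comes first and contains no "+"; real analyses (compounds, prefixed tags) would need more care, and tags that get renamed rather than reformatted are not covered.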

TODO

Misc

add entries from bidix that are missing from the analyser
regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
regex for acronyms like "GsoC:as" (tokenisation dependent...)
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both TV and IV, and then disambiguated
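
The two regex items above could start from something like the following Python sketch. The patterns are illustrative guesses, not what the analyser or tokeniser uses: the URL pattern only covers bare domain names with an optional http(s) scheme and path, and the acronym pattern assumes a stem with at least one capital letter, a colon, and a lowercase suffix, as in "GsoC:as". Both function names are hypothetical.

```python
import re

# Bare domains such as "telefonkatalogen.no", optionally with scheme and path.
URL_RE = re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE)

# Letters only, at least one capital, then ":" and a lowercase case suffix.
ACRONYM_RE = re.compile(r"([^\W\d_]*[A-ZÁČĐŊŠŦŽ][^\W\d_]*):([a-záčđŋšŧž]+)")


def looks_like_url(token: str) -> bool:
    """True if the whole token looks like a URL or bare domain name."""
    return URL_RE.fullmatch(token) is not None


def split_acronym(token: str):
    """Return (stem, suffix) for tokens like 'GsoC:as', else None."""
    match = ACRONYM_RE.fullmatch(token)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    print(looks_like_url("telefonkatalogen.no"))
    print(split_acronym("GsoC:as"))
```

A token recognised as a URL could then be passed through untranslated. As the item notes, whether any of this works depends on how the input is tokenised.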

Typos

I've seen "odda" in many places (for "ođđa"); can we just add such forms to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)

a list of high-frequency typos where the correction has an analysis
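
One way to build such a list is to generate diacritic variants of an unanalysed word and keep only those that are known. The sketch below is a naive stand-in, not charlifter: the character mapping is an assumed set of lowercase Northern Sámi letter pairs, and `known_words` is a hypothetical placeholder for "the analyser gives this form an analysis".

```python
from itertools import product

# Assumed mapping from plain letters to themselves plus a diacritic variant.
VARIANTS = {"a": "aá", "c": "cč", "d": "dđ", "n": "nŋ", "s": "sš", "t": "tŧ", "z": "zž"}


def restoration_candidates(word, known_words):
    """Return known words reachable from `word` by adding diacritics."""
    # For each character, the set of letters it may stand for.
    options = [VARIANTS.get(ch, ch) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return sorted(c for c in candidates if c != word and c in known_words)


if __name__ == "__main__":
    print(restoration_candidates("odda", {"ođđa"}))
```

The number of candidates grows exponentially with the number of ambiguous letters, so this only suits short words or a precomputed typo list. Uppercase letters are ignored.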

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it cause issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

dasa lassin => i tillegg (til det)
dán áigge => for tiden
mun ieš => meg selv
bures boahtin => velkommen
Buorre beaivi => God dag
leat guollebivddus => å fiske
maid ban dainna => hva i all verden
jagis jahkái => fra år til år
oaidnaleapmái => 'see you'
ovdamearkka => for eksempel
Mo manná? => Hvordan går det?
ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

oktavuođas => i forbindelse med

It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.
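
If the multiword and fixed-expression entries are kept in a separate file, the lines for it could be generated along these lines. This is a sketch under several assumptions: the helper names are made up, the continuation class "#" (end of word) is a placeholder, the +Adv tag for "dasa lassin" is a guess (only Po for oktavuođas comes from the text above), and only spaces are escaped with lexc's "%" escape character.

```python
def lexc_escape(text: str) -> str:
    """Escape spaces the lexc way, so 'dasa lassin' becomes 'dasa% lassin'."""
    return text.replace(" ", "% ")


def lexc_entry(surface: str, tags, continuation: str = "#") -> str:
    """Build an 'upper:lower Continuation ;' line for a fixed expression."""
    upper = lexc_escape(surface) + "".join("+" + tag for tag in tags)
    return f"{upper}:{lexc_escape(surface)} {continuation} ;"


if __name__ == "__main__":
    print(lexc_entry("oktavuođas", ["Po"]))
    print(lexc_entry("dasa lassin", ["Adv"]))  # +Adv is an assumed tag
```

Any tag used this way would also have to be declared as a multichar symbol in the lexc file, and other characters that are special in lexc would need escaping as well.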

@@ Line 1: / Line 1: @@
-apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
+Documentation and TODO's for apertium-sme-nob's sme morphological analyser from Giellatekno.
-==Misc==
+==Description==
-* [[Ideas for Google Summer of Code/Morphology with HFST|HFST tokenisation]]
-** Split up lexicon into standard and inconditional (punctuation) for hfst-proc
+The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
-* regex for acronyms like "GsoC:as" (tokenisation dependent...)
+===Trimming===
-* Proper casing support in the sme lexicon.  (Mánát vs. mánát)
-** In the xerox software, this is done by a separate fst m (->) M || .#. _ ;
+Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
-* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
+===Tagset changes===
-* Ánde-máhka gives <code>^Ánde<N><Prop><Mal><Sg><Nom><-+máhka><N><Sg><Nom>$</code>... we should have something like <code>^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$</code>
+The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
-* Suddenly some nouns have the +Actor tag after +N... remove?
+* We remove Err/Orth/Usage tags (+Use/Sub, etc) –
-* geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
+* We remove any derivational analyses that aren't yet handled by transfer/bidix
-** geafivuohta     geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
+** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
-** geafivuohta     geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
+** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
-** geafivuohta     geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
+** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
+* We reorder some tags
+** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
+* We change the format of tags, so +N becomes &lt;n&gt;, +Der/PassL becomes &lt;der_passl&gt;, etc.
+* We change certain tags
+** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
+** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
+** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
+==TODO==
+===Misc===
+* add entries from bidix that are missing from the analyser
+* regex for URL's (don't want telefonkatalogen.no => *telefonkatalogen.nå)
+* regex for acronyms like "GsoC:as" (tokenisation dependent...)
+* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
+===Typos===
-* -prográmma gives <code>^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$</code>. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either an inconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
+I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
+* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
-* allaskuvla gets no compound tag in the compound analysis
-==Compounding==
+===Dashes===
+lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
-Ensure compounding is only tried if there is no other solution. Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
-==Multiwords==
+===Multiwords===
 Add simple multiwords and fixed expressions to the analyser.
-* lea go => <code>^leat<V><IV><Ind><Prs><Sg3><Qst></code> (just like "leago")
 * dasa lassin => i tillegg (til det)
 * dán áigge => for tiden
@@ Line 40: / Line 52: @@
 * maid ban dainna => hva i all verden
 * jagis jahkái => fra år til år
+* oaidnaleapmái => 'see you'
+* ovdamearkka => for eksempel
+* Mo manná? => Hvordan går det?
+* ja nu ain => og så videre
 (Some of these MWE's might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
