Difference between revisions of "Northern Sámi and Norwegian/smemorf"

From Apertium
 
(39 intermediate revisions by the same user not shown)
Line 1: Line 1:
apertium-sme-nob TODOs for the sme morphological analyser from Giellatekno.
+
Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.
   
==Misc==
+
==Description==
* add entries from bidix that are missing from the analyser
 
** Missing [http://codepad.org/Dusebd68 nouns], [http://apertium.codepad.org/6Kr6H7RO adverbs], [http://codepad.org/7Hadok6S adjectives]
 
   
  +
The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.
   
  +
===Trimming===
* regex for URL's (don't want telefonkatalogen.no => *telefonkatalogen.nå)
 
   
  +
Trimming happens using the same HFST method as in the Turkic pairs etc. [[Automatically_trimming_a_monodix#Compounds_vs_trimming_in_HFST|Compounds are not handled correctly by this method]].
   
  +
===Tagset changes===
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
 
   
  +
The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
   
  +
* We remove Err/Orth/Usage tags (+Use/Sub, etc)
  +
* We remove any derivational analyses that aren't yet handled by transfer/bidix
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
  +
** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
  +
* We reorder some tags
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
  +
* We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
  +
* We change certain tags
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
  +
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
  +
  +
==TODO==
  +
===Misc===
 
* add entries from bidix that are missing from the analyser
 
* regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
 
* regex for acronyms like "GsoC:as" (tokenisation dependent...)
 
* <code>8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"));</code> -- this should be analysed as both, and disambiguated
 
   
  +
===Typos===
 
I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
   
  +
* a list of [http://paste.pocoo.org/raw/498998/ high-frequency typos] where the correction has an analysis
* geafivuohta gets one Der/vuohta-analysis that's got no A tag, and one with an A tag, fix:
 
** geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+A+Der3+Der/vuohta+N+Sg+Nom
 
** geafivuohta geafi+SgNomCmp+SgGenCmp+PlGenCmp+Der3+Der/vuohta+N+Sg+Nom
 
** geafivuohta geafivuohta+SgGenCmp+DefPlGenCmp+N+Sg+Nom
 
** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=852
 
 
 
* Typos: I've seen "odda" many places ("ođđa"), can we just add these to the analyser? (Would be cool to have charlifter/[[Diacritic Restoration]], but until then…)
 
 
 
* oktan seems to be used as a preposition, can we have that?
 
** Juo cuoŋománu 10. beaivvi ija vuostá mátkkoštii Ruvdnaprinseassa Märtha badjel ráji Ruŧŧii oktan Ruvdnaprinsabára golmmain mánáin.
 
** áhčči oktan mánáinis
 
** http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=853
 
 
* Numbers with case shouldn't get the ':' in between tags: <code>^19.00:s/19.00<Num>:<Sg><Loc>$</code>
 
 
==Compounding==
 
===ensure compounds are tagged as such===
 
http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=851
 
 
* ''allaskuvla'', ''Riikaráđi'' and ''1100-logu'' get no compound tags in the compound analysis, the first can be fixed by adding +Cmpnd in LEXICON ATTR, the RReal border, similarly for the second in LEXICON ACCRA-NE, both RReal borders.
 
 
* -prográmma gives <code>^-prográmma/+Cmpnd+prográmma<N><Sg><Nom>/+Cmpnd+prográmma<N><Sg><Acc>/+Cmpnd+prográmma<N><Sg><Gen>$</code>. We currently remove the +Cmpnd+ in dev/xfst2apertium.hashtags.twol, but then we have no indication that the word started with a dash (-). In other language pairs, the - is output separately (either as an unconditional lexical unit, like punctuation, or without any analysis, like unknown symbols).
 
 
** A combination of the two errors above: Ánde-máhka gives <code>^Ánde<N><Prop><Mal><Sg><Nom><-máhka><N><Sg><Nom>$</code>... we should have something like <code>^Ánde<N><Prop><Mal><Sg><Nom>+<-máhka><N><Sg><Nom>$</code>
 
 
===ensure compounding is only tried if there is no other solution===
 
Most general solution: Use a weighted transducer, and give the compound border (ie the dynamic compounding border, the R lexicon) a non-zero weight.
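
The weighting idea can be sketched outside the transducer as follows. This is only an illustration in Python, not how HFST implements weights: it assumes analyses arrive as Apertium-style strings in which a <code>+</code> outside angle brackets marks a compound border, and <code>COMPOUND_BORDER_WEIGHT</code>, <code>best_analyses</code> and the example tags are hypothetical.

```python
import re

# Hypothetical weight per dynamic compound border; any value above zero
# makes a non-compound analysis win whenever one exists.
COMPOUND_BORDER_WEIGHT = 1.0


def compound_borders(analysis: str) -> int:
    """Count '+' joints that sit outside <tag> brackets."""
    without_tags = re.sub(r"<[^>]*>", "", analysis)
    return without_tags.count("+")


def weight(analysis: str) -> float:
    """Total weight of an analysis: one penalty per compound border."""
    return COMPOUND_BORDER_WEIGHT * compound_borders(analysis)


def best_analyses(analyses: list[str]) -> list[str]:
    """Keep only the analyses that share the lowest weight."""
    if not analyses:
        return []
    lowest = min(weight(a) for a in analyses)
    return [a for a in analyses if weight(a) == lowest]


# Illustrative input: a lexicalised reading and a dynamic compound reading.
readings = [
    "allaskuvla<n><sg><nom>",
    "alla<a><sg><nom>+skuvla<n><sg><nom>",
]
print(best_analyses(readings))
```

If every analysis of a form is a compound, all of the lowest-weight compounds are kept, so compounding is still available when there is no other solution.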
 
   
  +
===Dashes===
However, hfst-proc does the same by just removing any analysis with a + in it if we have one without a +, so we're OK for now…
 
  +
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
   
==Multiwords==
+
===Multiwords===
 
Add simple multiwords and fixed expressions to the analyser.
 
   

Latest revision as of 09:46, 15 April 2015

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.
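
One of the tagset changes is purely a matter of format: +N becomes <n>, +Der/PassL becomes <der_passl>, and so on. The real conversion is done with relabel and regex files inside the HFST build; the Python below is only a rough sketch of that one step, assuming that analyses look like <code>lemma+Tag+Tag</code> and that the lemma itself contains no plus sign. The function name and the example lemmas are illustrative.

```python
def giella_to_apertium(analysis: str) -> str:
    """Rewrite 'lemma+Tag1+Tag2' as 'lemma<tag1><tag2>'.

    Assumes the lemma contains no '+'. Tags are lowercased and
    '/' is replaced by '_', as in +Der/PassL -> <der_passl>.
    """
    lemma, *tags = analysis.split("+")
    converted = ("<" + tag.lower().replace("/", "_") + ">" for tag in tags)
    return lemma + "".join(converted)


print(giella_to_apertium("viessu+N+Sg+Nom"))
print(giella_to_apertium("lohkat+V+TV+Der/PassL+V+Inf"))
```

This covers only the change of format; removing, reordering and relabelling tags are separate steps that the sketch does not attempt.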

TODO

Misc

  • add entries from bidix that are missing from the analyser
  • regex for URLs (don't want telefonkatalogen.no => *telefonkatalogen.nå)
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
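
As a starting point for the URL item, a pattern along these lines might do. It is a Python sketch only: the analyser would need the equivalent written as an xfst/lexc regex, the pattern is deliberately loose, and <code>URL_LIKE</code> and <code>is_url_like</code> are made-up names. Acronyms with case suffixes are not covered, since those depend on tokenisation.

```python
import re

# Loose pattern: optional scheme, one or more dot-separated labels,
# an alphabetic top-level domain, and an optional path.
URL_LIKE = re.compile(
    r"^(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?$",
    re.IGNORECASE,
)


def is_url_like(token: str) -> bool:
    """Return True if the token looks like a URL or bare domain name."""
    return URL_LIKE.match(token) is not None


for token in ["telefonkatalogen.no", "ođđa", "19.00"]:
    print(token, is_url_like(token))
```

A token recognised this way could be passed through unchanged instead of having its final ".no" analysed and translated.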

Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (Would be cool to have charlifter/Diacritic Restoration, but until then…)

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
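
To show what treating such expressions as single units means in practice, here is a small Python sketch of greedy longest-match grouping over a token list. It is not how the analyser does it (there the entries live in the lexc source and the transducer does the matching); <code>MWES</code> holds a few entries from the list above, and <code>group_mwes</code> and the sample tokens are hypothetical.

```python
# A few multiword entries from the list above, stored as lowercase token tuples.
MWES = {
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("bures", "boahtin"),
    ("jagis", "jahkái"),
    ("ja", "nu", "ain"),
}
MAX_LEN = max(len(entry) for entry in MWES)


def group_mwes(tokens: list[str]) -> list[str]:
    """Join known multiword expressions into single units, longest match first."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWES:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword entry starts here; keep the single token.
            units.append(tokens[i])
            i += 1
    return units


print(group_mwes(["Dasa", "lassin", "ja", "nu", "ain"]))
```

Matching is case-insensitive here so that a sentence-initial "Dasa lassin" is still found; whether the analyser should do the same is a separate decision.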

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It'd be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.