Difference between revisions of "Northern Sámi and Norwegian/smemorf"

(deprecated)
 
(2 intermediate revisions by the same user not shown)
Line 13: Line 13:
The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.


#* We remove Usage tags (+Use/Sub, etc), see the set ''Useless''
* We remove Err/Orth/Usage tags (+Use/Sub, etc)
#* We remove any derivational analyses that aren't yet handled by transfer/bidix
* We remove any derivational analyses that aren't yet handled by transfer/bidix
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/filters/remove-derivation-strings-modifications.nob.regex
#** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/remove-illegal-derivation-strings.regex
#* We remove the - from split compound lemmas so that they may be looked up in bidix.
** [[Northern Sámi and Norwegian/Derivations|More on derivations in sme-nob]]
#* We remove the #-mark between those compounds that are lexicalised/non-dynamic (this should not be necessary any longer?)
* We reorder some tags
#* We also ensure the +G3 tag occurs ''after'' the +N tag, a common upstream bug in the lexc files
** https://victorio.uit.no/langtech/trunk/langs/sme/src/filters/reorder-subpos-tags.sme.regex
#* We change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
* We change the format of tags, so +N becomes <n>, +Der/PassL becomes <der_passl>, etc.
* We change certain tags
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/apertium.postproc.relabel
** https://victorio.uit.no/langtech/trunk/langs/sme/tools/mt/apertium/tagsets/modify-tags.nob.regex
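The tag format change listed above (+N becomes <n>, +Der/PassL becomes <der_passl>) can be illustrated with a small Python sketch. This is not the actual implementation, which lives in the regex/relabel files linked above; the function name `to_apertium_tags` and the example analysis strings are made up for illustration, and the sketch assumes the lemma itself contains no `+`. It only covers the format change, not the tag removals, reorderings or relabellings.

```python
def to_apertium_tags(analysis: str) -> str:
    """Rewrite 'lemma+Tag1+Tag2' as 'lemma<tag1><tag2>'.

    Hypothetical helper: lowercases each tag and replaces '/' with '_',
    so '+Der/PassL' becomes '<der_passl>'.
    """
    # Everything before the first '+' is treated as the lemma (an assumption).
    lemma, *tags = analysis.split("+")
    return lemma + "".join("<" + tag.lower().replace("/", "_") + ">" for tag in tags)


if __name__ == "__main__":
    print(to_apertium_tags("viessu+N+Sg+Nom"))
    print(to_apertium_tags("lohkat+V+Der/PassL"))
```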


==TODO==
Line 34: Line 39:


===Dashes===
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag <dash>), but perhaps we could add a tag there …
lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).


===Multiwords===

Latest revision as of 09:46, 15 April 2015

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming happens using the same HFST method as in the Turkic pairs etc. Compounds are not handled correctly by this method.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, thus a bit different from the one in Giellatekno.

TODO

Misc

  • add entries from bidix that are missing from the analyser
  • regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
  • regex for acronyms like "GsoC:as" (tokenisation dependent...)
  • 8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
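As a starting point for the URL item above, here is a rough Python sketch of a pattern that recognises bare domain names and URLs so that they could be passed through untranslated. The pattern and the name `looks_like_url` are assumptions for illustration only; in the analyser this would have to be expressed in the lexc/xfst regex formalism instead. The pattern is deliberately loose and will also accept any dot-joined pair of words, so it would need tightening (for example with a list of known top-level domains).

```python
import re

# Hypothetical pattern: optional scheme, dot-separated labels, a TLD, optional path.
URL_RE = re.compile(
    r"""(?:https?://)?      # optional scheme
        (?:[\w-]+\.)+       # one or more labels, each followed by a dot
        [a-z]{2,}           # top-level domain (at least two ASCII letters)
        (?:/\S*)?           # optional path
    """,
    re.IGNORECASE | re.VERBOSE,
)


def looks_like_url(token: str) -> bool:
    """Return True if the whole token matches the URL pattern."""
    return URL_RE.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["telefonkatalogen.no", "ođđa", "GsoC:as"]:
        print(token, looks_like_url(token))
```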

Typos

I've seen "odda" (for "ođđa") in many places; can we just add such forms to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)

Dashes

lexc handles dashes by adding them literally (like a lemma); doing that with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

  • dasa lassin => i tillegg (til det)
  • dán áigge => for tiden
  • mun ieš => meg selv
  • bures boahtin => velkommen
  • Buorre beaivi => God dag
  • leat guollebivddus => å fiske
  • maid ban dainna => hva i all verden
  • jagis jahkái => fra år til år
  • oaidnaleapmái => 'see you'
  • ovdamearkka => for eksempel
  • Mo manná? => Hvordan går det?
  • ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)
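To make the intended effect concrete, the following Python sketch groups known fixed expressions into single tokens using greedy longest-match lookup. It only illustrates what the analyser should end up doing with such entries; the real mechanism would be lexc entries, and the names `MWES` and `group_multiwords` are hypothetical. Only a few of the expressions listed above are included.

```python
# A few of the multiwords from the list above, as lowercased token tuples.
MWES = {
    ("dasa", "lassin"),
    ("dán", "áigge"),
    ("bures", "boahtin"),
    ("ja", "nu", "ain"),
}
MAX_LEN = max(len(mwe) for mwe in MWES)


def group_multiwords(tokens):
    """Join the longest known multiword starting at each position into one token."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MWES:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(group_multiwords("Bures boahtin dán áigge".split()))
```

Matching is case-insensitive but the original casing is preserved in the output, so sentence-initial "Bures boahtin" is still grouped.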

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

  • oktavuođas => i forbindelse med

It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.