Difference between revisions of "Northern Sámi and Norwegian/smemorf"

Revision as of 11:25, 13 June 2014

Documentation and TODOs for apertium-sme-nob's sme morphological analyser from Giellatekno.

Description

The sme morphological analyser is a trimmed and tag-changed version of the one in Giellatekno.

Trimming

Trimming uses the same HFST method as in the Turkic pairs, etc. This method does not handle compounds correctly.

Tagset changes

The tagset in apertium-sme-nob is closer to Apertium's tagset, and is thus a bit different from the one in Giellatekno.

We remove Usage tags (+Use/Sub, etc.); see the set Useless
We remove any derivational analyses that aren't yet handled by transfer/bidix
- More on derivations in sme-nob
We remove the - from split compound lemmas so that they may be looked up in bidix.
We remove the #-mark between those compounds that are lexicalised/non-dynamic (this should not be necessary any longer?)
We also ensure the +G3 tag occurs after the +N tag (the wrong order is a common upstream bug in the lexc files)
We change the format of tags, so +N becomes <N>, +Der/1 becomes <Der_1>, etc.
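The tag format change in the last item can be illustrated with a minimal Python sketch. This is only an illustration of the mapping, not the code the analyser build actually uses; the function name `to_apertium_tags` and the sample analysis string are assumptions made for the example, and it assumes that a `+` never occurs inside a lemma.

```python
import re

# Matches one Giellatekno-style tag: a plus sign followed by non-plus characters.
# Assumption: "+" only ever introduces a tag and never appears inside a lemma.
TAG = re.compile(r"\+([^+]+)")


def to_apertium_tags(analysis: str) -> str:
    """Rewrite each +Tag as <Tag>, replacing '/' inside a tag with '_'."""
    return TAG.sub(lambda m: "<" + m.group(1).replace("/", "_") + ">", analysis)


if __name__ == "__main__":
    # Illustrative input only; real analyses come from the HFST analyser.
    print(to_apertium_tags("viessu+N+Sg+Nom"))
    print(to_apertium_tags("+Der/1"))
```

The slash has to be replaced because tags such as +Der/1 are written as <Der_1> in the Apertium-style output described above.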

TODO

Misc

add entries from bidix that are missing from the analyser
regex for URLs (we don't want telefonkatalogen.no => *telefonkatalogen.nå)
regex for acronyms like "GsoC:as" (tokenisation dependent...)
8632: SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet")); -- this should be analysed as both, and disambiguated
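For the URL item in the list above, the kind of pattern meant could look roughly like the following Python sketch. The pattern and the helper name `is_url` are hypothetical; in the analyser itself this would have to be written as an xfst/lexc regular expression, and a pattern this loose will also accept some dotted abbreviations, so it is only a starting point.

```python
import re

# Hypothetical pattern: bare domain names such as "telefonkatalogen.no",
# optionally preceded by a scheme and followed by a path.
URL = re.compile(
    r"(?:https?://)?"   # optional scheme
    r"(?:[\w-]+\.)+"    # one or more dot-terminated labels
    r"[a-z]{2,}"        # top-level domain of at least two letters
    r"(?:/\S*)?",       # optional path
    re.IGNORECASE,
)


def is_url(token: str) -> bool:
    """Return True if the whole token looks like a URL or bare domain."""
    return URL.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["telefonkatalogen.no", "http://www.telefonkatalogen.no/sok", "ođđa"]:
        print(token, is_url(token))
```

A token recognised this way could be passed through untranslated instead of having its last part analysed as a word.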

Typos

I've seen "odda" in many places (for "ođđa"); can we just add these to the analyser? (It would be cool to have charlifter/Diacritic Restoration, but until then…)

a list of high-frequency typos where the correction has an analysis

Dashes

lexc handles dashes by adding them literally (like a lemma). Doing that with a bidix pardef would be very messy (also, doesn't it cause issues with lemma-matching in CG?).

Multiwords

Add simple multiwords and fixed expressions to the analyser.

dasa lassin => i tillegg (til det)
dán áigge => for tiden
mun ieš => meg selv
bures boahtin => velkommen
Buorre beaivi => God dag
leat guollebivddus => å fiske
maid ban dainna => hva i all verden
jagis jahkái => fra år til år
oaidnaleapmái => 'see you'
ovdamearkka => for eksempel
Mo manná? => Hvordan går det?
ja nu ain => og så videre

(Some of these MWEs might be very Apertium-specific, but in that case we just keep our own file and append the entries with update-lexc.sh.)

Also, oktavuohta.N.Sg.Loc turns into an MWE preposition in nob:

oktavuođas => i forbindelse med

It would be a lot simpler for transfer to just analyse such a fixed expression as oktavuođas.Po in the first place.

@@ Line 34: / Line 34: @@
 ===Dashes===
-lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?). Currently we remove the dashes in xfst2apertium.useless.twol (and in certain cases re-add them in transfer as the tag &lt;dash&gt;), but perhaps we could add a tag there …
+lexc handles dashes by adding them literally (like a lemma), doing that in with a bidix pardef would be very messy (also, doesn't it give issues with lemma-matching in CG?).
 ===Multiwords===
