Northern Sámi and Norwegian/release

This page holds information about the release schedule for apertium-sme-nob.

Issues

High priority bad translations

What are the high-priority linguistic issues to deal with?



  • DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
  • DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
    • TODO: fill out with more def-list entries
    • TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
  • DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)

DONE: Compounds in CG

Compounds used to mess up CG. Because of bidix lookup we have to keep all the tags of the non-heads, so we get e.g. politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>. If CG sees the Nom of the non-head, it applies rules that it shouldn't.

Fix: cg-proc now ignores anything up until the last baseform, so given politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>, the rules will only see stašuvdna<N><Sg><Ill> (we have a special CG feature to refer to the other sub-readings)
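
The effect of this fix can be illustrated with a minimal Python sketch. This is not the cg-proc implementation; the function names are made up for illustration, and the sketch assumes that sub-readings are joined by a "+" that directly follows a closing ">".

```python
import re


def split_subreadings(analysis):
    """Split a compound analysis into its sub-readings.

    Assumption: a '+' that directly follows '>' joins two sub-readings.
    """
    return re.split(r"(?<=>)\+", analysis)


def head_reading(analysis):
    """Return the part from the last baseform onwards (the compound head)."""
    return split_subreadings(analysis)[-1]


compound = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(head_reading(compound))  # stašuvdna<N><Sg><Ill>
```

The non-head parts are still available from split_subreadings, which loosely mirrors the idea that rules see only the head while the other sub-readings stay addressable.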

TODO: Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could add the stars with twol, but we have no way of removing them again before bidix.
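
The starring step can be sketched in Python as follows. This is only an illustration of the behaviour described above, not the lookup2cg code; the PoS tag set and the rule that derivational tags start with "Der" are assumptions.

```python
# Assumed tag inventory; the real lookup2cg script may use a different one.
POS_TAGS = {"N", "V", "A", "Adv"}


def star_pos_before_derivation(tags):
    """Add '*' to every PoS tag that is followed by a derivational tag.

    Assumption: derivational tags are exactly those starting with "Der".
    """
    der_positions = [i for i, tag in enumerate(tags) if tag.startswith("Der")]
    if not der_positions:
        return list(tags)
    last_der = der_positions[-1]
    return [
        tag + "*" if tag in POS_TAGS and i < last_der else tag
        for i, tag in enumerate(tags)
    ]


print(" ".join(star_pos_before_derivation("V Der N".split())))  # V* Der N
```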

  • Simple / clean solution: lexicalise.
  • Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

we have

<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>

where __tverb adds the TV tag, and all PoS-changing derivations use V* instead of V.

DONE: remove unhandled derivations

We remove any derivations that are not handled from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.

DONE: +G3 tag

This is used like the +Actor tag, for sme wsd (here based on "stadieveksling", i.e. consonant gradation).

We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor). This makes transfer a lot simpler.

Getting a list of lemmas which have +G3 is easy: grep all nouns from bidix, analyse their lemmas and grep for lemma+G3. Use this list to tag bidix.
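
The same procedure can be sketched in Python instead of grep. Everything here is an assumption for illustration: the function names, the use of "N" as the first left-side tag of noun entries in bidix, the tab-separated "lemma, analysis" lookup format, and the placeholder lemmas aaa and bbb. The analysis step itself (running the lemmas through the analyser) is left out.

```python
import xml.etree.ElementTree as ET


def noun_lemmas(bidix_xml):
    """Collect left-side lemmas whose first tag is N (assumed noun tag)."""
    root = ET.fromstring(bidix_xml)
    lemmas = []
    for left in root.iter("l"):
        tags = [s.get("n") for s in left.findall("s")]
        # left.text is the text before the first <s/>; lemmas containing
        # other elements (e.g. <b/>) would need extra handling.
        if tags and tags[0] == "N" and left.text:
            lemmas.append(left.text)
    return lemmas


def lemmas_with_g3(lookup_lines):
    """Keep lemmas whose analysis starts with lemma+G3.

    Assumed input: one 'lemma<TAB>analysis' pair per line.
    """
    found = set()
    for line in lookup_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].startswith(parts[0] + "+G3"):
            found.add(parts[0])
    return sorted(found)


bidix = (
    "<dictionary><section id='main' type='standard'>"
    "<e><p><l>aaa<s n='N'/></l><r>xxx<s n='n'/></r></p></e>"
    "<e><p><l>bbb<s n='V'/><s n='TV'/></l><r>yyy<s n='vblex'/></r></p></e>"
    "</section></dictionary>"
)
print(noun_lemmas(bidix))
print(lemmas_with_g3(["aaa\taaa+G3+N+Sg+Nom", "ccc\tccc+N+Sg+Nom"]))
```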

DONE: ensure we have all necessary postchunk rules

Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x into e.g. pre_pre_nom_conj_nom).

All possible SN and SA chunks should have the needed postchunk rules now.
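
One way to check this coverage is sketched below in Python. It is an illustration under assumptions, not a script from the repository: the function names are invented, the chunk names are read from interchunk output in the usual ^name<tags>{...}$ chunk format, the set of names that already have postchunk rules is passed in by hand, and the words inside the sample chunks are placeholders.

```python
import re

# Chunk name = text between '^' and the tags that precede the opening '{'.
CHUNK_NAME = re.compile(r"\^([^<{$^]+)(?:<[^>]*>)*\{")


def chunk_names(interchunk_output):
    """Return the set of chunk names seen in the interchunk output."""
    return set(CHUNK_NAME.findall(interchunk_output))


def missing_postchunk_rules(interchunk_output, covered_names):
    """List chunk names that have no postchunk rule yet."""
    return sorted(chunk_names(interchunk_output) - set(covered_names))


sample = (
    "^pre_pre_nom<SN>{^x<n>$}$ "
    "^pre_pre_nom_conj_nom<SN>{^x<n>$ ^y<cnjcoo>$ ^z<n>$}$"
)
print(missing_postchunk_rules(sample, {"pre_pre_nom"}))
```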

DONE: bidix pardef to handle CG changing Plc to Sur

sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.

(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)

TODO: Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
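
The corpus check can be sketched in Python like this. It is a minimal sketch, not an existing script: the function name is invented, the input is assumed to be sme-nob output with one sentence per line, and the marks are only looked for at the start of a whitespace-delimited token. The sample lines are placeholders.

```python
import re

# '#' marks a form the generator could not produce, '@' a lemma that
# failed bidix lookup; only token-initial marks are matched here.
ERROR_TOKEN = re.compile(r"(?<!\S)([#@])(\S+)")


def find_testvoc_errors(translated_lines):
    """Return (line number, mark, form) for every # or @ token."""
    errors = []
    for number, line in enumerate(translated_lines, start=1):
        for mark, form in ERROR_TOKEN.findall(line):
            errors.append((number, mark, form))
    return errors


output = ["aaa #bbb ccc", "ddd eee", "@fff ggg"]
for number, mark, form in find_testvoc_errors(output):
    print(number, mark, form)
```

Sorting the results by frequency of the form would show which missing entries affect the most text.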

The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:

Part of speech | Entries from bidix that are OK in nob.dix | In bidix but not in nob.dix | Comments | In sme analyser but not bidix
verbs | 2536 | 0 | :) | ???
nouns | 11758 | 0 | :) | ???
proper nouns | 28310 | 15159 | look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh | ???
adverbs | 235 | 0 | :) only one nob pardef, simple to add | ???
prepositions | 42 | 0 | :) | lots missing from bidix still
adjectives | 1056 | 0 | :) (not sure if all forms are covered though) | ???
abbreviations | ??? | 0 | :) | 0
sub-/conjunctions | 25 | 0 | :) | ???
pronouns | ??? | ??? | | 0
ShCmp | ??? | 0 | compound parts, removed from analyser | 0
Numerals | ??? | ??? | bidix should be OK, not 100% sure, still lots missing from generator! | ???


These generation/transfer errors need to be fixed:


There's a handy script, dev/gt-expand-to-bidix.sh, that takes as input one word and its PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

It uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.