Northern Sámi and Norwegian/release
This page holds information about the release schedule for apertium-sme-nob.
 High priority bad translations
What are the high-priority linguistic issues to deal with?
- TODO: Bidix will be extended with entries from GTSVN
- TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
- TODO: URL recognition (I tried writing some rules in lexc; they worked on their own, but took ages to compile into the regular lexc, not sure why)
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
- DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)
 DONE: Compounds in CG
Compounds messed up CG: we have to leave all the tags of the non-heads in because of bidix lookup, so we get e.g.
politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>
If CG sees Nom, it applies rules that it shouldn't, etc.
Fix: cg-proc now ignores anything up until the last baseform, so given
politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>
the rules will only see
stašuvdna<N><Sg><Ill>
(we have a special CG feature to refer to the other sub-readings).
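The effect of that fix on a single reading can be illustrated with a minimal Python sketch. This is not how cg-proc is implemented; the function name `last_subreading` is hypothetical, and the sketch assumes that sub-readings are joined with `+` and that `+` never occurs inside a lemma or tag.

```python
def last_subreading(reading: str) -> str:
    """Return the sub-reading after the last '+' joiner (the compound head)."""
    # Assumption: '+' only ever separates sub-readings.
    return reading.rsplit("+", 1)[-1]


compound = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(last_subreading(compound))  # the part the CG rules would see
```

A reading without any `+` is returned unchanged, so non-compounds are unaffected.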
 TODO: Derivations mess up CG
Mainly a problem with the PoS-changing derivations.
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
- Simple / clean solution: lexicalise.
- Boring / ugly solution: add stars with twol before any PoS-changing derivation tags; then instead of
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
we would have
<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>
where __tverb adds the TV tag, and all PoS-changing derivations use V* instead of V.
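The starring convention described for lookup2cg ("V Der N" becomes "V* Der N") can be sketched in Python as follows. This is only an illustration, not the lookup2cg code: the `POS_TAGS` set is an assumed, incomplete subset, and derivational tags are assumed to be exactly those starting with "Der".

```python
# Assumed, illustrative subset of PoS tags; the real tag set is larger.
POS_TAGS = {"N", "V", "A", "Adv"}


def star_pos_before_derivation(tags):
    """Star every PoS tag that appears before a derivational tag."""
    # Position of the last derivational tag (-1 if there is none).
    last_der = max(
        (i for i, tag in enumerate(tags) if tag.startswith("Der")),
        default=-1,
    )
    return [
        tag + "*" if tag in POS_TAGS and i < last_der else tag
        for i, tag in enumerate(tags)
    ]


print(" ".join(star_pos_before_derivation("V Der N".split())))
```

Only the PoS tags preceding a derivation get a star, so the final PoS (the one CG rules should match) stays unmarked.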
 DONE: remove unhandled derivations
Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol; this makes the lexicon a lot easier to handle for testvoc.
 DONE: +G3 tag
This is used like the +Actor tag, for sme WSD (here based on "stadieveksling", i.e. consonant gradation).
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler.
Getting a list of lemmas which have +G3 is easy: grep all nouns from bidix, analyse their lemmas and grep for lemma+G3. Use this list to tag bidix.
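The "analyse and grep" step could look roughly like the Python sketch below. The input format is an assumption (one `surface<TAB>lemma+Tag+Tag...` analysis per line), the function name `lemmas_with_g3` is hypothetical, and the sample words are placeholders rather than real lexicon entries.

```python
def lemmas_with_g3(lookup_lines):
    """Collect input lemmas that have an analysis of the same lemma tagged G3."""
    found = set()
    for line in lookup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        surface, analysis = fields[0], fields[1]
        lemma, *tags = analysis.split("+")
        # Accept G3 in any tag position (directly after the lemma or after N).
        if lemma == surface and "G3" in tags:
            found.add(surface)
    return sorted(found)


# Placeholder data, not real analyser output.
sample = [
    "foo\tfoo+N+G3+Sg+Nom",
    "bar\tbar+N+Sg+Nom",
    "baz\tbaz+G3+N+Sg+Nom",
]
print(lemmas_with_g3(sample))
```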
 DONE: ensure we have all necessary postchunk rules
Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x to e.g. pre_pre_nom_conj_nom).
All possible SN and SA chunks should have the needed postchunk rules now.
 DONE: bidix pardef to handle CG changing Plc to Sur
sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.
(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)
 TODO: Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
 Generation report
dev/generation-test -r corpus translates a corpus and gives a frequency-sorted list of errors (words marked with #, / or @).
Current status: http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/dev/generation-report.txt
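For illustration, a frequency-sorted error list of this kind can be produced with a few lines of Python. This is a sketch, not the dev/generation-test script itself; it assumes that a marked word is any whitespace-delimited token beginning with `#`, `@` or `/`, and the name `error_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Assumption: a marked word is a whitespace-delimited token starting with '#', '@' or '/'.
MARKED = re.compile(r"(?<!\S)[#@/]\S+")


def error_frequencies(translated_text):
    """Return (marked word, count) pairs, most frequent first."""
    return Counter(MARKED.findall(translated_text)).most_common()


for word, count in error_frequencies("#foo bar @baz #foo"):
    print(count, word)
```

Trailing punctuation is kept as part of the token here, so a real report would probably want to normalise tokens before counting.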
 Bidix inconsistency
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
|Part-of-Speech||entries from bidix that are OK in nob.dix||in bidix but not in nob.dix||comments||in sme analyser but not bidix|
|proper nouns||28310||15159||look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh||???|
|adverbs||235||0||:) only one nob pardef, simple to add||???|
|prepositions||42||0||:)||lots missing from bidix still|
|adjectives||1056||0||:) (not sure if all forms are covered though)||???|
|ShCmp||???||0||compound parts, removed from analyser||0|
|Numerals||???||???||bidix should be OK, not 100% sure, still lots missing from generator!||???|
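The first step of such a consistency check, pulling the right-hand-side lemma and tags out of bidix entries, might be sketched in Python as below. This is not the dev/sme-nob.inconsistency.sh script; the function name `rhs_entries` and the `<section>` wrapper in the sample are assumptions, and multiword entries containing `<b/>` are not handled (only the text before the first child element is read).

```python
import xml.etree.ElementTree as ET


def rhs_entries(bidix_xml):
    """Return each <r> side as 'lemma<tag1><tag2>...'."""
    root = ET.fromstring(bidix_xml)
    entries = []
    for r in root.iter("r"):
        lemma = r.text or ""  # text before the first <s/> element
        tags = "".join(f"<{s.get('n')}>" for s in r.findall("s"))
        entries.append(lemma + tags)
    return entries


sample = (
    "<section>"
    '<e><p><l>geavahit<s n="V"/><s n="TV"/></l>'
    '<r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>'
    "</section>"
)
print(rhs_entries(sample))
```

The resulting strings could then be extended with the tags of the desired base form and fed to the nob generator to see which ones fail.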
 Expanding the morphology
TODO: hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology; this might be usable for testvoc.
dev/gt-expand-to-bidix.sh takes as input one word and PoS (tab-separated) per line, e.g.
galbmit	V
beana	N
and uses the Giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.