Northern Sámi and Norwegian/release
This page holds information about the release schedule for apertium-sme-nob.
Issues
High priority bad translations
What are the high-priority linguistic issues to deal with?
- TODO: Bidix will be extended with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have); see also these files, whose entries are not yet in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
- TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
- TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
- DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)
DONE: Compounds in CG
Compounds messed up CG: we have to leave all the tags of the non-heads in because of bidix lookup, so we get e.g. politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>. If CG sees Nom, it applies rules that it shouldn't, etc.
Fix: cg-proc now ignores anything up until the last baseform, so given politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>, the rules will only see stašuvdna<N><Sg><Ill> (we have a special CG feature to refer to the other sub-readings).
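To make the effect of the fix concrete, here is a minimal Python sketch that splits an analysis into sub-readings and keeps only the last one. It is not the cg-proc implementation; the function names and the assumption that sub-readings are joined by "+" with tags in angle brackets are taken only from the example above.

```python
import re

# Assumption: sub-readings are joined by "+" and each one is a baseform
# followed by one or more <tag> groups, as in the example on this page.
SUBREADING = re.compile(r"([^<+]+)((?:<[^>]+>)+)")

def split_subreadings(analysis):
    """Return a list of (baseform, [tags]) pairs, in surface order."""
    return [
        (m.group(1), re.findall(r"<([^>]+)>", m.group(2)))
        for m in SUBREADING.finditer(analysis)
    ]

def head_reading(analysis):
    """Return only the last sub-reading, i.e. what the CG rules would see."""
    return split_subreadings(analysis)[-1]

analysis = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(head_reading(analysis))
```

With this view of the reading, the Nom tag of the non-head is no longer visible to rules that match on the head.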
TODO: Derivations mess up CG
Mainly a problem with the PoS-changing derivations.
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
- Simple / clean solution: lexicalise.
- Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
we have
<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>
where __tverb adds the TV tag, and all PoS-changing derivations use V* instead of V.
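The starring step described for lookup2cg can be sketched in Python as follows. This is only an illustration, not the lookup2cg code: it assumes whitespace-separated tags, that derivation tags start with "Der", and a small, incomplete POS_TAGS set chosen for the example.

```python
# Assumptions: tags are whitespace-separated, derivation tags start with
# "Der", and POS_TAGS is an illustrative (incomplete) set of PoS tags.
POS_TAGS = {"N", "V", "A", "Adv"}

def star_derived_pos(reading):
    """Add "*" to every PoS tag that appears before a derivation tag."""
    tags = reading.split()
    # Index of the last derivation tag, or -1 if the reading has none.
    last_der = max(
        (i for i, t in enumerate(tags) if t.startswith("Der")), default=-1
    )
    return " ".join(
        t + "*" if t in POS_TAGS and i < last_der else t
        for i, t in enumerate(tags)
    )

print(star_derived_pos("V Der N"))   # V* Der N
print(star_derived_pos("V TV Inf"))  # V TV Inf
```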
DONE: remove unhandled derivations
Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.
DONE: +G3 tag
This is used like the +Actor tag, for sme wsd (here based on "stadieveksling", i.e. consonant gradation).
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor). This makes transfer a lot simpler.
Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3); use this list to tag bidix.
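The grep-based procedure could also be sketched in Python. The snippet below is an assumption-laden illustration: the noun entries use placeholder lemmas (lemma1, lemma2), the analyses are made-up strings in a lemma+Tag+Tag format rather than real analyser output, and running the analyser itself is left out.

```python
import xml.etree.ElementTree as ET

def noun_lemmas(bidix_xml):
    """Collect left-side lemmas of bidix entries whose first tag is N."""
    lemmas = []
    for left in ET.fromstring(bidix_xml).iter("l"):
        tags = [s.get("n") for s in left.findall("s")]
        if tags[:1] == ["N"]:
            # Only the text before the first tag is used; multiword
            # lemmas are not handled in this sketch.
            lemmas.append((left.text or "").strip())
    return lemmas

def g3_lemmas(analyses):
    """From (lemma, analysis) pairs, keep lemmas analysed with a G3 tag."""
    return sorted({
        lemma for lemma, analysis in analyses
        if analysis.startswith(lemma + "+") and "G3" in analysis.split("+")[1:]
    })

# Placeholder bidix fragment; only the verb entry comes from this page.
BIDIX = """<section>
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
<e><p><l>lemma1<s n="N"/></l><r>x<s n="n"/></r></p></e>
<e><p><l>lemma2<s n="N"/></l><r>y<s n="n"/></r></p></e>
</section>"""

# Made-up analyses standing in for analyser output.
ANALYSES = [("lemma1", "lemma1+N+G3+Sg+Nom"), ("lemma2", "lemma2+N+Sg+Nom")]

print(noun_lemmas(BIDIX))
print(g3_lemmas(ANALYSES))
```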
DONE: ensure we have all necessary postchunk rules
Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x into e.g. pre_pre_nom_conj_nom).
All possible SN and SA chunks should have the needed postchunk rules now.
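A coverage check of this kind amounts to a set difference between the chunk names that can be produced and the names the postchunk rules match. The Python sketch below assumes the names have already been collected from the transfer files; the lists are illustrative, and only pre_pre_nom and pre_pre_nom_conj_nom come from this page.

```python
def missing_postchunk_rules(chunk_names, postchunk_names):
    """Return chunk names that no postchunk rule matches, sorted."""
    return sorted(set(chunk_names) - set(postchunk_names))

# Illustrative inputs: names created in t1x, names merged in t2x, and
# names that already have a postchunk rule.
t1x_chunks = ["pre_pre_nom", "nom"]
t2x_merged = ["pre_pre_nom_conj_nom"]
postchunk = ["pre_pre_nom", "nom"]

print(missing_postchunk_rules(t1x_chunks + t2x_merged, postchunk))
```

An empty result would mean that every chunk name in the input has a matching postchunk rule.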
DONE: bidix pardef to handle CG changing Plc to Sur
sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.
(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)
TODO: Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
Generation report
The script dev/generation-test -r corpus translates a corpus and gives a frequency-sorted list of errors (words marked with #, / or @). Current status: http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/dev/generation-report.txt
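The counting part of such a report can be sketched in a few lines of Python. This is not the dev/generation-test script: it assumes the translated text is already available as a string, that error words start with # or @ or contain /, and the sample sentence is made up.

```python
from collections import Counter

def error_report(translated_text):
    """Count tokens marked with #, @ or / and sort by descending frequency."""
    counts = Counter(
        tok for tok in translated_text.split()
        if tok.startswith(("#", "@")) or "/" in tok
    )
    return counts.most_common()

# Made-up translator output, for illustration only.
sample = "jeg #bruke den @stašuvdna og #bruke den"
print(error_report(sample))
```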
Bidix inconsistency
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
Part-of-Speech | entries from bidix that are OK in nob.dix | in bidix but not in nob.dix | comments | in sme analyser but not bidix |
---|---|---|---|---|
verbs | 2536 | 0 | :) | ??? |
nouns | 11758 | 0 | :) | ??? |
proper nouns | 28310 | 15159 | look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh | ??? |
adverbs | 235 | 0 | :) only one nob pardef, simple to add | ??? |
prepositions | 42 | 0 | :) | lots missing from bidix still |
adjectives | 1056 | 0 | :) (not sure if all forms are covered though) | ??? |
abbreviations | ??? | 0 | :) | 0 |
sub-/conjunctions | 25 | 0 | :) | ??? |
pronouns | ??? | ??? | 0 | |
ShCmp | ??? | 0 | compound parts, removed from analyser | 0 |
Numerals | ??? | ??? | bidix should be OK, not 100% sure, still lots missing from generator! | ??? |
Expanding the morphology
Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be possible to use for testvoc (TODO).
There's a handy script, dev/gt-expand-to-bidix.sh, that takes as input one word and PoS (tab-separated) per line, e.g. "galbmit V" and "beana N" on separate lines. It uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.