Northern Sámi and Norwegian/release
This page holds information about the release schedule for apertium-sme-nob.
Issues
High priority bad translations
What are the high-priority linguistic issues to deal with?
Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, more general way to translate the progressive? Should we use some of Francis' automatically discovered lex.sel rules? And are there any "simple" constructions that we could handle but don't yet?
- TODO: Add a post-generator that inserts epenthetic segments in compounds, turning e.g. "ing~b" into "ingsb"
- TODO: Expand the bidix with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have)
- TODO: After bidix additions, Francis will run apertium-lex-learner
- TODO: Add forms to nob.dix (that are in the bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; add MWEs and other missing entries manually
- TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- DONE: In chunker/t1x, add tags to certain verbs (e.g. using def-lists) that can be used in t3x to change or remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
DONE: remove unhandled derivations
Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.
DONE: +G3 tag
This tag is used like the +Actor tag, for sme word-sense disambiguation (here based on consonant gradation, "stadieveksling").
We move it after the +N tag in dev/xfst2apertium.hashtags.twol, so that it gets the same position as +Actor; this makes transfer a lot simpler.
Getting a list of lemmas that have +G3 is easy: grep all nouns from the bidix, analyse their lemmas, and grep for lemma+G3. Use this list to tag the bidix.
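The lemma-listing step can be sketched in Python instead of grep. This is only an illustration under stated assumptions: the analyser output is assumed to be tab-separated lines of the form `surface<TAB>lemma+Tag+Tag...`, the function name `lemmas_with_g3` is hypothetical, and the sample lemmas (`aaa`, `bbb`, ...) are placeholders, not real sme words.

```python
def lemmas_with_g3(analysis_lines):
    """Return the lemmas whose own analysis carries the G3 tag.

    Assumption: each line is 'surface<TAB>lemma+Tag+Tag...'.
    Only analyses where the lemma equals the analysed form are kept,
    mirroring "analyse their lemmas and grep for lemma+G3".
    """
    found = set()
    for line in analysis_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        surface, analysis = fields[0], fields[1]
        lemma, _, tag_string = analysis.partition("+")
        tags = tag_string.split("+")
        if lemma == surface and "N" in tags and "G3" in tags:
            found.add(lemma)
    return found


# Placeholder analyses (hypothetical, not real analyser output).
sample = [
    "aaa\taaa+N+G3+Sg+Nom",
    "aaa\taaa+N+Sg+Nom",
    "bbb\tbbb+N+Sg+Nom",
    "ccc\tddd+N+G3+Sg+Gen",  # a form of another lemma, ignored
]
g3_lemmas = lemmas_with_g3(sample)
```

The resulting set could then be matched against the noun entries in the bidix to decide which ones get the tag.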
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
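As a rough sketch of the corpus check, the following Python counts tokens in translated output that start with `#` or `@`. It assumes the output has already been produced by the sme-nob pipeline and is available as plain text lines; the function name and the sample sentences are made up for illustration. It only catches markers at the start of a whitespace-delimited token, so a marker directly after punctuation would be missed.

```python
import re
from collections import Counter

# Assumption: problem words show up as whitespace-delimited tokens
# that begin with '#' or '@', e.g. "#hus" or "@viessu".
MARKED_TOKEN = re.compile(r"(?<!\S)([#@])(\S+)")


def find_marked_tokens(lines):
    """Count (marker, word) pairs found in translated output lines."""
    counts = Counter()
    for line in lines:
        for marker, word in MARKED_TOKEN.findall(line):
            counts[(marker, word)] += 1
    return counts


# Made-up output lines, only to show the shape of the result.
sample = ["Dette er et #hus .", "Han @viessu ser #hus"]
counts = find_marked_tokens(sample)
```

Sorting the counter by frequency (`counts.most_common()`) gives a natural priority list: the most frequent marked words are the ones to fix first.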
Postchunk rules are needed for any chunk containing a determiner, pronoun, adjective, noun or verb. We can easily make sure that each possible chunk name has a postchunk rule: new chunks are created in t1x with names like det_adj_nom, but may also be merged in t2x into e.g. det_adj_nom_conj_nom.
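The coverage check for chunk names could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the XML snippets are heavily trimmed stand-ins for the real rule files, merged names from t2x are given as a hand-written set rather than extracted, and a `cat-item` with a matching `name` in the postchunk file is treated as a proxy for "has a postchunk rule". Chunks whose name is computed at runtime (e.g. via `namefrom`) are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the chunker (t1x) rules.
CHUNKER_XML = """<transfer><section-rules><rule><action><out>
<chunk name="det_adj_nom" case="caseFirstWord"/>
<chunk name="det_nom" case="caseFirstWord"/>
</out></action></rule></section-rules></transfer>"""

# Hypothetical, trimmed stand-in for the postchunk rules.
POSTCHUNK_XML = """<postchunk><section-def-cats>
<def-cat n="det_nom"><cat-item name="det_nom"/></def-cat>
</section-def-cats></postchunk>"""

# Assumption: names produced by merging in t2x, listed by hand.
MERGED_NAMES = {"det_adj_nom_conj_nom"}


def names_of(xml_text, element):
    """Collect the 'name' attribute of every given element in the XML."""
    root = ET.fromstring(xml_text)
    return {e.get("name") for e in root.iter(element) if e.get("name")}


created = names_of(CHUNKER_XML, "chunk") | MERGED_NAMES
handled = names_of(POSTCHUNK_XML, "cat-item")

# Chunk names that no postchunk category matches.
missing = sorted(created - handled)
```

With the stand-in data above, `missing` lists the two chunk names that still lack a postchunk category.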
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various right-hand-side entries in the bidix. Statistics so far:
Part of speech | In bidix and OK in nob.dix | In bidix but not in nob.dix | Comments | In sme analyser but not in bidix |
---|---|---|---|---|
verbs | 14017 | 486 | mostly mwe's missing | ??? |
nouns | 7466 | 7037 | mostly compounds missing | ??? |
proper nouns | 1885 | 15474 | | ??? |
adverbs | 96 | 143 | | ??? |
prepositions | 42 | 0 | :-) | ??? |
adjectives | 463 | 593 | | ??? |
abbreviations | ??? | ??? | see dev/abbr.todo.dix | 619 |
Schedule
Task | Date |
---|---|
Work on high priority bad translations, expand bidix coverage | until 2010-07-14 |
Remove unhandled derivations, ensure we have all postchunk rules | 2010-07-14…2010-07-18 |
Testvoc | 2010-07-18…2010-08-01 |
Tentative release date for apertium-sme-nob 0.1.0 | 2010-08-01 |
Update: Lene and Trond have more free time after August. Real release at the beginning of September?