Difference between revisions of "Northern Sámi and Norwegian/release"
Revision as of 14:17, 23 July 2010
This page holds information about the release schedule for apertium-sme-nob.
Issues
High priority bad translations
What are the high-priority linguistic issues to deal with?
Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, general, way to translate the progressive? Should we get some of Francis' automatically discovered lex.sel rules? And are there any "simple" constructions that we could handle but don't yet?
- TODO: We should add a post-generator to insert epenthetics in compounds, turning eg. "ing~b" into "ingsb"
- TODO: Bidix will be expanded with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need their gender checked, and words with spaces shouldn't have them)
- TODO: After bidix additions, Francis will run apertium-lex-learner
- TODO: add forms to nob.dix (that are in bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; mwe's and other missing stuff manually
- TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
- TODO: Compounds mess up CG. We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg.
politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>
(instead of politiija#stašuvdna<N><Sg><Ill> like the lookup2cg script gives). CG sees Nom and thus applies rules that it shouldn't, etc.
  - The ideal solution, but currently impossible: rename the non-head tags in BEFORE-/AFTER-SECTIONS. I tried
    BEFORE-SECTIONS SUBSTITUTE (N Sg Nom Cmp) (N* Sg* Nom* Cmp)
    which works for the above, but CG then also turns duohta<A><Sg><Nom><Cmp>+dilli<N><Sg><Acc> into duohta<A><@→N>+dilli<N*><Sg*><Nom*><Cmp><Sg><Gen>$, while if there are several non-heads, it'll only substitute in the first part. It seems some sort of tag order mechanism would be needed in CG for the BEFORE-SECTIONS/AFTER-SECTIONS stashing method to work.
  - It's possible to do the initial renaming in a twol rule (committed, but commented out for now), but we have no way of changing the tags back in CG (an AFTER-SECTIONS rule SUBSTITUTE N* N will unfortunately merge N* and later occurrences of N).
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
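To make the two compound representations from the CG item above concrete, here is a minimal Python sketch that turns a reading with fully tagged non-heads into the collapsed form with `#` between the parts. The function name `collapse_compound` is hypothetical. This is not the lookup2cg script, and it does not solve the problem described above, since it throws away exactly the non-head tags that bidix lookup needs; it only shows what CG would ideally get to see.

```python
import re

# One compound part: a lemma (no "<" or "+") followed by zero or more <tags>.
PART = re.compile(r"([^<+]+)((?:<[^>]*>)*)")


def collapse_compound(reading):
    """Drop the tags of all non-head parts and join the lemmas with '#'.

    Only the tags of the last part (the head) are kept.
    """
    parts = PART.findall(reading)
    if not parts:
        return reading
    lemmas = [lemma for lemma, _tags in parts]
    head_tags = parts[-1][1]
    return "#".join(lemmas) + head_tags


print(collapse_compound("politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"))
```

With the collapsed form, a CG rule looking for Nom would no longer match the non-head, which is the behaviour the SUBSTITUTE experiments above were trying to get while still keeping the tags available.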
DONE: remove unhandled derivations
Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol; this makes the lexicon a lot easier to handle for testvoc.
DONE: +G3 tag
This is used like the +Actor tag, for sme wsd (here based on "stadieveksling").
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler.
Getting a list of lemmas which have +G3 is easy: grep all nouns from bidix, analyse their lemmas and grep for lemma+G3. Use this list to tag bidix.
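The "analyse and grep" step can be sketched in Python roughly as follows. The sketch assumes the analyser output is available as tab-separated lines of the form `input<TAB>lemma+Tag+Tag...`; that layout, the function name `lemmas_with_g3` and the placeholder lemmas in the usage lines are assumptions for illustration, not taken from the actual sme analyser.

```python
def lemmas_with_g3(analysis_lines):
    """Return the lemmas that have at least one analysis carrying the G3 tag.

    Each line is assumed to look like "input<TAB>lemma+Tag+Tag...".
    Only analyses whose lemma equals the input form are counted, so that
    an inflected reading of some other lemma does not slip in.
    """
    found = set()
    for line in analysis_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank line or no analysis
        surface, analysis = fields[0], fields[1]
        lemma, *tags = analysis.split("+")
        if lemma == surface and "G3" in tags:
            found.add(lemma)
    return found


# Placeholder input, not real analyser output:
sample = [
    "aaa\taaa+N+G3+Sg+Nom",
    "aaa\taaa+N+Sg+Nom",
    "bbb\tbbb+N+Sg+Nom",
]
print(sorted(lemmas_with_g3(sample)))
```

The resulting set can then be matched against the noun entries in bidix to decide which ones get the tag.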
DONE: ensure we have all necessary postchunk rules
Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x into eg. pre_pre_nom_conj_nom).
All possible SN and SA chunks should have the needed postchunk rules now.
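The coverage check itself is just a set difference between the chunk names that transfer can produce and the names that postchunk rules match. A minimal Python sketch, assuming both lists of names have already been extracted from the transfer files (the extraction is not shown, and the function name `missing_postchunk_rules` is hypothetical):

```python
def missing_postchunk_rules(chunk_names, postchunk_names):
    """Return, sorted, every chunk name that no postchunk rule covers."""
    return sorted(set(chunk_names) - set(postchunk_names))


# Chunk names as they might come out of t1x/t2x, and the names that
# postchunk rules currently match (both lists are illustrative only).
produced = ["pre_pre_nom", "pre_pre_nom_conj_nom"]
handled = ["pre_pre_nom"]
print(missing_postchunk_rules(produced, handled))
```

An empty result means every chunk name in the first list has a corresponding postchunk rule.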
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
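The corpus check can be sketched in Python as below: run the corpus through sme-nob, then count every output token that starts with `#` or `@`. The function name `find_marked_tokens` and the sample lines are made up for illustration, and the sketch assumes the marks are attached directly to the front of a whitespace-separated token.

```python
import re
from collections import Counter

# A token that starts with '#' or '@' at the beginning of a line or
# after whitespace; "a@b" in the middle of a token is not matched.
MARKED = re.compile(r"(?<!\S)([#@])(\S+)")


def find_marked_tokens(lines):
    """Count tokens in translated output that begin with '#' or '@'."""
    counts = Counter()
    for line in lines:
        for marker, word in MARKED.findall(line):
            counts[marker + word] += 1
    return counts


# Invented output lines, not real sme-nob translations:
output = ["xx #yy zz @ww", "nothing marked here", "#yy again"]
for token, n in find_marked_tokens(output).most_common():
    print(n, token)
```

Sorting by frequency puts the marked forms that occur most often in the corpus at the top of the list, which is a reasonable order to fix them in.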
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
Part-of-Speech | entries from bidix that are OK in nob.dix | in bidix but not in nob.dix | comments | in sme analyser but not bidix |
---|---|---|---|---|
verbs | 14017 | 486 | mostly mwe's missing | ??? |
nouns | 3776 | 8147 | mostly compounds missing | ??? |
proper nouns | 17412 | 0 | :-) easy to find nob pardefs automatically | ??? |
adverbs | 96 | 143 | | ??? |
prepositions | 42 | 0 | :-) | ??? |
adjectives | 463 | 593 | | ??? |
abbreviations | ??? | ??? | see dev/abbr.todo.dix | 619 |
Schedule
Task | Date |
---|---|
Work on high priority bad translations, expand bidix coverage | until 2010-07-14 |
Remove unhandled derivations, ensure we have all postchunk rules | 2010-07-14…2010-07-18 |
Testvoc | 2010-07-18…2010-08-01 |
Tentative release date for apertium-sme-nob 0.1.0 | August 1st 2010 |
Update: Lene and Trond have more free time after August, so the real release may be at the beginning of September?