Difference between revisions of "Northern Sámi and Norwegian/release"
Revision as of 10:08, 27 July 2010
This page holds information about the release schedule for apertium-sme-nob.
Issues
High priority bad translations
What are the high-priority linguistic issues to deal with?
Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, general, way to translate the progressive? Should we get some of Francis' automatically discovered lex.sel rules? And are there any "simple" constructions that we could handle but don't yet?
- TODO: We should add a post-generator to insert epenthetics in compounds, turning eg. "ing~b" into "ingsb"
- TODO: Extend bidix with entries from GTSVN
  - see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have)
- TODO: After bidix additions, Francis will run apertium-lex-learner
- TODO: add forms to nob.dix (that are in bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; mwe's and other missing stuff manually
- TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
  - TODO: fill out with more def-list entries
  - TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
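The epenthetics TODO above could be prototyped outside the pipeline before writing a real post-generator. The following is only a minimal sketch: it assumes "~" marks the compound boundary (as in the "ing~b" example) and knows a single rule, "ing~" becomes "ings". The function name and the rule table are hypothetical, and a real rule set would need more cases.

```python
import re

# Assumption: "~" marks a compound boundary left in the generator output.
# The only rule sketched here is the one from the example: "ing~b" -> "ingsb".
EPENTHESIS_RULES = [
    (re.compile(r"ing~"), "ings"),
]


def insert_epenthetics(text):
    """Apply each boundary rule, then drop boundary marks no rule matched."""
    for pattern, replacement in EPENTHESIS_RULES:
        text = pattern.sub(replacement, text)
    return text.replace("~", "")


print(insert_epenthetics("ing~b"))
```

Boundaries that no rule matches are simply joined without an epenthetic, which may or may not be the right default.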
TODO [1/2]: Multiple identical tags per reading in CG
- DONE: Compounds mess up CG. We have to leave all the tags of the non-heads in because of bidix lookup, so we get e.g. politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill> (instead of politiija#stašuvdna<N><Sg><Ill>, as the lookup2cg script gives). CG sees Nom and thus applies rules that it shouldn't, etc.
  - The ideal solution, but currently impossible: rename the non-head tags in BEFORE-/AFTER-SECTIONS. I tried BEFORE-SECTIONS SUBSTITUTE (N Sg Nom Cmp) (N* Sg* Nom* Cmp), which works for the above, but CG then also turns duohta<A><Sg><Nom><Cmp>+dilli<N><Sg><Acc> into duohta<A><@→N>+dilli<N*><Sg*><Nom*><Cmp><Sg><Gen>$, while if there are several non-heads, it'll only substitute in the first part. It seems some sort of tag order mechanism would be needed in CG for the BEFORE-SECTIONS/AFTER-SECTIONS stashing method to work.
  - It's possible to do the initial renaming in a twol rule (committed, but commented out for now), but we have no way of changing the tags back in CG (an AFTER-SECTIONS rule SUBSTITUTE N* N will unfortunately merge N* and later occurrences of N).
  - Fix: cg-proc now ignores anything up until the last baseform, so given politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>, the rules will only see stašuvdna<N><Sg><Ill> (later we may have CG features to refer to the other sub-readings).
- TODO: Derivations mess up CG. In lookup2cg, PoS tags are given stars if they appear before derivational tags. We could do this with twol, but again have no way of removing them before bidix. Also, the CG sub-reading features won't help here since we can't ignore _all_ tags up until the last derivation; say we have "lemma" V TV Der/n N Sg Ind, lookup2cg gives "lemma" V* TV Der/n N Sg Ind (leaving the TV tag intact).
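To make the two behaviours in this section concrete, here is a rough Python illustration. It is not the cg-proc or lookup2cg code. It assumes sub-readings are joined with "+", that "+" never occurs inside a tag, and that derivational tags start with "Der/". The POS_TAGS set and both function names are made up for the sketch.

```python
import re

TAG = re.compile(r"<([^<>]+)>")

# Assumption: a hand-picked set of PoS tags; the real tag set is larger.
POS_TAGS = {"N", "V", "A", "Adv"}


def head_reading(analysis):
    """Keep only what follows the last '+', i.e. the last baseform and its tags."""
    return analysis.rsplit("+", 1)[-1]


def tags_of(reading):
    """List the tag names of a reading such as 'stašuvdna<N><Sg><Ill>'."""
    return TAG.findall(reading)


def star_pos_before_derivation(tag_list):
    """Star PoS tags that precede the last Der/ tag; leave all other tags alone."""
    der_positions = [i for i, t in enumerate(tag_list) if t.startswith("Der/")]
    last_der = der_positions[-1] if der_positions else -1
    return [
        t + "*" if i < last_der and t in POS_TAGS else t
        for i, t in enumerate(tag_list)
    ]


compound = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(tags_of(head_reading(compound)))
print(star_pos_before_derivation(["V", "TV", "Der/n", "N", "Sg", "Ind"]))
```

The first call shows why the compound fix works: Nom is no longer visible to the rules. The second shows why the same trick fails for derivations: TV sits before Der/n but has to survive unstarred.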
DONE: remove unhandled derivations
Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.
DONE: +G3 tag
This is used like the +Actor tag, for sme WSD (here based on "stadieveksling", i.e. consonant gradation).
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor), which makes transfer a lot simpler.
Getting a list of lemmas which have +G3 is easy: grep all nouns from bidix, analyse their lemmas and grep for lemma+G3. This list is then used to tag bidix.
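The "grep for lemma+G3" step could look roughly like this in Python. This is a sketch under assumptions: the analyser output is taken to be tab-separated lines of the form input, analysis (extra columns ignored), with +G3 directly after the lemma in the raw analysis. The function name and the placeholder lemmas are hypothetical.

```python
def lemmas_with_g3(lookup_lines):
    """Return the sorted lemmas that have at least one analysis 'lemma+G3...'."""
    found = set()
    for line in lookup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank separator lines between lookups
        lemma, analysis = fields[0], fields[1]
        if analysis.startswith(lemma + "+G3"):
            found.add(lemma)
    return sorted(found)


# Placeholder input, not real analyser output.
sample = [
    "lemmaA\tlemmaA+G3+N+Sg+Nom",
    "lemmaA\tlemmaA+N+Sg+Nom",
    "",
    "lemmaB\tlemmaB+N+Sg+Nom",
]
print(lemmas_with_g3(sample))
```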
DONE: ensure we have all necessary postchunk rules
Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x to e.g. pre_pre_nom_conj_nom).
All possible SN and SA chunks should have the needed postchunk rules now.
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
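The corpus check described above can be as simple as counting marked tokens in the translated output. A minimal sketch, assuming the marks appear at the start of whitespace-separated tokens (the function name is hypothetical, and the example sentences are made up):

```python
import re
from collections import Counter

# A token counts as marked when "#" or "@" starts it (after whitespace or at line start).
MARKED = re.compile(r"(?<!\S)([#@])(\S+)")


def find_testvoc_errors(lines):
    """Count every #- or @-marked token in translated output lines."""
    counts = Counter()
    for line in lines:
        for mark, rest in MARKED.findall(line):
            word = rest.rstrip(".,;:!?")  # ignore trailing punctuation
            if word:
                counts[mark + word] += 1
    return counts


output = ["jeg #se en @beana,", "alt ok", "#se"]
for token, n in find_testvoc_errors(output).most_common():
    print(n, token)
```

Sorting by frequency puts the most common failures first, which is usually the order in which they are worth fixing. An "@" inside a token (such as an e-mail address) is not counted, since only token-initial marks match.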
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
Part-of-Speech | entries from bidix that are OK in nob.dix | in bidix but not in nob.dix | comments | in sme analyser but not bidix |
---|---|---|---|---|
verbs | 14017 | 486 | mostly mwe's missing | ??? |
nouns | 11635 | 238 | some odd ones left | ??? |
proper nouns | 17412 | 0 | :-) easy to find nob pardefs automatically | ??? |
adverbs | 96 | 143 | | ??? |
prepositions | 42 | 0 | :-) | ??? |
adjectives | 463 | 593 | | ??? |
abbreviations | ??? | ??? | see dev/abbr.todo.dix | 619 |
Schedule
Task | Date |
---|---|
Work on high priority bad translations, expand bidix coverage | until 2010-07-14 |
Remove unhandled derivations, ensure we have all postchunk rules | 2010-07-14…2010-07-18 |
Testvoc | 2010-07-18…2010-08-01 |
Tentative release date for apertium-sme-nob 0.1.0 | August 1st 2010 |
Update: Lene and Trond have more free time after August; real release at the beginning of September?