Difference between revisions of "Northern Sámi and Norwegian/release"
Revision as of 16:04, 13 July 2010
This page holds information about the release schedule for apertium-sme-nob.
Issues
High priority bad translations
What are the high-priority linguistic issues to deal with?
Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, general, way to translate the progressive? Should we get some of Francis' automatically discovered lex.sel rules? And are there any "simple" constructions that we could handle but don't yet?
- DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
- TODO: We should add a post-generator to insert epenthetics in compounds, turning e.g. "ing~b" into "ingsb"
- TODO: In chunker/t1x, add tags to certain verbs (e.g. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: Bidix will be extended with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have)
- TODO: After bidix additions, Francis will run apertium-lex-learner
- TODO: add forms to nob.dix (that are in bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; add MWEs and other missing stuff manually
- TODO: URL recognition (I tried writing some rules in lexc; they worked on their own, but took ages to compile into the regular lexc, not sure why)
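The epenthetic post-generation step from the TODO list could be prototyped roughly as follows. This is only a sketch in Python, not the planned post-generator: it assumes that "~" marks the compound boundary in the generator output, it covers only the "ing~b" → "ingsb" case mentioned above, and the function name `insert_epenthetics` and the rule table are made up for illustration.

```python
import re

# Assumption: "~" marks a compound boundary left by the generator.
# Only the "-ing" + "s" case from the TODO item is covered here.
EPENTHETIC_RULES = [
    (re.compile(r"ing~(?=\w)"), "ings"),
]

def insert_epenthetics(text: str) -> str:
    """Apply each boundary rule, then drop any remaining "~" markers.

    Dropping leftover markers is an assumption of this sketch, not
    something the TODO item specifies.
    """
    for pattern, replacement in EPENTHETIC_RULES:
        text = pattern.sub(replacement, text)
    return text.replace("~", "")

print(insert_epenthetics("ing~b"))  # ingsb
```

Further boundary rules would go into `EPENTHETIC_RULES` as extra (pattern, replacement) pairs.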
Derivations:
Any derivations that are not handled should be removed from the analyser. Maybe we could have a "negation" twol rule like
? /<= UnhandledDerivations _ ; ! fail if analysis contains a tag from the set UnhandledDerivations
If this works, we could probably also write a rule like
? /<= AnyDerivationtag+ PoStag+ AnyDerivationtag+ _ ;
to remove any derivations of derivations, since these are not handled either unless there are explicit transfer rules for them. We should remove any unhandled derivations before testvoc. Northern Sámi and Norwegian/Derivations#Summary of fallbacks contains the list of derivations that are and aren't handled.
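To make the intent of the two rules concrete, here is a rough Python sketch of the same filtering applied to an analysis given as a list of tags. It is not twol and says nothing about whether the twol approach works. The tag names (`Der/A`, `Der/B`, `Der/C`) and the contents of the three sets are hypothetical placeholders; the real sets would come from the analyser and the Derivations page.

```python
import re

# Hypothetical tag names, for illustration only.
DERIVATION_TAGS = {"Der/A", "Der/B", "Der/C"}
UNHANDLED_DERIVATIONS = {"Der/C"}
POS_TAGS = {"N", "V", "A", "Adv"}

def tag_classes(tags):
    """Encode each tag as D (derivation), P (part of speech) or O (other)."""
    return "".join(
        "D" if t in DERIVATION_TAGS else "P" if t in POS_TAGS else "O"
        for t in tags
    )

def is_allowed(tags):
    """Reject unhandled derivations and derivations of derivations."""
    if any(t in UNHANDLED_DERIVATIONS for t in tags):
        return False
    # Mirrors "AnyDerivationtag+ PoStag+ AnyDerivationtag+" from the second rule.
    return re.search(r"D+P+D+", tag_classes(tags)) is None

print(is_allowed(["V", "Der/A", "N", "Sg", "Nom"]))  # True
print(is_allowed(["V", "Der/A", "N", "Der/B", "A"]))  # False
print(is_allowed(["V", "Der/C", "N"]))                # False
```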
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
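The corpus check could look something like the sketch below, run over text that has already been translated by sme-nob. It assumes the markers appear at the start of whitespace-separated tokens; the function name `find_testvoc_errors` and the sample sentence are invented for illustration and are not real system output.

```python
import re
from collections import Counter

# Assumption: an error token starts with "#" or "@" right after whitespace
# (or at the start of the text).
ERROR_TOKEN = re.compile(r"(?<!\S)[#@]\S+")

def find_testvoc_errors(translated_text: str) -> Counter:
    """Count each "#"- or "@"-marked token in already-translated output."""
    return Counter(m.group(0) for m in ERROR_TOKEN.finditer(translated_text))

# Made-up sample, not real sme-nob output.
sample = "Dette er en #prøve og en @test og en #prøve ."
for token, count in find_testvoc_errors(sample).most_common():
    print(count, token)
```

Sorting by frequency puts the most common failures first, which is probably where fixing should start.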
Postchunk rules are needed for any chunk containing a determiner, pronoun, adjective, noun or verb. We can easily make sure each possible chunk name has a postchunk rule: new chunks are created in t1x with names like det_adj_nom, but may also be merged in t2x into e.g. det_adj_nom_conj_nom.
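A first approximation of that check is sketched below in Python. It assumes chunk names appear as literal `name` attributes on `<chunk>` elements in the transfer files, and that the postchunk file matches them through `<cat-item name="..."/>`. Names built dynamically (for instance when chunks are merged in t2x) are not caught, and having a category is only a proxy for having a rule. The XML fragments are toy examples, not taken from the real sme-nob files.

```python
import xml.etree.ElementTree as ET

def chunk_names(transfer_xml: str) -> set:
    """Collect literal chunk names from <chunk name="..."> elements."""
    root = ET.fromstring(transfer_xml)
    return {c.get("name") for c in root.iter("chunk") if c.get("name")}

def postchunk_names(postchunk_xml: str) -> set:
    """Collect chunk names matched by <cat-item name="..."/> elements."""
    root = ET.fromstring(postchunk_xml)
    return {c.get("name") for c in root.iter("cat-item") if c.get("name")}

def missing_postchunk_rules(transfer_xml: str, postchunk_xml: str) -> set:
    """Chunk names that are created but never matched in postchunk."""
    return chunk_names(transfer_xml) - postchunk_names(postchunk_xml)

# Toy fragments for illustration only.
T1X = '<transfer><out><chunk name="det_adj_nom"/><chunk name="adj_nom"/></out></transfer>'
T3X = '<postchunk><def-cat n="adj_nom"><cat-item name="adj_nom"/></def-cat></postchunk>'
print(missing_postchunk_rules(T1X, T3X))  # {'det_adj_nom'}
```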
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
Part-of-Speech | entries from bidix that are OK in nob.dix | in bidix but not in nob.dix | comments |
---|---|---|---|
verbs | 14017 | 486 | mostly mwe's missing |
nouns | 7466 | 7037 | mostly compounds missing |
proper nouns | 1885 | 15474 | |
adverbs | 96 | 143 | |
prepositions | 42 | 0 | :-) |
adjectives | 463 | 593 | |
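To compare the parts of speech more easily, the counts can be turned into coverage ratios (OK entries divided by all bidix entries checked). The small Python sketch below does that with the figures copied from the table; the helper name `coverage` is just for illustration.

```python
# (OK in nob.dix, in bidix but not in nob.dix), copied from the table above.
STATS = {
    "verbs": (14017, 486),
    "nouns": (7466, 7037),
    "proper nouns": (1885, 15474),
    "adverbs": (96, 143),
    "prepositions": (42, 0),
    "adjectives": (463, 593),
}

def coverage(ok: int, missing: int) -> float:
    """Share of checked bidix entries that nob.dix can generate."""
    return ok / (ok + missing)

for pos, (ok, missing) in STATS.items():
    print(f"{pos:13} {coverage(ok, missing):6.1%}")
```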
Schedule
Task | Date |
---|---|
Work on high priority bad translations, expand bidix coverage | until 2010-07-14 |
Remove unhandled derivations, ensure we have all postchunk rules | 2010-07-14…2010-07-18 |
Testvoc | 2010-07-18…2010-08-01 |
Tentative release date for apertium-sme-nob 0.1.0 | 2010-08-01 |
Update: Lene and Trond can work in August, so the real release might be at the beginning of September?