Difference between revisions of "Northern Sámi and Norwegian/release"

Revision as of 13:50, 15 July 2010

This page holds information about the release schedule for apertium-sme-nob.

Issues

High priority bad translations

What are the high-priority linguistic issues to deal with?

Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, more general way to translate the progressive? Should we adopt some of Francis' automatically discovered lex.sel rules? And are there any "simple" constructions that we could handle but don't yet?

DONE: Cond rules get modals, handled like the passive in t1x (instead of "kanskje" in t3x); Pot gets "da<adv>" for now
TODO: Add a post-generator to insert epenthetics in compounds, turning e.g. "ing~b" into "ingsb"
DONE: In the chunker (t1x), add tags to certain verbs (e.g. using def-lists) that t3x can use to change or remove case prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
TODO: Extend bidix with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need their gender checked, words with spaces shouldn't have them)
TODO: After the bidix additions, Francis will run apertium-lex-learner
TODO: Add forms to nob.dix (that are in bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; add MWEs and other missing entries manually
TODO: URL recognition (I tried writing this in lexc; it worked on its own, but took ages to compile into the regular lexc, not sure why)
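
The epenthesis item in the list above could be sketched roughly as follows. This is only an illustration in Python, assuming that "~" marks a compound boundary and that the only rule is "ing~" becoming "ings"; the function name and the choice to drop every other "~" are hypothetical, and the real fix would presumably live in an Apertium post-generation step rather than a script.

```python
import re

def insert_epenthetic(text: str) -> str:
    """Hypothetical rule: 'ing~' before another compound part becomes 'ings'."""
    # Insert the epenthetic "s" where a boundary follows "ing".
    text = re.sub(r"ing~(?=\w)", "ings", text)
    # Assumption: any remaining boundary marker is simply removed.
    return text.replace("~", "")

print(insert_epenthetic("ing~b"))
```

Other epenthetic patterns (if any are needed) would be added as further substitutions before the final clean-up line.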

DONE: remove unhandled derivations

Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.

DONE: +G3 tag

This is used like the +Actor tag, for sme WSD (here based on "stadieveksling", i.e. consonant gradation).

We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler.

Getting a list of lemmas that have +G3 is easy: grep all nouns from bidix, analyse their lemmas and grep for lemma+G3. Use this list to tag bidix.
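
The "grep for lemma+G3" step might look like this in Python. It is a sketch under assumptions: the analyser output is taken to be tab-separated lines of the form lemma, analysis, and the sample lemmas ("aaa", "bbb") are placeholders, not real sme entries.

```python
def lemmas_with_g3(analysis_lines):
    """Return the sorted lemmas whose analysis starts with '<lemma>+G3'."""
    found = set()
    for line in analysis_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lemma, analysis = fields[0], fields[1]
        if analysis.startswith(lemma + "+G3"):
            found.add(lemma)
    return sorted(found)

# Placeholder analyses in the assumed format.
sample = [
    "aaa\taaa+G3+N+Sg+Nom",
    "aaa\taaa+N+Sg+Nom",
    "bbb\tbbb+N+Sg+Nom",
]
print(lemmas_with_g3(sample))
```

The resulting list can then be matched against the noun entries in bidix to decide which ones get the tag.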

Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
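
A corpus check of this kind can be sketched in a few lines of Python. The sketch assumes the translated output is plain text, one sentence per line, and that a failed form shows up as a whitespace-delimited token starting with "#" or "@"; the sample sentences are invented.

```python
import re
from collections import Counter

# A token that begins with '#' or '@' (not an '@' in the middle of a token).
MARK = re.compile(r"(?<!\S)[#@]\S+")

def count_marks(lines):
    """Count each '#...' / '@...' token seen in the translated output."""
    counts = Counter()
    for line in lines:
        for match in MARK.finditer(line):
            counts[match.group(0)] += 1
    return counts

output = ["jeg ser #hus og @foo", "en #hus til", "mail meg: a@b.no"]
for token, n in count_marks(output).most_common():
    print(n, token)
```

Sorting by frequency puts the forms that would pay off most to fix at the top.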

Postchunk rules are needed for any chunk containing a determiner, pronoun, adjective, noun or verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like det_adj_nom, but may also be merged in t2x into e.g. det_adj_nom_conj_nom).
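
One way to run that check is to compare the chunk names produced in t1x/t2x with the names matched in t3x. The Python below is a rough sketch, assuming chunk names appear as the name attribute of chunk elements and that the postchunk file matches them through cat-item name attributes; chunks named dynamically (e.g. via namefrom) would not be caught, and a matching def-cat does not prove a rule actually uses it. The XML snippets are cut-down, invented examples.

```python
import xml.etree.ElementTree as ET

def chunk_names(transfer_xml: str) -> set:
    """Names of chunks created in a t1x/t2x file (literal names only)."""
    root = ET.fromstring(transfer_xml)
    return {c.get("name") for c in root.iter("chunk") if c.get("name")}

def postchunk_names(postchunk_xml: str) -> set:
    """Chunk names that the t3x file has a category for."""
    root = ET.fromstring(postchunk_xml)
    return {c.get("name") for c in root.iter("cat-item") if c.get("name")}

def missing_postchunk_rules(transfer_xmls, postchunk_xml):
    created = set().union(*(chunk_names(x) for x in transfer_xmls))
    return sorted(created - postchunk_names(postchunk_xml))

T1X = ('<transfer><rule><out><chunk name="det_adj_nom"/></out></rule>'
       '<rule><out><chunk name="adj_nom"/></out></rule></transfer>')
T3X = ('<postchunk><def-cat n="det_adj_nom">'
       '<cat-item name="det_adj_nom"/></def-cat></postchunk>')

print(missing_postchunk_rules([T1X], T3X))
```

With real files, the strings would be replaced by the contents of the t1x, t2x and t3x files.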

The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:

Part-of-Speech	entries from bidix that are OK in nob.dix	in bidix but not in nob.dix	comments	in sme analyser but not bidix
verbs	14017	486	mostly mwe's missing	???
nouns	7466	7037	mostly compounds missing	???
proper nouns	1885	15474		???
adverbs	96	143		???
prepositions	42	0	:-)	???
adjectives	463	593		???
abbreviations	???	???	see dev/abbr.todo.dix	619
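
To make the table easier to compare across parts of speech, the share of bidix entries that nob.dix can generate can be computed directly from the two count columns. A small Python sketch, with the numbers copied from the table (abbreviations are left out because their counts are unknown):

```python
# (OK in nob.dix, in bidix but not in nob.dix), copied from the table above.
STATS = {
    "verbs": (14017, 486),
    "nouns": (7466, 7037),
    "proper nouns": (1885, 15474),
    "adverbs": (96, 143),
    "prepositions": (42, 0),
    "adjectives": (463, 593),
}

def coverage(ok: int, missing: int) -> float:
    """Fraction of bidix entries whose base form nob.dix generates."""
    total = ok + missing
    return ok / total if total else 0.0

for pos, (ok, missing) in STATS.items():
    print(f"{pos}: {coverage(ok, missing):.1%}")
```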

Schedule

Task	Date
Work on high priority bad translations, expand bidix coverage	until 2010-07-14
Remove unhandled derivations, ensure we have all postchunk rules	2010-07-14…2010-07-18
Testvoc	2010-07-18…2010-08-01
Tentative release date for apertium-sme-nob 0.1.0	August 1st 2010

Update: Lene and Trond have more free time after August; real release at the beginning of September?

@@ Line 14: / Line 14: @@
 * DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
 * TODO: We should add a post-generator to insert epenthetics in compounds, turning eg. "ing~b" into "ingsb"
-* TODO: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
+* DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
+** TODO: fill out with more def-list entries
 * TODO: Bidix will be added to with stuff from GTSVN
 ** see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have)
