Difference between revisions of "Northern Sámi and Norwegian/release"

Revision as of 10:22, 26 October 2011

This page holds information about the release schedule for apertium-sme-nob.

Issues

High priority bad translations

What are the high-priority linguistic issues to deal with?

TODO: Add more entries to bidix from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have); see also these files, whose entries are not yet in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
TODO: URL recognition (I tried writing some rules in lexc; they worked on their own, but took ages to compile into the regular lexc, not sure why)

DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)

DONE: Compounds in CG

Compounds messed up CG: We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>. If CG sees Nom it applies rules that it shouldn't, etc.

Fix: cg-proc now ignores anything up until the last baseform, so given politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>, the rules will only see stašuvdna<N><Sg><Ill> (we have a special CG feature to refer to the other sub-readings)
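
The effect of the fix can be sketched in a few lines of Python. This is only an illustration of which part of the reading the rules see, not how cg-proc is implemented; the function names are made up, and it assumes sub-readings are joined by a "+" that directly follows a closing ">".

```python
import re

# Assumption: sub-readings of a compound are joined by "+" right after a tag's ">".
_JOINER = re.compile(r"(?<=>)\+")


def split_subreadings(reading):
    """Split a compound reading into its sub-readings, left to right."""
    return _JOINER.split(reading)


def head_subreading(reading):
    """Return the last sub-reading, i.e. the only part the CG rules would see."""
    return split_subreadings(reading)[-1]


reading = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(head_subreading(reading))         # the head
print(split_subreadings(reading)[:-1])  # the non-heads, kept for bidix lookup
```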

TODO: Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.

Simple / clean solution: lexicalise.
Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of

<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

we have

<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>

where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.
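
For reference, the starring convention itself ("V Der N" becomes "V* Der N") can be sketched in Python as below. This is not the lookup2cg or twol code; the PoS tag set and the assumption that derivational tags start with "Der" are both guesses made for the sake of the example.

```python
# Assumption: a small, hypothetical set of PoS tags; the real tag set is larger.
POS_TAGS = {"N", "V", "A", "Adv"}


def star_pos_before_derivation(tags):
    """Append '*' to every PoS tag that appears before a derivational tag."""
    # Assumption: derivational tags are recognisable by the prefix "Der".
    der_positions = [i for i, tag in enumerate(tags) if tag.startswith("Der")]
    last_der = max(der_positions, default=-1)
    return [
        tag + "*" if tag in POS_TAGS and i < last_der else tag
        for i, tag in enumerate(tags)
    ]


print(" ".join(star_pos_before_derivation("V Der N".split())))
```

Only the PoS tag after the last derivational tag stays unstarred, so rules that refer to plain V or N match the final word class.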

DONE: remove unhandled derivations

Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.

DONE: +G3 tag

This is used like the +Actor tag, for sme wsd (here based on "stadieveksling", i.e. consonant gradation).

We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler.

Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix.
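
A rough Python equivalent of that grep step is sketched below. It assumes the analyser output is tab-separated (input form, then analysis) with "+"-separated tags; the function name and the placeholder lemmas in the usage lines are invented for illustration and are not real entries.

```python
def lemmas_with_g3(lookup_lines):
    """Collect lemmas whose own analysis carries the G3 tag."""
    found = set()
    for line in lookup_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        surface, analysis = fields[0], fields[1]
        lemma, *tags = analysis.split("+")
        # Keep only readings where the analysed form is the lemma itself.
        if lemma == surface and "G3" in tags:
            found.add(lemma)
    return found


# Placeholder data, not real analyser output.
sample = [
    "lemmaA\tlemmaA+N+G3+Sg+Nom",
    "lemmaB\tlemmaB+N+Sg+Nom",
]
print(sorted(lemmas_with_g3(sample)))
```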

DONE: ensure we have all necessary postchunk rules

Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x to eg. pre_pre_nom_conj_nom).

All possible SN and SA chunks should have the needed postchunk rules now.
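
The coverage check amounts to a set difference. A minimal sketch, assuming the chunk names and the names matched by postchunk rules have already been extracted into two lists (how to extract them from the transfer files is not shown, and the function name is hypothetical):

```python
def missing_postchunk_rules(chunk_names, postchunk_rule_names):
    """Return chunk names that no postchunk rule matches, sorted."""
    return sorted(set(chunk_names) - set(postchunk_rule_names))


# Chunk names as created in t1x or merged in t2x.
chunks = ["pre_pre_nom", "pre_pre_nom_conj_nom"]
# Names covered by postchunk rules (illustrative only).
rules = ["pre_pre_nom"]
print(missing_postchunk_rules(chunks, rules))
```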

DONE: bidix pardef to handle CG changing Plc to Sur

sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.

(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)

TODO: Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
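
The corpus check could look roughly like the Python sketch below. It assumes the translated text is read line by line and that a problem token starts with # or @ and is not preceded by a word character (so e-mail addresses are skipped); the function name and the sample output line are made up.

```python
import re

# A '#' or '@' mark at the start of a token, followed by the word it marks.
_MARK = re.compile(r"(?<!\w)[#@]\w+")


def find_marks(lines):
    """Return (line_number, token) pairs for every #- or @-marked token."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for token in _MARK.findall(line):
            hits.append((number, token))
    return hits


# Made-up sme-nob output, for illustration only.
output = ["Dette er #bruke og @geavahit .", "Alt ok, skriv til post@example.org ."]
for number, token in find_marks(output):
    print(number, token)
```

Sorting and counting the reported tokens gives a quick list of the most frequent generation and bidix gaps.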

The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:

Part-of-Speech	entries from bidix that are OK in nob.dix	in bidix but not in nob.dix	comments	in sme analyser but not bidix
verbs	2536	0	:)	???
nouns	11758	0	:)	???
proper nouns	28310	15159	look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh	???
adverbs	235	0	:) only one nob pardef, simple to add	???
prepositions	42	0	:)	lots missing from bidix still
adjectives	1056	0	:) (not sure if all forms are covered though)	???
abbreviations	???	0	:)	0
sub-/conjunctions	25	0	:)	???
pronouns	???	???		0
ShCmp	???	0	compound parts, removed from analyser	0
Numerals	???	???	bidix should be OK, not 100% sure, still lots missing from generator!	???

These generation/transfer errors need to be fixed:

http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/generation-report.txt

There's a handy script, dev/gt-expand-to-bidix.sh, that takes as input one word and PoS (tab-separated) per line, e.g.:

galbmit	V
beana	N

The script uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.

