Difference between revisions of "Northern Sámi and Norwegian/release"

Revision as of 07:48, 14 June 2014

This page holds information about the release schedule for apertium-sme-nob.

Issues

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.
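
The problem can be spotted mechanically in analyser output. The following is a minimal sketch, not part of the sme-nob tooling: the helper names `parse_unit` and `lost_period` are made up for illustration, and the parser assumes no escaped slashes occur inside a lexical unit.

```python
def parse_unit(unit):
    """Split one '^surface/analysis/...$' lexical unit into (surface, [lemmas])."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    # Assumption: no escaped '/' inside the unit, so a plain split is enough.
    surface, *analyses = body[1:-1].split("/")
    # The lemma is everything before the first '<' of an analysis.
    lemmas = [a.split("<", 1)[0] for a in analyses]
    return surface, lemmas


def lost_period(unit):
    """True if the surface form ends in '.' but none of its lemmas does."""
    surface, lemmas = parse_unit(unit)
    return surface.endswith(".") and not any(l.endswith(".") for l in lemmas)


lines = [
    "^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$",
    "^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$",
]
for line in lines:
    print(parse_unit(line)[0], lost_period(line))
```

Run on the two analyses above, only the "nr." unit is flagged, since its lemma "nr" has dropped the dot.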

High priority bad translations

What are the high-priority linguistic issues to deal with?

TODO: Add more bidix entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need their gender checked, words with spaces shouldn't have), and see also these files, whose entries are not yet in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
TODO: URL recognition (I tried making some with lexc; it worked by itself, but took ages to compile into the regular lexc, not sure why)

DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

TODO: Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.

Simple / clean solution: lexicalise.
Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of

<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

we have

<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>

where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.
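
To make the starring convention concrete, here is a small sketch of what lookup2cg is described as doing, written over a plain list of tags. It is only an illustration: the function name, the set of PoS tags, and the assumption that derivational tags start with "Der" are not taken from lookup2cg or the twol files.

```python
def star_pos_before_der(tags, pos_tags=("V", "N", "A", "Adv")):
    """Append '*' to every PoS tag that appears before a derivational tag."""
    # Assumption: derivational tags are recognisable by the prefix "Der".
    der_positions = [i for i, t in enumerate(tags) if t.startswith("Der")]
    if not der_positions:
        return list(tags)
    last_der = der_positions[-1]
    # Only PoS tags to the left of the last derivation are starred; the PoS
    # tag after it is the one CG rules should see.
    return [
        t + "*" if i < last_der and t in pos_tags else t
        for i, t in enumerate(tags)
    ]


print(" ".join(star_pos_before_der("V Der N".split())))
```

With the input "V Der N" this prints "V* Der N"; a reading without any derivational tag is returned unchanged.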

TODO: Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.

Generation report

The script dev/generation-test -r corpus translates a corpus and gives a frequency-sorted list of errors (words marked with #, / or @). Current status: http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/dev/generation-report.txt
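
The counting step of such a report can be sketched in a few lines. This is not the dev/generation-test script itself, just an illustration under simple assumptions: the translated text is split on whitespace, a word counts as marked if it starts with # or @ or contains /, and the function names are invented here. Ordinary text containing a slash would be a false positive.

```python
from collections import Counter


def is_marked(token):
    """True for words carrying one of the error marks: leading # or @, or a /."""
    return token.startswith(("#", "@")) or "/" in token


def error_report(translated_text):
    """Return (word, count) pairs for marked words, most frequent first."""
    counts = Counter(t for t in translated_text.split() if is_marked(t))
    return counts.most_common()


# Made-up sample output, only to show the shape of the result.
for word, count in error_report("a #x @y #x b"):
    print(count, word)
```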

Bidix inconsistency

The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:

Part-of-Speech	entries from bidix that are OK in nob.dix	in bidix but not in nob.dix	comments	in sme analyser but not bidix
verbs	2536	0	:)	???
nouns	11758	0	:)	???
proper nouns	28310	15159	look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh	???
adverbs	235	0	:) only one nob pardef, simple to add	???
prepositions	42	0	:)	lots missing from bidix still
adjectives	1056	0	:) (not sure if all forms are covered though)	???
abbreviations	???	0	:)	0
sub-/conjunctions	25	0	:)	???
pronouns	???	???		0
ShCmp	???	0	compound parts, removed from analyser	0
Numerals	???	???	bidix should be OK, not 100% sure, still lots missing from generator!	???
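
The first step of such a check, collecting the right-hand sides from the bidix, might look like the sketch below. It does not reproduce dev/sme-nob.inconsistency.sh: the function name and the `<section>` wrapper around the sample entry are assumptions for illustration, and lemmas containing `<b/>` (space) elements are not handled.

```python
import xml.etree.ElementTree as ET


def rhs_entries(dix_xml):
    """Yield (lemma, tags) for the <r> side of every bidix pair."""
    root = ET.fromstring(dix_xml)
    for r in root.iter("r"):
        # Lemma text is what sits directly inside <r> before the first <s/>.
        lemma = r.text or ""
        tags = [s.get("n") for s in r.findall("s")]
        yield lemma, tags


sample = """<section>
<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>
</section>"""

for lemma, tags in rhs_entries(sample):
    # Lemma plus tags in stream-format notation.
    print(lemma + "".join(f"<{t}>" for t in tags))
```

For the sample entry this prints bruke<vblex><pers>; the tags needed to request a particular base form from nob.dix would still have to be appended before generation.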

Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc. TODO

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script then uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), and runs them through bidix to check for missing bidix entries.

@@ Line 3: / Line 3: @@
 ==Issues==
+===Periods in abbreviations missing from lemma===
+Forms "nr" and "nr." get the exact same analysis:
+<pre>
+$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph
+^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
+^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
+</pre>
+So when translating to Norwegian, we have no idea whether to include the dot or not.
+If form "nr." had lemma "nr." this would be simple.
 ===High priority bad translations===
 What are the high-priority linguistic issues to deal with?
@@ Line 17: / Line 32: @@
-* DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
 * DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
 ** TODO: fill out with more def-list entries
 ** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
-* DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)
-===DONE: Compounds in CG===
-Compounds messed up CG: We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>. If CG sees <code>Nom</code> it applies rules that it shouldn't, etc.
-Fix: cg-proc now ignores anything up until the last baseform, so given <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>, the rules will only see <code>stašuvdna<N><Sg><Ill></code> (we have a special CG feature to refer to the other sub-readings)
 ===TODO: Derivations mess up CG===
@@ Line 38: / Line 46: @@
 <pre><e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e></pre>
 where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.
-===DONE: remove unhandled derivations===
-Any [[Northern Sámi and Norwegian/Derivations|derivations]] that are not handled we remove from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol, this makes the lexicon a lot easier to handle for testvoc.
-===DONE: +G3 tag===
-This is used like the +Actor tag, for sme wsd (here based on "stadieveksling").
-We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor), this makes transfer a lot simpler.
-Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix.
-===DONE: ensure we have all necessary postchunk rules===
-Postchunk rules are needed for any chunk containing a
-determiner/pronoun/adjective/noun/verb, we can easily make sure each
-possible chunk name has a postchunk rule (new chunks are created in
-t1x with names like pre_pre_nom, but may also be merged in t2x to
-eg. pre_pre_nom_conj_nom)
-All possible SN and SA chunks should have the needed postchunk rules now.
-===DONE: bidix pardef to handle CG changing Plc to Sur===
-sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.
-(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)
 ===TODO: Testvoc===
