Difference between revisions of "Northern Sámi and Norwegian/release"

Revision as of 10:17, 6 November 2011

This page holds information about the release schedule for apertium-sme-nob.

Issues

High priority bad translations

What are the high-priority linguistic issues to deal with?

TODO: Bidix will be extended with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have); see also these files, whose entries are not in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)

DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)

DONE: Compounds in CG

Compounds messed up CG: We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>. If CG sees Nom it applies rules that it shouldn't, etc.

Fix: cg-proc now ignores anything up until the last baseform, so given politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>, the rules will only see stašuvdna<N><Sg><Ill> (we have a special CG feature to refer to the other sub-readings)
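
To illustrate what the rules end up seeing, here is a minimal Python sketch of "ignore anything up until the last baseform". It is not how cg-proc implements it; the function name last_subreading is made up for this example, and it assumes that a new sub-reading always starts at a "+" directly after a closing ">".

```python
import re


def last_subreading(reading):
    """Return the part of a compound reading that starts at the last baseform.

    Assumption for this sketch: sub-readings are joined by a "+" that
    directly follows the ">" closing the previous sub-reading's last tag.
    """
    parts = re.split(r"(?<=>)\+", reading)
    return parts[-1]


if __name__ == "__main__":
    print(last_subreading("politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"))
```

A reading without any compound boundary is returned unchanged.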

TODO: Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
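
The starring step described above can be sketched in Python as follows. This is only an illustration, not the lookup2cg code: the set POS_TAGS and the test "tag starts with Der" are assumptions made for the example.

```python
# Hypothetical PoS tag set for this sketch; the real analyser has more tags.
POS_TAGS = {"V", "N", "A", "Adv"}


def star_pos_before_derivation(tag_string):
    """Add "*" to every PoS tag that is followed by a derivational tag.

    Assumption: tags are space-separated and derivational tags start
    with "Der".
    """
    tags = tag_string.split()
    starred = []
    for i, tag in enumerate(tags):
        derivation_follows = any(t.startswith("Der") for t in tags[i + 1:])
        if tag in POS_TAGS and derivation_follows:
            starred.append(tag + "*")
        else:
            starred.append(tag)
    return " ".join(starred)


if __name__ == "__main__":
    print(star_pos_before_derivation("V Der N"))
```

Only the PoS tag after the last derivational tag stays unstarred, so CG rules that match on plain PoS tags see the final PoS of the derived word.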

Simple / clean solution: lexicalise.
Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of

<e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e>

we have

<e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e>

where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.

DONE: remove unhandled derivations

Any derivations that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol. This makes the lexicon a lot easier to handle for testvoc.

DONE: +G3 tag

This is used like the +Actor tag, for sme wsd (here based on "stadieveksling", i.e. consonant gradation).

We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor). This makes transfer a lot simpler.

Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix.
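
The grep pipeline above could look roughly like this in Python. It is a sketch under assumptions: noun entries are taken to have N as the first tag on the left side (by analogy with the V tag in the verb entry shown earlier), analyser output is taken to be one "lemma+Tag+Tag..." string per line, and both function names are invented for the example. Running the analyser itself is left out.

```python
import xml.etree.ElementTree as ET


def noun_lemmas(bidix_xml):
    """Collect left-side lemmas of entries whose first left-side tag is N.

    Assumption: the lemma is plain text directly inside <l>; lemmas
    containing <b/> (spaces) would need extra handling.
    """
    root = ET.fromstring(bidix_xml)
    lemmas = []
    for left in root.iter("l"):
        tags = [s.get("n") for s in left.findall("s")]
        if tags and tags[0] == "N":
            lemmas.append((left.text or "").strip())
    return lemmas


def lemmas_with_g3(analyses, lemmas):
    """Return the lemmas that have at least one analysis carrying G3."""
    wanted = set(lemmas)
    found = set()
    for line in analyses:
        lemma, _, rest = line.strip().partition("+")
        if lemma in wanted and "G3" in rest.split("+"):
            found.add(lemma)
    return sorted(found)
```

The resulting list can then be used to decide which bidix entries get the tag.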

DONE: ensure we have all necessary postchunk rules

Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb. We can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like pre_pre_nom, but may also be merged in t2x into e.g. pre_pre_nom_conj_nom).

All possible SN and SA chunks should have the needed postchunk rules now.
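
A check like the one described in this section can be sketched in Python as below. It assumes the t3x file names the chunks it matches in cat-item elements with a name attribute, and that the list of possible chunk names has been collected separately; chunks_without_rule is a made-up name. A def-cat that exists but is never used in a rule would wrongly count as covered, so this is a rough first pass only.

```python
import xml.etree.ElementTree as ET


def chunks_without_rule(chunk_names, t3x_xml):
    """Return the chunk names that no cat-item in the t3x file mentions.

    Assumption: postchunk categories are declared as
    <cat-item name="chunk_name"/>.
    """
    root = ET.fromstring(t3x_xml)
    covered = {item.get("name") for item in root.iter("cat-item")}
    return sorted(set(chunk_names) - covered)
```

For example, if only pre_pre_nom is declared in the t3x file, passing pre_pre_nom and pre_pre_nom_conj_nom returns just the merged name as missing.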

DONE: bidix pardef to handle CG changing Plc to Sur

sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.

(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)

TODO: Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.

Generation report

The script dev/generation-test -r corpus translates a corpus and gives a frequency sorted list of errors (words marked with #, / or @). Current status: http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/dev/generation-report.txt
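
For readers without the script at hand, the idea of a frequency-sorted error list can be sketched in Python like this. It is not the contents of dev/generation-test; it assumes the translated text is already available as lines, and that an error mark (#, / or @) sits at the start of a whitespace-delimited token.

```python
import re
from collections import Counter

# A token counts as an error if it starts with "#", "@" or "/".
MARKED = re.compile(r"(?<!\S)[#@/]\S+")


def error_frequencies(lines):
    """Count marked tokens and return (token, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(MARKED.findall(line))
    return counts.most_common()
```

Fixing the entries at the top of the list first gives the largest drop in marked words per fix.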

Bidix inconsistency

The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:

Part-of-Speech	entries from bidix that are OK in nob.dix	in bidix but not in nob.dix	comments	in sme analyser but not bidix
verbs	2536	0	:)	???
nouns	11758	0	:)	???
proper nouns	28310	15159	look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh	???
adverbs	235	0	:) only one nob pardef, simple to add	???
prepositions	42	0	:)	lots missing from bidix still
adjectives	1056	0	:) (not sure if all forms are covered though)	???
abbreviations	???	0	:)	0
sub-/conjunctions	25	0	:)	???
pronouns	???	???		0
ShCmp	???	0	compound parts, removed from analyser	0
Numerals	???	???	bidix should be OK, not 100% sure, still lots missing from generator!	???

Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc. TODO

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.
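
The input format (one word and PoS per line, tab-separated) is easy to get wrong when the list is typed by hand. A small Python sketch for validating such a list before handing it to the script might look like this; read_lemma_pos is a hypothetical helper, not part of dev/gt-expand-to-bidix.sh.

```python
def read_lemma_pos(lines):
    """Parse "word<TAB>PoS" lines into (word, pos) pairs.

    Blank lines are skipped; any other line that does not have exactly
    two non-empty tab-separated fields raises ValueError.
    """
    pairs = []
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 2 or not all(fields):
            raise ValueError("line %d: expected word<TAB>PoS, got %r" % (number, line))
        pairs.append((fields[0], fields[1]))
    return pairs
```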

