Difference between revisions of "Northern Sámi and Norwegian/release"

Latest revision as of 13:47, 15 September 2015

This page holds information about the release schedule for apertium-sme-nob.

Issues

sme-nob-specific stuff should be in sme-nob

e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.
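
To see how widespread this is, one could scan analyser output for surface forms that end in a period while none of their lemmas do. The following is only a rough sketch: find_dotless_lemmas is a hypothetical helper (not part of the repository), and it assumes one ^...$ unit per line, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print surface forms ending in "." for which no
# reading has a lemma ending in ".".
# Assumes one lexical unit per line, of the form ^surface/reading/...$
find_dotless_lemmas() {
    awk '{
        unit = substr($0, 2, length($0) - 2)   # drop the surrounding ^ and $
        n = split(unit, parts, "/")
        surface = parts[1]
        if (surface !~ /\.$/) next             # only forms ending in a period
        for (i = 2; i <= n; i++) {
            lemma = parts[i]
            sub(/<.*/, "", lemma)              # strip the tags, leaving the lemma
            if (lemma ~ /\.$/) next            # some reading already keeps the period
        }
        print surface
    }'
}

# The two analyses shown above, shortened to two readings each:
printf '%s\n' \
    '^nr/nr<n><abbr><acc>/nr<n><abbr><nom>$' \
    '^nr./nr<n><abbr><acc>/nr<n><abbr><nom>$' | find_dotless_lemmas
# prints: nr.
```

In practice the input would come from the sme-nob-morph mode rather than from printf.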

High priority bad translations

What are the high-priority linguistic issues to deal with?

TODO: Add more entries to the bidix from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have); see also these files, whose entries are not yet in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)

DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
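
The starring convention can be illustrated with a small sketch. This is not the lookup2cg code: star_pos_before_der is a hypothetical helper, the PoS tag set (V, N, A, Adv) is an assumption, and tags are assumed to be space-separated with derivational tags starting with "Der".

```shell
#!/bin/sh
# Hypothetical helper: append "*" to every PoS tag that appears before
# the last derivational tag on the line.
# Assumptions: space-separated tags; derivational tags start with "Der";
# the PoS tags are V, N, A and Adv.
star_pos_before_der() {
    awk '{
        last_der = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^Der/) last_der = i      # position of the last derivation tag
        for (i = 1; i < last_der; i++)
            if ($i ~ /^(V|N|A|Adv)$/) $i = $i "*"
        print
    }'
}

echo 'V Der N' | star_pos_before_der
# prints: V* Der N
```

The final PoS tag stays unstarred, so CG rules only see the PoS of the derived word.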

Simple / clean solution: lexicalise.

Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.

There are two helper scripts:

- dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
- sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt – here grep 'DGEN.*#' should give no hits. This script also outputs the input, morph and tagger stages (and the nob-sentence if handed a parallel corpus).
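
The same kind of check can be run on plain translated output. A minimal sketch, assuming the # and @ marks appear at the start of a word; find_testvoc_errors is a hypothetical helper and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every line containing a
# word that starts with "#" or "@". Reads the named files, or stdin if
# none are given. Like grep, it exits non-zero when there are no hits.
find_testvoc_errors() {
    grep -nE '(^|[[:space:]])[#@][^[:space:]]' "$@"
}

# Made-up sample lines, only to show the behaviour:
printf '%s\n' 'clean line' 'a #broken word' '@missing lemma' |
    find_testvoc_errors
# prints:
# 2:a #broken word
# 3:@missing lemma
```

An exit status of 1 (no hits) is the outcome wanted before release.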

Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.
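
Since the input has to be exactly one lemma and one PoS per line, separated by a tab, it can help to validate the list first. A small sketch: check_lemma_pos is a hypothetical helper, and the commented-out call assumes the script reads its input on stdin, which should be checked against the script itself.

```shell
#!/bin/sh
# Hypothetical helper: report every line that is not exactly
# "lemma<TAB>PoS"; exits non-zero if any such line is found.
check_lemma_pos() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad }
    '
}

printf 'galbmit\tV\nbeana\tN\n' | check_lemma_pos && echo "input looks well-formed"
# prints: input looks well-formed

# Assumption: gt-expand-to-bidix.sh reads stdin.
# printf 'galbmit\tV\nbeana\tN\n' | dev/gt-expand-to-bidix.sh
```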

@@ Line 3: / Line 3: @@
 ==Issues==
+===sme-nob-specific stuff should be in sme-nob===
+e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
+Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.
+===Periods in abbreviations missing from lemma===
+Forms "nr" and "nr." get the exact same analysis:
+<pre>
+$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph
+^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
+^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
+</pre>
+So when translating to Norwegian, we have no idea whether to include the dot or not.
+If form "nr." had lemma "nr." this would be simple.
 ===High priority bad translations===
 What are the high-priority linguistic issues to deal with?
@@ Line 17: / Line 37: @@
-* DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
 * DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
 ** TODO: fill out with more def-list entries
 ** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
-* DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)
-===DONE: Compounds in CG===
+===Derivations mess up CG===
-Compounds messed up CG: We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>. If CG sees <code>Nom</code> it applies rules that it shouldn't, etc.
-Fix: cg-proc now ignores anything up until the last baseform, so given <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>, the rules will only see <code>stašuvdna<N><Sg><Ill></code> (we have a special CG feature to refer to the other sub-readings)
-===TODO: Derivations mess up CG===
 Mainly a problem with the PoS-changing derivations.
 In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
-* Simple / clean solution: lexicalise.
-* Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of
-<pre><e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e></pre>
-we have
-<pre><e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e></pre>
-where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.
-===DONE: remove unhandled derivations===
-Any [[Northern Sámi and Norwegian/Derivations|derivations]] that are not handled we remove from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol, this makes the lexicon a lot easier to handle for testvoc.
-===DONE: +G3 tag===
-This is used like the +Actor tag, for sme wsd (here based on "stadieveksling").
-We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor), this makes transfer a lot simpler.
-Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix.
-===DONE: ensure we have all necessary postchunk rules===
-Postchunk rules are needed for any chunk containing a
-determiner/pronoun/adjective/noun/verb, we can easily make sure each
-possible chunk name has a postchunk rule (new chunks are created in
-t1x with names like pre_pre_nom, but may also be merged in t2x to
-eg. pre_pre_nom_conj_nom)
+Simple / clean solution: lexicalise.
-All possible SN and SA chunks should have the needed postchunk rules now.
+===Testvoc===
-===DONE: bidix pardef to handle CG changing Plc to Sur===
+Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.
-sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.
+There are two helper scripts:
-(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)
+* <code>dev/sme-nob.inconsistency.sh | grep '^#'</code> should give no results. This script just sends the rhs of the bidix through the generator.
-===TODO: Testvoc===
+* <code>sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt</code> – here <code>grep 'DGEN.*#'</code> should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus)
-Before release, we need to get [[testvoc]] out of the way – making
-sure there are no #'s and @'s in the output. As yet we don't have a
-way to create all possible surface forms from an [[HFST]] analyser,
-but we can at least run as large a corpus as we can find through
-sme-nob and look for # and @.
-The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
-{|class=wikitable
-! Part-of-Speech !! entries from bidix that are OK in nob.dix !! in bidix but not in nob.dix !! comments !! in sme analyser but not bidix
-|-
-|  '''verbs'''  ||  2536  || '''0''' || :) || ???
-|-
-|  '''nouns'''  ||  11758  || '''0''' || :) || ???
-|-
-|  '''proper nouns'''  ||  28310  || '''15159''' || look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh || ???
-|-
-|  '''adverbs'''  ||  235  || '''0''' || :) only one nob pardef, simple to add || ???
-|-
-|  '''prepositions'''  ||  42  || '''0''' || :) || lots missing from bidix still
-|-
-|  '''adjectives'''  ||  1056  || '''0''' || :) (not sure if all forms are covered though) || ???
-|-
-|  '''abbreviations'''  ||  ???  || '''0''' || :)  || '''0'''
-|-
-|  '''sub-/conjunctions'''  ||  25  || '''0''' || :) || ???
-|-
-|  '''pronouns'''  ||  ???  || ??? ||  || '''0'''
-|-
-|  '''ShCmp'''  ||  ???  || '''0''' || compound parts, removed from analyser || '''0'''
-|-
-|  '''Numerals'''  ||  ???  || '''???''' || bidix should be OK, not 100% sure, still lots missing from generator! || '''???'''
-|}
-These generation/transfer errors need to be fixed:
-* http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/generation-report.txt
+====Expanding the morphology====
+Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, which might be usable for testvoc.
 There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit	V
