Difference between revisions of "Northern Sámi and Norwegian/release"

From Apertium
 
(15 intermediate revisions by the same user not shown)
Line 3: Line 3:


==Issues==
===sme-nob-specific stuff should be in sme-nob===
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

===Periods in abbreviations missing from lemma===

Forms "nr" and "nr." get the exact same analysis:

<pre>
$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
</pre>

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.


===High priority bad translations===
What are the high-priority linguistic issues to deal with?
Line 17: Line 37:




* DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
* DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
** TODO: fill out with more def-list entries
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
* DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that)

===TODO [1/2]: Multiple identical tags per reading in CG===
* DONE: Compounds mess up CG. We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code> (instead of <code>politiija#stašuvdna<N><Sg><Ill></code> like the lookup2cg script gives). CG sees <code>Nom</code> and thus applies rules that it shouldn't, etc.
** The ideal solution, but currently impossible: rename the non-head tags in BEFORE-/AFTER-SECTIONS. I tried <code>BEFORE-SECTIONS SUBSTITUTE (N Sg Nom Cmp) (N* Sg* Nom* Cmp)</code> which works for the above, but CG then also turns <code>duohta<A><Sg><Nom><Cmp>+dilli<N><Sg><Acc></code> into <code>duohta<A><@→N>+dilli<N*><Sg*><Nom*><Cmp><Sg><Gen>$</code>, while if there are several non-heads, it'll only substitute in the first part. It seems some sort of tag order mechanism would be needed in CG for the BEFORE-SECTIONS/AFTER-SECTIONS stashing method to work.
** It's possible to do the initial renaming in a twol rule (committed, but commented out for now), but we have no way of changing the tags back in CG (an AFTER-SECTIONS rule <code>SUBSTITUTE N* N</code> will unfortunately merge N* and later occurrences of N).
** Fix: cg-proc now ignores anything up until the last baseform, so given <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>, the rules will only see <code>stašuvdna<N><Sg><Ill></code> (later we may have CG features to refer to the other sub-readings)
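
The effect of the fix above can be illustrated with a minimal Python sketch. This is not how cg-proc implements it; the function name <code>last_subreading</code> is hypothetical, and the sketch assumes that sub-readings are joined by a <code>+</code> that directly follows a closing <code>&gt;</code>.

```python
def last_subreading(reading: str) -> str:
    """Return the part of a compound reading that the CG rules would see.

    Assumption: sub-readings are joined with '+' right after a tag's
    closing '>', and no lemma contains the sequence '>+'.
    """
    # rpartition splits at the LAST '>+', so several non-heads are all dropped.
    _head, sep, tail = reading.rpartition(">+")
    return tail if sep else reading


compound = "politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill>"
print(last_subreading(compound))
```

A reading without any <code>+</code> join is returned unchanged, so simple readings are not affected.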

* TODO: Derivations mess up CG. In lookup2cg, PoS tags are given stars if they appear before derivational tags. We could do this with twol, but again have no way of removing them before bidix. Also, the CG sub-reading features won't help here since we can't ignore _all_ tags up until the last derivation; say we have <code>"lemma" V TV Der/n N Sg Ind</code>, lookup2cg gives <code>"lemma" V* TV Der/n N Sg Ind</code> (leaving the TV tag intact).
** Simple solution: lexicalise.
** Boring solution: do the twol renaming to add stars to any PoS-before-derivation, and then move all the PoS tags into pardefs:
<pre>
<pardef n="V" c="treat V* exactly like V for bidix lookup">
<e> <p><l><s n="V"/></l><r></r></p></e>
<e r="LR"><p><l><s n="V*"/></l><r></r></p></e>
</pardef>
...
<e><p><l>gieldit<s n="TV"/></l><r>forby<s n="vblex"/><s n="pers"/></r></p><par n="V"/><par n="__verb"/></e>
</pre>


===Derivations mess up CG===
===DONE: remove unhandled derivations===
Mainly a problem with the PoS-changing derivations.
Any [[Northern Sámi and Norwegian/Derivations|derivations]] that are not handled are removed from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol; this makes the lexicon a lot easier to handle for testvoc.


In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
===DONE: +G3 tag===
This is used like the +Actor tag, for sme wsd (here based on "stadieveksling").


Simple / clean solution: lexicalise.
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler.


===Testvoc===
Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix.
Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.


There are two helper scripts:
===DONE: ensure we have all necessary postchunk rules===
Postchunk rules are needed for any chunk containing a
determiner/pronoun/adjective/noun/verb. We can easily make sure each
possible chunk name has a postchunk rule (new chunks are created in
t1x with names like pre_pre_nom, but may also be merged in t2x to
e.g. pre_pre_nom_conj_nom).


* <code>dev/sme-nob.inconsistency.sh | grep '^#'</code> should give no results. This script just sends the rhs of the bidix through the generator.
All possible SN and SA chunks should have the needed postchunk rules now.
* <code>sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt</code> – here <code>grep 'DGEN.*#'</code> should give no hits. This script also outputs the input and the morph and tagger stages (and the nob sentence if handed a parallel corpus).


===DONE: bidix pardef to handle CG changing Plc to Sur===
sme-dis.rle can change arbitrary Plc-tagged proper nouns into Sur tags, so bidix needs a pardef that translates LR any Sur as if it were Plc.


(might also want to split entries where sme-lemma != nob-lemma into a Plc entry, and a Sur-one where sme==nob)


===TODO: Testvoc===
Before release, we need to get [[testvoc]] out of the way – making
sure there are no #'s and @'s in the output. As yet we don't have a
way to create all possible surface forms from an [[HFST]] analyser,
but we can at least run as large a corpus as we can find through
sme-nob and look for # and @.


====Expanding the morphology====
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, which might be usable for testvoc.
{|class=wikitable
! Part-of-Speech !! entries from bidix that are OK in nob.dix !! in bidix but not in nob.dix !! comments !! in sme analyser but not bidix
|-
| '''verbs''' || 2536 || '''0''' || :) || ???
|-
| '''nouns''' || 11758 || '''0''' || :) || ???
|-
| '''proper nouns''' || 28310 || '''15159''' || look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh || ???
|-
| '''adverbs''' || 235 || '''0''' || :) only one nob pardef, simple to add || ???
|-
| '''prepositions''' || 42 || '''0''' || :) || lots missing from bidix still
|-
| '''adjectives''' || 1056 || '''0''' || :) (not sure if all forms are covered though) || ???
|-
| '''abbreviations''' || ??? || '''0''' || :) || '''0'''
|-
| '''sub-/conjunctions''' || 25 || '''0''' || :) || ???
|-
| '''pronouns''' || ??? || ??? || || '''0'''
|-
| '''ShCmp''' || ??? || '''0''' || compound parts, removed from analyser || '''0'''
|-
| '''Numerals''' || ??? || ??? || should be OK, not 100% sure || '''???'''
|}


There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit V
These files contain words that need to be added to bidix in order for the translator to be consistent:
beana N</pre>, and uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.
* [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-pcle.todo.dix dev/bidix-pcle.todo.dix]


Also, these bugs in the sme-nob analyser in GTSVN need to be fixed:
* http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=873
* http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=857#c3
* http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=875
* http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=872


==Schedule==
{|class=wikitable
! Task !! Date
|-
| Work on high priority bad translations, expand bidix coverage || until 2010-07-14
|-
| Remove unhandled derivations, ensure we have all postchunk rules || 2010-07-14…2010-07-18
|-
| Testvoc || 2010-07-18…2010-08-01
|-
| '''Tentative release date for apertium-sme-nob 0.1.0''' || '''August 1st 2010'''
|-
|}


[[Category:Northern Sámi and Norwegian|*]]
Update: Lene and Trond have more free time after August, real release beginning of September?

Latest revision as of 13:47, 15 September 2015

This page holds information about the release schedule for apertium-sme-nob.

Issues

sme-nob-specific stuff should be in sme-nob

e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.
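
As a rough illustration of the problem, the following Python sketch parses the two analyser output lines shown above and confirms that only the surface form differs. The helper name <code>parse_lexical_unit</code> is hypothetical, and the sketch assumes a simple <code>^surface/reading/...$</code> layout without escaped slashes.

```python
def parse_lexical_unit(unit: str) -> tuple[str, list[str]]:
    """Split '^surface/reading1/reading2$' into (surface, [readings]).

    Assumption: no escaped '/' inside the unit (true for the nr example).
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *readings = body[1:-1].split("/")
    return surface, readings


# The two lines from the sme-nob-morph output above.
output = [
    "^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$",
    "^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$",
]
parsed = [parse_lexical_unit(u) for u in output]
surfaces = [surface for surface, _ in parsed]
same_readings = parsed[0][1] == parsed[1][1]
print(surfaces, same_readings)
```

Since the readings are identical, anything after the analyser that wants to keep the dot has to look at the surface form rather than the lemma.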


High priority bad translations

What are the high-priority linguistic issues to deal with?



  • DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
    • TODO: fill out with more def-list entries
    • TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.

Simple / clean solution: lexicalise.
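
The starring convention described above can be sketched in Python as follows. This only mimics the "V Der N" to "V* Der N" behaviour as described here and is not the lookup2cg code; the set <code>POS_TAGS</code> and the function name are assumptions made for the example.

```python
# Assumed set of part-of-speech tags; the real tagset is larger.
POS_TAGS = {"V", "N", "A", "Adv"}


def star_pos_before_derivation(tags: list[str]) -> list[str]:
    """Add '*' to every PoS tag that precedes a derivational tag.

    Assumption: derivational tags are exactly those starting with 'Der'.
    """
    # Index of the last derivational tag, or -1 if there is none.
    last_der = max(
        (i for i, tag in enumerate(tags) if tag.startswith("Der")),
        default=-1,
    )
    return [
        tag + "*" if i < last_der and tag in POS_TAGS else tag
        for i, tag in enumerate(tags)
    ]


print(star_pos_before_derivation(["V", "Der", "N"]))
```

Non-PoS tags that come before the derivation are left alone, so the starred tags would still have to be removed again before bidix lookup.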

Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.

There are two helper scripts:

  • dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
  • sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt – here grep 'DGEN.*#' should give no hits. This script also outputs the input and the morph and tagger stages (and the nob sentence if handed a parallel corpus).
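
For a quick check of already-translated text, a small Python sketch along the following lines could list every token carrying a # or @ mark. It is not one of the dev/ scripts; the function name <code>find_marks</code> and the sample sentences are made up for illustration, and it assumes the marks appear at the start of a whitespace-delimited token.

```python
import re

# '#' or '@' at the start of a token, the marks that testvoc looks for.
# Assumption: a mark directly after punctuation would not be caught.
MARK = re.compile(r"(?<!\S)[#@]\S+")


def find_marks(lines):
    """Return (line number, token) pairs for every marked token."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for match in MARK.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Invented sample output, not real sme-nob translations.
sample = [
    "Dette er et hus",
    "Jeg ser #hund og @beana",
]
for lineno, token in find_marks(sample):
    print(f"{lineno}: {token}")
```

An empty result on a large corpus does not prove the pair is testvoc-clean, since only forms that occur in the corpus are exercised.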



Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script then uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), and runs them through bidix to check for missing bidix entries.