Difference between revisions of "Northern Sámi and Norwegian/release"

 
Line 5: Line 5:
===sme-nob-specific stuff should be in sme-nob===
===sme-nob-specific stuff should be in sme-nob===
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

===Periods in abbreviations missing from lemma===
===Periods in abbreviations missing from lemma===


Line 38: Line 41:
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists


===TODO: Derivations mess up CG===
===Derivations mess up CG===
Mainly a problem with the PoS-changing derivations.
Mainly a problem with the PoS-changing derivations.


In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
* Simple / clean solution: lexicalise.
* Boring / ugly solution: add stars with twol before any PoS-changing derivation tags, then instead of
<pre><e><p><l>geavahit<s n="V"/><s n="TV"/></l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__verb"/></e></pre>
we have
<pre><e><p><l>geavahit</l><r>bruke<s n="vblex"/><s n="pers"/></r></p><par n="__tverb"/></e></pre>
where __tverb adds the TV tag, and all PoS changing derivations use V* instead of V.


Simple / clean solution: lexicalise.
===TODO: Testvoc===

Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.
===Testvoc===
Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.

There are two helper scripts:

* <code>dev/sme-nob.inconsistency.sh | grep '^#'</code> should give no results. This script just sends the rhs of the bidix through the generator.
* <code>sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt</code> – here <code>grep 'DGEN.*#'</code> should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus)


===Generation report===
The script <code>dev/generation-test -r corpus</code> translates a corpus and gives a frequency-sorted list of errors (words marked with #, / or @).
Current status: http://apertium.svn.sourceforge.net/viewvc/apertium/staging/apertium-sme-nob/dev/generation-report.txt


===Bidix inconsistency===
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
{|class=wikitable
! Part-of-Speech !! entries from bidix that are OK in nob.dix !! in bidix but not in nob.dix !! comments !! in sme analyser but not bidix
|-
| '''verbs''' || 2536 || '''0''' || :) || ???
|-
| '''nouns''' || 11758 || '''0''' || :) || ???
|-
| '''proper nouns''' || 28310 || '''15159''' || look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh || ???
|-
| '''adverbs''' || 235 || '''0''' || :) only one nob pardef, simple to add || ???
|-
| '''prepositions''' || 42 || '''0''' || :) || lots missing from bidix still
|-
| '''adjectives''' || 1056 || '''0''' || :) (not sure if all forms are covered though) || ???
|-
| '''abbreviations''' || ??? || '''0''' || :) || '''0'''
|-
| '''sub-/conjunctions''' || 25 || '''0''' || :) || ???
|-
| '''pronouns''' || ??? || ??? || || '''0'''
|-
| '''ShCmp''' || ??? || '''0''' || compound parts, removed from analyser || '''0'''
|-
| '''Numerals''' || ??? || '''???''' || bidix should be OK, not 100% sure, still lots missing from generator! || '''???'''
|}




===Expanding the morphology===
====Expanding the morphology====
Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, might be possible to use for testvoc? TODO
Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, might be possible to use for testvoc.


There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit V
There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit V

Latest revision as of 13:47, 15 September 2015

This page holds information about the release schedule for apertium-sme-nob.

Issues

sme-nob-specific stuff should be in sme-nob

e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.


High priority bad translations

What are the high-priority linguistic issues to deal with?



  • DONE: In chunker/t1x, add tags to certain verbs (e.g. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
    • TODO: fill out with more def-list entries
    • TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.

Simple / clean solution: lexicalise.
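
For reference, the star-marking described above can be mimicked roughly like this. This is an illustration of the idea only, not how lookup2cg actually does it: <code>star_pos</code> is a made-up name, and the list of PoS tags (N, V, A, Adv) and the assumption that derivational tags start with "Der" are guesses made for the example.

```shell
# Hypothetical illustration (not the lookup2cg implementation):
# append a star to a PoS tag when the next tag is derivational.
# Assumes space-separated tags, PoS tags N/V/A/Adv, and derivational
# tags that begin with "Der".
star_pos() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^(N|V|A|Adv)$/ && $(i + 1) ~ /^Der/) $i = $i "*"
    print
  }'
}
```

With this, <code>printf 'V Der N\n' | star_pos</code> gives "V* Der N", while a tag string without a derivational tag passes through unchanged.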

Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.

There are two helper scripts:

  • dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
  • sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt – here grep 'DGEN.*#' should give no hits. This script also outputs the input and the morph and tagger stages (and the nob sentence if handed a parallel corpus).
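
When running a large corpus through the translator, it can be handy to see which marked words are most frequent rather than just whether there are any. A minimal sketch, assuming plain translated text on stdin and that error marks appear at the start of whitespace-separated tokens; <code>count_marked</code> is a made-up name, not one of the dev/ scripts:

```shell
# Hypothetical helper (not one of the dev/ scripts): list tokens that
# start with # or @, most frequent first, with their counts.
count_marked() {
  tr -s '[:space:]' '\n' |   # one token per line
    grep -E '^[#@]' |        # keep only tokens marked with # or @
    sort | uniq -c |         # count identical tokens
    sort -rn                 # highest count first
}
```

The input would be whatever the translator outputs for the corpus; an empty result means no # or @ marks were found.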



Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script then uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), and runs them through bidix to check for missing bidix entries.