Difference between revisions of "Northern Sámi and Norwegian/release"

 
(71 intermediate revisions by the same user not shown)
Line 3: Line 3:
   
 
==Issues==
 
  +
===sme-nob-specific stuff should be in sme-nob===
  +
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
  +
  +
Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.
  +
  +
===Periods in abbreviations missing from lemma===
  +
  +
Forms "nr" and "nr." get the exact same analysis:
  +
  +
<pre>
  +
$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph
  +
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
  +
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
  +
</pre>
  +
  +
So when translating to Norwegian, we have no idea whether to include the dot or not.
  +
  +
If form "nr." had lemma "nr." this would be simple.
  +
  +
 
===High priority bad translations===
 
 
What are the high-priority linguistic issues to deal with?
 
   
Would we gain a lot by inserting modals instead of adverbs for Pot/Cond verbs? Is there a better, general, way to translate the progressive? Should we get some of Francis' [[Generating lexical selection rules|automatically discovered lex.sel rules]]? And are there any "simple" constructions that we could handle but don't yet?
 
   
* DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now
 
* TODO: We should add a post-generator to insert epenthetics in compounds, turning eg. "ing~b" into "ingsb"
 
* TODO: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
 
 
* TODO: Bidix will be added to with stuff from GTSVN
** see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have)
  +
** see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have), see also these files with entries that are not in bidix:
  +
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-pronouns.todo.dix dev/bidix-pronouns.todo.dix]
  +
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-abbr.todo.dix dev/bidix-abbr.todo.dix]
  +
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-shcmp.todo.dix dev/bidix-shcmp.todo.dix]
* TODO: After bidix additions, Francis will run apertium-lex-learner
  +
* TODO: After bidix additions, Francis will run apertium-lex-learner to [[Generating lexical selection rules|automatically discover lex.sel rules]]
* TODO: add forms to nob.dix (that are in bidix) from nn-nb-infreq; add lexicalised compounds using existing lemmas; mwe's and other missing stuff manually
* TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
   
  +
* DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
  +
** TODO: fill out with more def-list entries
  +
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
  +
===Derivations mess up CG===
  +
Mainly a problem with the PoS-changing derivations.
  +
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
  +
Simple / clean solution: lexicalise.

===Derivations:===
Any [[Northern Sámi and Norwegian/Derivations|derivations]] that are not handled we remove from the analyser with a twol negation rule:

<pre>
UnhandledDerivations /<= _ ; ! fail if analysis contains a tag from the set UnhandledDerivations
</pre>

We could probably also write a rule like

<pre>
Derivation /<= Derivation+ PoS+ _ ;
</pre>

to remove any derivations of derivations, since these are not handled either unless there are explicit transfer rules for them. We should remove any unhandled derivations before testvoc. [[Northern Sámi and Norwegian/Derivations#Summary of fallbacks]] contains the list of derivations that are and aren't handled.
 
   
 
===Testvoc===
  +
Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.
  +
There are two helper scripts:
  +
* <code>dev/sme-nob.inconsistency.sh | grep '^#'</code> should give no results. This script just sends the rhs of the bidix through the generator.
  +
* <code>sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt</code> – here <code>grep 'DGEN.*#'</code> should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus)
  +
====Expanding the morphology====
  +
Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, which might be usable for testvoc.
  +
There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit	V
beana	N</pre>, and uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.

Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. As yet we don't have a way to create all possible surface forms from an [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @.

Postchunk rules are needed for any chunk containing a determiner/pronoun/adjective/noun/verb, we can easily make sure each possible chunk name has a postchunk rule (new chunks are created in t1x with names like det_adj_nom, but may also be merged in t2x to eg. det_adj_nom_conj_nom)
 
   
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far:
 
{|class=wikitable
 
! Part-of-Speech !! entries from bidix that are OK in nob.dix !! in bidix but not in nob.dix !! comments !! in sme analyser but not bidix
 
|-
 
| '''verbs''' || 14017 || '''486''' || mostly mwe's missing || ???
 
|-
 
| '''nouns''' || 7466 || '''7037''' || mostly compounds missing || ???
 
|-
 
| '''proper nouns''' || 1885 || '''15474''' || || ???
 
|-
 
| '''adverbs''' || 96 || '''143''' || || ???
 
|-
 
| '''prepositions''' || 42 || '''0''' || :-) || ???
 
|-
 
| '''adjectives''' || 463 || '''593''' || || ???
 
|-
 
| '''abbreviations''' || ??? || '''???''' || see dev/abbr.todo.dix || 619
 
|}
 
   
==Schedule==
 
{|class=wikitable
 
! Task !! Date
 
|-
 
| Work on high priority bad translations, expand bidix coverage || until 2010-07-14
 
|-
 
| Remove unhandled derivations, ensure we have all postchunk rules || 2010-07-14…2010-07-18
 
|-
 
| Testvoc || 2010-07-18…2010-08-01
 
|-
 
| '''Tentative release date for apertium-sme-nob 0.1.0''' || '''August 1st 2010'''
 
|-
 
|}
 
   
  +
[[Category:Northern Sámi and Norwegian|*]]
Update: Lene and Trond have more free time after August, real release beginning of September?
 

Latest revision as of 13:47, 15 September 2015

This page holds information about the release schedule for apertium-sme-nob.

Issues

sme-nob-specific stuff should be in sme-nob

e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If form "nr." had lemma "nr." this would be simple.
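
To get a feel for how many forms are affected, one could scan the analyser output for units whose surface form ends in a period while the lemma does not. The following is only a rough sketch under some assumptions: the function name dot_mismatches is made up for this page, and it expects one ^...$ unit per line, as in the output shown above.

```shell
# Hypothetical helper (not part of apertium-sme-nob): reads analyser output,
# one ^surface/analysis/...$ unit per line, and prints "surface<TAB>lemma"
# when the surface form ends in "." but the first lemma does not.
dot_mismatches() {
  awk -F'/' '
    /^\^/ && NF > 1 {
      surface = substr($1, 2)      # drop the leading ^
      lemma = $2
      sub(/<.*/, "", lemma)        # cut off the tags (and the closing $)
      if (lemma ~ /^\*/) next      # skip unknown words, marked with *
      if (surface ~ /\.$/ && lemma !~ /\.$/) print surface "\t" lemma
    }'
}
```

It could be fed from the same command as above, e.g. echo -e 'nr\nnr.' | apertium -f none -d . sme-nob-morph | dot_mismatches, which should then list "nr." paired with the lemma "nr".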


High priority bad translations

What are the high-priority linguistic issues to deal with?



  • DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
    • TODO: fill out with more def-list entries
    • TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.

Simple / clean solution: lexicalise.
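
For readers unfamiliar with the starring convention, here is a small illustration of the "V Der N" to "V* Der N" rewrite on space-separated tags. This is not how lookup2cg is implemented; the function name star_pos, the PoS set (V, N, A, Adv) and the assumption that derivational tags start with "Der" are all chosen just for this sketch.

```shell
# Illustration only: mimic the starring convention on space-separated tags.
# The PoS set (V, N, A, Adv) and the "Der" prefix are assumptions made for
# this sketch, not taken from lookup2cg.
star_pos() {
  # A PoS tag directly followed by a derivational tag gets a trailing star.
  sed -E 's/(^| )(V|N|A|Adv) (Der[^ ]*)/\1\2* \3/g'
}
```

With this, echo 'V Der N' | star_pos gives "V* Der N", while a plain reading such as "N Sg Nom" passes through unchanged. The sketch also shows the problem noted above: the stars are easy to add, but something still has to strip them again before bidix.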

Testvoc

Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.

There are two helper scripts:

  • dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
  • sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt – here grep 'DGEN.*#' should give no hits. This script also outputs the input, morph and tagger stages (and nob-sentence if handed a parallel corpus)
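
The "run a corpus through and look for # and @" check can also be done directly on translated text. The sketch below is a minimal example; the function name testvoc_errors is hypothetical, and it assumes plain translated text on standard input.

```shell
# Hypothetical helper: print every line (with its line number) that still
# contains a # or @ mark.
testvoc_errors() {
  grep -n '[#@]'
}
```

Something like apertium -d . sme-nob < smecorpus.txt | testvoc_errors would then list the offending lines, assuming the translation mode is called sme-nob. Note that grep exits with status 1 when it finds nothing, which is the good case here, and that a # or @ already present in the source text (e.g. in an e-mail address) will also be reported.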



Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

The script then uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), and runs them through bidix to check for missing bidix entries.
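
Since the input has to be exactly one word and one PoS per line with a tab in between, it may be worth checking a word list before handing it to the script. The following is a sketch of such a check; check_expand_input is a made-up name, and it only validates the format, it does not call dev/gt-expand-to-bidix.sh.

```shell
# Hypothetical validator for the "word<TAB>PoS" input format; it does not
# call dev/gt-expand-to-bidix.sh itself.
check_expand_input() {
  awk -F'\t' '
    NF != 2 || $1 == "" || $2 == "" {
      printf "line %d: expected word<TAB>PoS\n", NR
      bad = 1
    }
    END { exit bad + 0 }'
}
```

For example, printf 'galbmit\tV\nbeana\tN\n' | check_expand_input prints nothing and succeeds, whereas a line that uses a space instead of a tab is reported with its line number and makes the function return a non-zero status.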