Difference between revisions of "Northern Sámi and Norwegian/release"


Latest revision as of 13:47, 15 September 2015

This page holds information about the release schedule for apertium-sme-nob.

Issues

sme-nob-specific stuff should be in sme-nob

Example: langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel is specific to sme-nob but lives under langs/sme.

Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.

Periods in abbreviations missing from lemma

Forms "nr" and "nr." get the exact same analysis:

$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph 
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$

So when translating to Norwegian, we have no idea whether to include the dot or not.

If the form "nr." had the lemma "nr.", this would be simple.
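The problem can be made concrete with a minimal sketch that parses the two lexical units shown above and compares their analyses. The helper name `parse_lexical_unit` is made up for this illustration, and the splitting is deliberately naive (it assumes no escaped `/`, `^` or `$` inside the unit), so treat it as an illustration rather than a real stream-format parser.

```python
import re


def parse_lexical_unit(unit: str) -> tuple[str, list[str]]:
    """Split '^surface/analysis1/analysis2$' into (surface, analyses).

    Naive on purpose: assumes no escaped '/', '^' or '$' inside the unit.
    """
    match = re.fullmatch(r"\^(.*)\$", unit.strip())
    if match is None:
        raise ValueError(f"not a lexical unit: {unit!r}")
    surface, *analyses = match.group(1).split("/")
    return surface, analyses


# The two analyser outputs quoted in this section.
without_dot = "^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$"
with_dot = "^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$"

surface_a, analyses_a = parse_lexical_unit(without_dot)
surface_b, analyses_b = parse_lexical_unit(with_dot)

# The surface forms differ, but everything after the surface form is identical,
# so later stages cannot tell which form was in the input.
print(surface_a != surface_b and analyses_a == analyses_b)
```

Once the surface form is dropped, only the analyses are passed on, which is why the dot cannot be recovered on the Norwegian side unless it is part of the lemma.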


High priority bad translations

What are the high-priority linguistic issues to deal with?



  • DONE: In chunker/t1x, add tags to certain verbs (e.g. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
    • TODO: fill out with more def-list entries
    • TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists

Derivations mess up CG

Mainly a problem with the PoS-changing derivations.

In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could add the stars with twol, but we have no way of removing them again before bidix.

Simple / clean solution: lexicalise.
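To illustrate the starring convention described above, here is a minimal sketch. It is not the lookup2cg implementation: the `POS_TAGS` set is a hypothetical stand-in for the real tagset, and the sketch assumes that derivational tags can be recognised by the prefix "Der".

```python
# Hypothetical PoS tag set, for illustration only; the real tagset is larger.
POS_TAGS = {"N", "V", "A", "Adv"}


def star_pos_before_derivation(tags: list[str]) -> list[str]:
    """Append '*' to each PoS tag that is followed later by a derivational tag.

    Assumption: derivational tags start with 'Der'.
    """
    starred = []
    for index, tag in enumerate(tags):
        later_derivation = any(t.startswith("Der") for t in tags[index + 1:])
        if tag in POS_TAGS and later_derivation:
            starred.append(tag + "*")
        else:
            starred.append(tag)
    return starred


print(" ".join(star_pos_before_derivation("V Der N".split())))
```

Only the final PoS tag stays unstarred, so CG rules that look for a plain "V" no longer match the derived noun. Lexicalising avoids the issue entirely because the derived word then has no derivational tags in its analysis.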

Testvoc

Before release, we need to get testvoc out of the way, i.e. make sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send the right-hand sides of the bidix through the generator.

There are two helper scripts:

  • dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
  • sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt; here grep 'DGEN.*#' should give no hits. This script also outputs the input and the morph and tagger stages (and the nob sentence if handed a parallel corpus).
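The check that both scripts feed into grep can be sketched as a small scanner that reports every token marked with # or @. The function name `find_testvoc_errors` and the sample lines are invented for this illustration. The sketch assumes the marker sits at the start of a whitespace-separated token, which is a simplification of real pipeline output.

```python
import re

# A '#' or '@' at the start of a whitespace-separated token, followed by the word.
MARKED_WORD = re.compile(r"(?<!\S)([#@])(\S+)")


def find_testvoc_errors(lines):
    """Return (line_number, marker, word) for every token marked with # or @."""
    errors = []
    for line_number, line in enumerate(lines, start=1):
        for match in MARKED_WORD.finditer(line):
            errors.append((line_number, match.group(1), match.group(2)))
    return errors


# Invented sample output, not taken from a real sme-nob run.
sample = ["Dette er en #bil", "@beana løper", "ingen feil her"]
for line_number, marker, word in find_testvoc_errors(sample):
    print(f"line {line_number}: {marker}{word}")
```

Keeping the marker in the result makes it possible to tally # and @ hits separately when deciding what to fix first.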



Expanding the morphology

Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.

There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.

galbmit	V
beana	N

It uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.