Difference between revisions of "Northern Sámi and Norwegian/release"
(→Issues) |
|||
(28 intermediate revisions by the same user not shown) | |||
Line 3: | Line 3: | ||
==Issues== |
==Issues== |
||
===sme-nob-specific stuff should be in sme-nob=== |
|||
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel |
|||
Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets. |
|||
===Periods in abbreviations missing from lemma=== |
|||
Forms "nr" and "nr." get the exact same analysis: |
|||
<pre> |
|||
$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph |
|||
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$ |
|||
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$ |
|||
</pre> |
|||
So when translating to Norwegian, we have no idea whether to include the dot or not. |
|||
If form "nr." had lemma "nr." this would be simple. |
|||
===High priority bad translations=== |
===High priority bad translations=== |
||
What are the high-priority linguistic issues to deal with? |
What are the high-priority linguistic issues to deal with? |
||
Line 8: | Line 28: | ||
* TODO: Bidix will be added to with stuff from GTSVN |
* TODO: Bidix will be added to with stuff from GTSVN |
||
** see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have) |
** see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have), see also these files with entries that are not in bidix: |
||
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-pronouns.todo.dix dev/bidix-pronouns.todo.dix] |
|||
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-abbr.todo.dix dev/bidix-abbr.todo.dix] |
|||
** [http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-sme-nob/dev/bidix-shcmp.todo.dix dev/bidix-shcmp.todo.dix] |
|||
* TODO: After bidix additions, Francis will run apertium-lex-learner to [[Generating lexical selection rules|automatically discover |
* TODO: After bidix additions, Francis will run apertium-lex-learner to [[Generating lexical selection rules|automatically discover |
||
lex.sel rules]] |
lex.sel rules]] |
||
Line 14: | Line 37: | ||
* DONE: Cond rules get modals, handled like passive in t1x (instead of "kanskje" in t3x), Pot gets "da<adv>" for now |
|||
* DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt) |
* DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt) |
||
** TODO: fill out with more def-list entries |
** TODO: fill out with more def-list entries |
||
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists |
** TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists |
||
* DONE: Insert epenthetics in compounds (nob.dix has n.*.sg.ind.cmp forms for all nouns, outputting the right epenthetic based on that) |
|||
===Derivations mess up CG=== |
|||
===TODO [1/2]: Multiple identical tags per reading in CG=== |
|||
Mainly a problem with the PoS-changing derivations. |
|||
* DONE: Compounds mess up CG. We have to leave all the tags of the non-heads in because of bidix lookup, so we get eg. <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code> (instead of <code>politiija#stašuvdna<N><Sg><Ill></code> like the lookup2cg script gives). CG sees <code>Nom</code> and thus applies rules that it shouldn't, etc. |
|||
** The ideal solution, but currently impossible: rename the non-head tags in BEFORE-/AFTER-SECTIONS. I tried <code>BEFORE-SECTIONS SUBSTITUTE (N Sg Nom Cmp) (N* Sg* Nom* Cmp)</code> which works for the above, but CG then also turns <code>duohta<A><Sg><Nom><Cmp>+dilli<N><Sg><Acc></code> into <code>duohta<A><@→N>+dilli<N*><Sg*><Nom*><Cmp><Sg><Gen>$</code>, while if there are several non-heads, it'll only substitute in the first part. It seems some sort of tag order mechanism would be needed in CG for the BEFORE-SECTIONS/AFTER-SECTIONS stashing method to work. |
|||
** It's possible to do the initial renaming in a twol rule (committed, but commented out for now), but we have no way of changing the tags back in CG (an AFTER-SECTIONS rule SUBSTITUTE N* N will unfortunately merge N* and later occurrences of N). |
|||
** Fix: cg-proc now ignores anything up until the last baseform, so given <code>politiija<N><Sg><Nom><Cmp>+stašuvdna<N><Sg><Ill></code>, the rules will only see <code>stašuvdna<N><Sg><Ill></code> (later we may have CG features to refer to the other sub-readings) |
|||
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix. |
|||
** Simple solution: lexicalise. |
|||
** Boring solution: do the twol renaming to add stars to any PoS-before-derivation, and then move all the PoS tags into pardefs: |
|||
<pre> |
|||
<pardef n="V" c="treat V* exactly like V for bidix lookup"> |
|||
<e> <p><l><s n="V"/></l><r></r></p></e> |
|||
<e r="LR"><p><l><s n="V*"/></l><r></r></p></e> |
|||
</pardef> |
|||
... |
|||
<e><p><l>gieldit<s n="TV"/></l><r>forby<s n="vblex"/><s n="pers"/></r></p><par n="V"/><par n="__verb"/></e> |
|||
</pre> |
|||
Simple / clean solution: lexicalise. |
|||
===DONE: remove unhandled derivations=== |
|||
Any [[Northern Sámi and Norwegian/Derivations|derivations]] that are not handled we remove from the analyser with a twol negation rule in dev/xfst2apertium.useless.twol; this makes the lexicon a lot easier to handle for testvoc. |
|||
=== |
===Testvoc=== |
||
Before release, we need to get [[testvoc]] out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic [[HFST]] analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix. |
|||
This is used like the +Actor tag, for sme wsd (here based on "stadieveksling"). |
|||
There are two helper scripts: |
|||
We move it after the +N tag in dev/xfst2apertium.hashtags.twol (so it gets the same position as +Actor); this makes transfer a lot simpler. |
|||
* <code>dev/sme-nob.inconsistency.sh | grep '^#'</code> should give no results. This script just sends the rhs of the bidix through the generator. |
|||
Getting a list of lemmas which have +G3 is easy (grep all nouns from bidix, analyse their lemmas and grep for lemma+G3), use this to tag bidix. |
|||
* <code>sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt</code> – here <code>grep 'DGEN.*#'</code> should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus) |
|||
===DONE: ensure we have all necessary postchunk rules=== |
|||
Postchunk rules are needed for any chunk containing a |
|||
determiner/pronoun/adjective/noun/verb, we can easily make sure each |
|||
possible chunk name has a postchunk rule (new chunks are created in |
|||
t1x with names like pre_pre_nom, but may also be merged in t2x to |
|||
eg. pre_pre_nom_conj_nom) |
|||
All possible SN and SA chunks should have the needed postchunk rules now. |
|||
===Testvoc=== |
|||
Before release, we need to get [[testvoc]] out of the way – making |
|||
sure there are no #'s and @'s in the output. As yet we don't have a |
|||
way to create all possible surface forms from an [[HFST]] analyser, |
|||
but we can at least run as large a corpus as we can find through |
|||
sme-nob and look for # and @. |
|||
====Expanding the morphology==== |
|||
The script dev/sme-nob.inconsistency.sh tries to generate (with nob.dix) base forms (like infinitives or singular indefinites) of the various rhs entries in the bidix. Statistics so far: |
|||
Running <code>hfst-fst2strings sme-nob.automorf.hfst.ol</code> creates an expansion of the morphology, which might be usable for testvoc. |
|||
{|class=wikitable |
|||
! Part-of-Speech !! entries from bidix that are OK in nob.dix !! in bidix but not in nob.dix !! comments !! in sme analyser but not bidix |
|||
There's a handy script <code>dev/gt-expand-to-bidix.sh</code> that takes as input one word and PoS (tab-separated) per line, e.g. <pre>galbmit V |
|||
|- |
|||
beana N</pre>, and uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries. |
|||
| '''verbs''' || 2536 || '''0''' || :) || ??? |
|||
|- |
|||
| '''nouns''' || 11758 || '''0''' || :) || ??? |
|||
|- |
|||
| '''proper nouns''' || 28310 || '''15159''' || look for PPP mark in bidix, use dev/props-from-bidix-to-nob.sh || ??? |
|||
|- |
|||
| '''adverbs''' || 235 || '''1''' || only one nob pardef, simple to add || ??? |
|||
|- |
|||
| '''prepositions''' || 42 || '''0''' || :-) || lots missing from bidix still |
|||
|- |
|||
| '''adjectives''' || 1056 || '''7''' || ones left should have other translations (not sure if all forms are covered though) || ??? |
|||
|- |
|||
| '''abbreviations''' || ??? || '''???''' || see dev/abbr.todo.dix || 619 |
|||
|- |
|||
| '''sub-/conjunctions''' || 25 || '''0''' || :) || ??? |
|||
|- |
|||
| '''pronouns''' || ??? || ??? || some MWE ones missing at least || ??? |
|||
|} |
|||
==Schedule== |
|||
{|class=wikitable |
|||
! Task !! Date |
|||
|- |
|||
| Work on high priority bad translations, expand bidix coverage || until 2010-07-14 |
|||
|- |
|||
| Remove unhandled derivations, ensure we have all postchunk rules || 2010-07-14…2010-07-18 |
|||
|- |
|||
| Testvoc || 2010-07-18…2010-08-01 |
|||
|- |
|||
| '''Tentative release date for apertium-sme-nob 0.1.0''' || '''August 1st 2010''' |
|||
|- |
|||
|} |
|||
[[Category:Northern Sámi and Norwegian|*]] |
|||
Update: Lene and Trond have more free time after August, real release beginning of September? |
Latest revision as of 13:47, 15 September 2015
This page holds information about the release schedule for apertium-sme-nob.
Issues
sme-nob-specific stuff should be in sme-nob
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.
Periods in abbreviations missing from lemma
Forms "nr" and "nr." get the exact same analysis:
$ echo -e 'nr\nnr.'|apertium -f none -d . sme-nob-morph
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
So when translating to Norwegian, we have no idea whether to include the dot or not.
If form "nr." had lemma "nr." this would be simple.
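The problem can be checked mechanically. The following is only a sketch: it works on the two analyser output lines quoted above as canned input instead of calling apertium, and the helper name analyses_only is made up for illustration. It strips the surface form from each lexical unit and compares what is left.

```shell
# Hypothetical helper (not part of apertium): drop the leading "^surface/"
# and the trailing "$" from a lexical unit, leaving only the analyses.
analyses_only() {
  sed -e 's|^\^[^/]*/||' -e 's|\$$||'
}

# The two analyser output lines quoted above, used as canned input.
without_dot=$(printf '%s\n' '^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$' | analyses_only)
with_dot=$(printf '%s\n' '^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$' | analyses_only)

# Once the surface form is gone, nothing distinguishes the two inputs.
if [ "$without_dot" = "$with_dot" ]; then
  echo "identical analyses: the dot is lost after analysis"
fi
```

Since the later stages only see the analyses, a lemma "nr." for the form "nr." would be the only thing that could carry the dot through to generation.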
High priority bad translations
What are the high-priority linguistic issues to deal with?
- TODO: The bidix will be extended with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need gender checked, words with spaces shouldn't have), see also these files with entries that are not in bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
- TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
- TODO: URL recognition (I tried making some with lexc, it worked by itself, but took ages to compile into the regular lexc, not sure why)
- DONE: In chunker/t1x, add tags to certain verbs (eg. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
Derivations mess up CG
Mainly a problem with the PoS-changing derivations.
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
Simple / clean solution: lexicalise.
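To make the starring concrete, here is a rough sketch of the renaming on space-separated tag strings. It is not the lookup2cg code; the function name, the PoS inventory (N, V, A, Adv) and the assumption that derivational tags start with "Der" are all guesses made for the example.

```shell
# Hypothetical re-implementation of the starring described above.
# Assumptions: PoS tags are N, V, A or Adv; derivational tags begin with "Der".
star_pos_before_der() {
  awk '{
    last_der = 0
    # Find the position of the last derivational tag in the reading.
    for (i = 1; i <= NF; i++) if ($i ~ /^Der/) last_der = i
    # Star every PoS tag that comes before it.
    for (i = 1; i < last_der; i++) if ($i ~ /^(N|V|A|Adv)$/) $i = $i "*"
    print
  }'
}

echo 'V Der N' | star_pos_before_der     # prints: V* Der N
echo 'N Sg Nom' | star_pos_before_der    # prints: N Sg Nom
```

The renaming itself is simple; the open problem named above is that nothing in the pipeline turns V* back into V before bidix lookup.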
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.
There are two helper scripts:
- dev/sme-nob.inconsistency.sh | grep '^#' should give no results. This script just sends the rhs of the bidix through the generator.
- sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt – here grep 'DGEN.*#' should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus).
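For plain translated text (rather than the output of the two scripts), the same check can be sketched as below. The helper name find_testvoc_marks and the sample sentence are made up; the sketch only catches marks at the start of a whitespace-delimited token, so a mark directly after punctuation would be missed.

```shell
# Hypothetical helper: print every whitespace-delimited token that starts
# with "#" (generation failure) or "@" (missing from bidix).
find_testvoc_marks() {
  # One token per line, then keep the marked ones.
  tr -s ' \t' '\n\n' | grep -E '^[#@]'
}

# Made-up sample output containing one mark of each kind.
printf '%s\n' 'en #hund<n> og @beana her' | find_testvoc_marks
```

grep exits non-zero when nothing matches, so the function can be used directly in a condition: no output and a failing exit status mean the text is clean.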
Expanding the morphology
Running hfst-fst2strings sme-nob.automorf.hfst.ol creates an expansion of the morphology, which might be usable for testvoc.
There's a handy script dev/gt-expand-to-bidix.sh that takes as input one word and PoS (tab-separated) per line, e.g.
galbmit	V
beana	N
It uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through bidix to check for missing bidix entries.
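Since the input has to be tab-separated, a quick sanity check before running the script may save a round trip to the web expander. This is only a sketch; check_expand_input is a made-up name and is not part of dev/.

```shell
# Hypothetical check: report lines that are not exactly "word<TAB>PoS".
check_expand_input() {
  awk -F '\t' '
    NF != 2 || $1 == "" || $2 == "" { printf "line %d: %s\n", NR, $0; bad = 1 }
    END { exit bad + 0 }
  '
}

# The two example lines from above, written with real tabs.
printf 'galbmit\tV\nbeana\tN\n' | check_expand_input && echo "input looks well-formed"
```

A line where the word and PoS are separated by a space instead of a tab is reported with its line number, and the function exits non-zero.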