Northern Sámi and Norwegian/release
This page holds information about the release schedule for apertium-sme-nob.
Issues
sme-nob-specific stuff should be in sme-nob
e.g. langs/sme/tools/mt/apertium/tagsets/apertium.nob.relabel
Alternatively (ideally), there shouldn't be sme-nob-specific tag changes; all apertium pairs should have pretty similar tagsets.
Periods in abbreviations missing from lemma
Forms "nr" and "nr." get the exact same analysis:
$ echo -e 'nr\nnr.' | apertium -f none -d . sme-nob-morph
^nr/nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
^nr./nr<n><abbr><acc>/nr<n><abbr><attr>/nr<n><abbr><gen>/nr<n><abbr><nom>$
So when translating to Norwegian, we have no idea whether to include the dot or not.
If form "nr." had lemma "nr." this would be simple.
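A minimal sketch of how such cases could be listed from analyser output. It assumes Apertium stream format with one `^surface/analysis/...$` unit per line; the function name `dotless_lemmas` is hypothetical and not part of the repository.

```shell
# Print "surface<TAB>lemma" for every unit whose surface form ends in "."
# while at least one of its lemmas does not.
# Assumption: one ^surface/analysis/...$ unit per input line.
dotless_lemmas() {
  awk '
    substr($0, 1, 1) == "^" && substr($0, length($0), 1) == "$" {
      unit = substr($0, 2, length($0) - 2)   # strip leading ^ and trailing $
      n = split(unit, parts, "/")
      surface = parts[1]
      if (surface !~ /\.$/) next             # only dotted surface forms matter
      for (i = 2; i <= n; i++) {
        lemma = parts[i]
        sub(/<.*/, "", lemma)                # drop the tag string
        if (lemma !~ /\.$/) { print surface "\t" lemma; break }
      }
    }'
}
```

It could be fed from the command above, e.g. `echo -e 'nr\nnr.' | apertium -f none -d . sme-nob-morph | dotless_lemmas`, provided the units arrive one per line.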
High priority bad translations
What are the high-priority linguistic issues to deal with?
- TODO: Extend the bidix with entries from GTSVN
- see dev/inc* (verbs need transitivity, nouns need their gender checked, words with spaces shouldn't have them); see also these files, whose entries are not yet in the bidix:
- dev/bidix-pronouns.todo.dix
- dev/bidix-abbr.todo.dix
- dev/bidix-shcmp.todo.dix
- TODO: After bidix additions, Francis will run apertium-lex-learner to automatically discover lex.sel rules
- TODO: URL recognition (I tried implementing this in lexc; it worked on its own, but took ages to compile together with the regular lexc, not sure why)
- DONE: In chunker/t1x, add tags to certain verbs (e.g. using def-lists) that can be used in t3x to change/remove case-prepositions (see dev/valence.txt)
- TODO: fill out with more def-list entries
- TODO: go through dev/valence.txt and mark off which examples can be handled with noun-lists, and which with verb-lists
Derivations mess up CG
Mainly a problem with the PoS-changing derivations.
In lookup2cg, PoS tags are given stars if they appear before derivational tags, so "V Der N" becomes "V* Der N". We could do this with twol, but we have no way of removing them before bidix.
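The starring described above can be sketched roughly as follows. This is an illustration, not the actual lookup2cg code: the PoS list (`N V A Adv`), the test for tags beginning with `Der`, and the function name `star_derived_pos` are all assumptions.

```shell
# Star every PoS tag that precedes a derivational tag,
# so that "V Der N" becomes "V* Der N".
# Assumptions: tags are space-separated, derivational tags start with "Der",
# and the PoS inventory is the short hypothetical list below.
star_derived_pos() {
  awk '
    BEGIN { split("N V A Adv", p, " "); for (k in p) pos[p[k]] = 1 }
    {
      last = 0
      for (i = 1; i <= NF; i++) if ($i ~ /^Der/) last = i        # last Der tag
      for (i = 1; i < last; i++) if ($i in pos) $i = $i "*"      # star earlier PoS
      print
    }'
}
```

Readings without a derivational tag pass through unchanged, so only the PoS-changing derivations are affected.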
Simple / clean solution: lexicalise.
Testvoc
Before release, we need to get testvoc out of the way – making sure there are no #'s and @'s in the output. The analyser is trimmed, but we can still have generator errors (and CG substitute errors). We can't create all possible surface forms from a cyclic HFST analyser, but we can at least run as large a corpus as we can find through sme-nob and look for # and @, and also send through the right-hand-sides of the bidix.
There are two helper scripts:
dev/sme-nob.inconsistency.sh | grep '^#'
should give no results. This script just sends the rhs of the bidix through the generator.
sed 's/^/sme\t/' smecorpus.txt | dev/analyse-all-stages.sh > analysedcorpus.txt
Here,
grep 'DGEN.*#'
should give no hits. This script also outputs the input and morph and tagger stages (and nob-sentence if handed a parallel corpus)
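For plain translated text (rather than the output of the two scripts above), a rough check for leftover markers might look like this. The function name `testvoc_hits` is hypothetical, and the pattern only assumes the usual Apertium convention that `#` and `@` are prefixed to the affected word.

```shell
# Print every input line containing a word that starts with # or @.
# A marker only counts at the start of a word, so an address such as
# "a@b.no" is not flagged. "|| true" keeps the exit status at 0 when
# there are no hits, which is the desired outcome for testvoc.
testvoc_hits() {
  grep -E '(^|[^[:alnum:]])[#@][^[:space:]]' || true
}
```

An empty result means the text is clean with respect to this check; any printed line points at a generator or bidix problem worth inspecting.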
Expanding the morphology
Running
hfst-fst2strings sme-nob.automorf.hfst.ol
creates an expansion of the morphology, which might be usable for testvoc.
There's a handy script
dev/gt-expand-to-bidix.sh
that takes as input one word and PoS (tab-separated) per line, e.g.
galbmit V
beana N
The script uses the giellatekno web-based lemma expander to create all forms of each lemma (since hfst is still not very good for that), then runs them through the bidix to check for missing entries.
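Since the input must be tab-separated, a small sanity check on the word list before handing it to the script may save a round trip. This is only a sketch: `check_lemma_list` is a hypothetical helper, and how dev/gt-expand-to-bidix.sh itself reads its input is not shown here.

```shell
# Report every line that is not exactly "lemma<TAB>PoS".
# Exits with status 1 if any malformed line was found, 0 otherwise.
check_lemma_list() {
  awk -F '\t' '
    NF != 2 || $1 == "" || $2 == "" { printf "line %d: %s\n", NR, $0; bad = 1 }
    END { exit bad }'
}
```

For example, `printf 'galbmit\tV\nbeana\tN\n' | check_lemma_list` prints nothing, while a line that uses a space instead of a tab is reported with its line number.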