User:Unhammer/wishlist
My wishlist for Apertium features (mostly just useful for language pair developers).
See also Talk:Northern Sámi and Norwegian#Wishlist / Difficulties with the architecture / Ugly_hacks
Contents
- 1 some error handling in lt-comp
- 2 lttoolbox section type=samecase
- 3 lttoolbox section type=aftercompounding
- 4 call-macro with-param var?
- 5 lt-proc mode for adding analyses to already analysed text
- 6 Fallthrough option in transfer
- 7 Keep surface ("superficial") forms at least until transfer
- 8 Allow the chunk tag wherever we allow other "strings"
- 9 Allow "postchunking" of chunks in interchunk
- 10 A "grouping" tag for bidix
- 11 option to output pardefs in lt-expand
some error handling in lt-comp
This really ought to fail:
$ lt-comp lr nonexistentfile foo.bin
main@standard 1 0
$ echo $?
0
This ought to fail with a warning and a line number:
$ lt-comp lr whitespace-beginning.dix foo.bin
main@standard 20 22
$ echo 'meh' | lt-proc foo.bin
Error: Invalid dictionary (hint: entry beginning with whitespace)
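Not lt-comp's actual code, just a Python sketch of the desired behaviour (the function name, the exact messages, and the whitespace check are all made up for illustration):

```python
import os
import re
import sys

def compile_dix(path):
    """Return a shell-style exit status: non-zero when the input file is
    missing, plus a warning with a line number for entries that begin
    with whitespace.  The whitespace check is a rough approximation."""
    if not os.path.exists(path):
        sys.stderr.write("Error: cannot open '%s'\n" % path)
        return 1
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            # an <l> whose content starts with whitespace is almost
            # certainly a mistake in the dictionary
            if re.search(r"<l>\s", line):
                sys.stderr.write("Warning (%s:%d): entry beginning with "
                                 "whitespace\n" % (path, lineno))
    return 0
```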
lttoolbox section type=samecase
For words where you don't want uppercase input to match lowercase forms.
For the postblank/preblank/inconditional sections, we only check the section type when we've reached a final state; during step application they're considered one and the same transducer (unioned with an epsilon transition from root).
For a "samecase" section we'd have to have two current_state State pointers, e.g. State *current_state_regular and State *current_state_samecase (and similarly two initial states). Alternatively, we could run "lt-proc --casesensitive", and preprocess the non-samecase sections so every a:a transition turned into [aA]:a. That should make it possible to have samecase not as a section type, but as another attribute on a section casefold=false.
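The preprocessing idea could be sketched like this: every transition whose input symbol is lowercase gets a duplicate accepting the uppercase variant, after which the whole thing can be run case-sensitively. Transitions are modelled here as simple tuples; the real lttoolbox representation is different.

```python
def casefold_transitions(transitions):
    """Turn every a:a transition into [aA]:a, i.e. add an extra A:a
    transition, so a case-sensitive run behaves case-insensitively
    for these sections.  transitions: (state, input, output, target)."""
    out = []
    for (src, inp, outp, trg) in transitions:
        out.append((src, inp, outp, trg))
        if inp.islower():
            # extra transition accepting the uppercase input symbol
            out.append((src, inp.upper(), outp, trg))
    return out
```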
lttoolbox section type=aftercompounding
For e.g. proper noun regexes that you want to prioritise lower than compounds. Currently, if you have a proper noun regex, it'll match compounds at the start of sentences.
This could probably be implemented by checking section type only when we've reached a final state.
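In other words, something like the following selection logic when collecting analyses for one surface form (just a sketch; the section-type names are taken from this wish):

```python
def pick_analyses(finals):
    """finals: (analysis, section_type) pairs reached in final states.
    Analyses from an 'aftercompounding' section are only kept when no
    regular section matched, so compounds win over proper-noun regexes."""
    regular = [a for (a, s) in finals if s != "aftercompounding"]
    return regular if regular else [a for (a, _) in finals]
```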
call-macro with-param var?
Both sme-nob and nno-nob have lots of postchunk rules that match names like "adj_adj_n_n" – meaning a four-word chunk, two adjectives followed by two nouns (due to compounding in this case). So t1x has to ensure the chunk gets the right name. When t1x matches the input "adj adj n n", it may add or remove words like "more/most" if we're translating to/from synthetic adjectives, so for that input it can output "adj adj adj n n" (added "more/most"), "adj adj n n", or "adj n n" (removed "more/most").
Keeping track of this is a chore (seems like something a computer ought to be able to help with).
One option would be to have a special attribute like <chunk namegen="tags"> that created a chunk name based on the first tag of each lexical unit inside <chunk>. That's a bit non-general though.
Another option is to allow sending variables as arguments to macros, then the language pair could have a macro gen_chunk_name that takes the list of clips and vars and sets the chunk name variable.
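A hypothetical gen_chunk_name macro, sketched in Python: build the chunk name from the first tag of each lexical unit, so the names in t1x and t3x can never drift apart. The tag-to-name mapping shown is an assumption.

```python
def gen_chunk_name(lexical_units):
    """lexical_units: analyses like 'more<preadv>' or 'film<n><pl>'.
    Joins the first tag of each unit with underscores, e.g. giving
    'adj_adj_n_n' for two adjectives followed by two nouns."""
    first_tags = [lu.split("<", 1)[1].split(">", 1)[0]
                  for lu in lexical_units]
    return "_".join(first_tags)
```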
lt-proc mode for adding analyses to already analysed text
(or for combining several analysers at once)
Expected usage:
$ echo '^already/*already$ ^analysed/analyse<vblex><pp>$' | lt-proc --add-analyses en.automorf.bin
^already/already<adv>$ ^analysed/analyse<vblex><pp>/analysed<adj>$
This should be fairly easy to implement in fst_processor.cc.
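The merging itself might look something like this sketch (Python rather than fst_processor.cc, and the lexicon is just a dict here): unknowns ('*...') get replaced, known forms get extra analyses appended.

```python
def add_analyses(lu, lexicon):
    """lu: one lexical unit in stream format, e.g. '^already/*already$'.
    lexicon: dict from lowercased surface form to a list of analyses."""
    parts = lu.strip("^$").split("/")
    surface, analyses = parts[0], parts[1:]
    if analyses and analyses[0].startswith("*"):
        analyses = []                      # previously unknown word
    for a in lexicon.get(surface.lower(), []):
        if a not in analyses:              # don't duplicate analyses
            analyses.append(a)
    if not analyses:
        analyses = ["*" + surface]         # still unknown
    return "^" + "/".join([surface] + analyses) + "$"
```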
Whatever for?
- You can use several CPUs at once by chaining analysers:
bzcat corpus.bz2 | lt-proc analysis-and-tokenisation.bin | lt-proc --add-analyses huge-slightly-shoddy-lexicon.bin | lt-proc --add-analyses huge-wikipedia-propername-lexicon.bin | …
- You can compile enormous lexicons in multiple pieces.
- You can lt-expand those pieces.
- http://comments.gmane.org/gmane.comp.nlp.apertium/1099 / http://comments.gmane.org/gmane.comp.nlp.apertium/1100
- Or, as suggested, you can make lt-proc combine multiple analysers, as it in a sense already does (but then I guess you wouldn't get to use all your CPUs)
However:
You'd have to be careful about multiword expressions. If the first analyser gives
^take/take<vblex>$ ^out/out<prep>$
and the second analyser wants to analyse that as one lexical unit, the best thing would probably be to discard the individual analyses and keep only the multiword one; ideally, though, only the first analyser should contain multiwords so that situation never arises.
Probably there are other tokenisation pitfalls too.
Fallthrough option in transfer
Sometimes you match an input pattern in a rule, e.g. "n vblex", then check whether the target-language noun has some feature, and only if it has that feature do you do something special with it. It would be great if we could specify in the <otherwise>
that we want to fall through, ignoring that this rule matched.
There are two options for how to "ignore": the best (but possibly slowest?) would be to go on trying to match the rest of the rules; the other option is to act as if no rules matched. Either would be an improvement.
- This has been implemented in jimregan's exception patch
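The first ("keep trying") semantics could be sketched like this, with each rule as a (pattern, action) pair and None as the hypothetical fall-through signal:

```python
def apply_rules(rules, pattern, words):
    """Try each matching rule in order; an action returning None means
    'fall through', so later rules still get a chance.  If nothing
    (fully) applies, the input passes through unchanged, which is also
    what the second, 'act as if no rules matched' option would give."""
    for (pat, action) in rules:
        if pat == pattern:
            result = action(words)
            if result is not None:   # None signals fall-through
                return result
    return words
```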
Keep surface ("superficial") forms at least until transfer
Right now, all steps of the pipeline up until apertium-tagger support keeping the surface forms along with the lemma:
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin | apertium-tagger -p -g nb-nn.prob
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
(The -w switch to lt-proc makes sure the lemma has the same typographical case as given by the dictionary.)
It would be useful to have surface form and lemma separate in apertium-transfer too; mostly because we would then be able to avoid all those horrible hacks with trying to maintain typographical case.
Consider:
- C-vitaminets effekt => Effekten til C-vitaminet
- Vitaminets effekt => Effekten til vitaminet
The reason for keeping the case on "C-vitaminet" but not "Vitaminet" should be that the lemma is capitalised. However, before transfer, the case from surface form is applied to the lemma, and we don't know whether it was there from before or not. This is the input to the transfer module:
^C-vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$
^Vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$
So how can you avoid *"Effekten til Vitaminet" or *"Effekten til c-vitaminet"? (At the moment, this is dealt with in nn-nb by using only lowercase lemmata for stuff like "C-vitamin", and RL entries which apply correct capitalisation -- not very pretty, and pardefs don't really help here.)
- See how it is done in is-en with gentilics, e.g. "English-speaking", etc. - Francis Tyers 19:56, 11 March 2010 (UTC)
- Switched to that method as it's slightly better, but still...
<e lm="BCG-vaksine"><par n="Bb"/><par n="Cc"/><par n="Gg"/>-vaksin<par n="r/e__n"/></e>
--unhammer 08:26, 12 March 2010 (UTC)
Solution:
If transfer could read
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
^Vitaminets/vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
then we could keep the capitalisation on C-vitamin because we see that the lemma has capitalisation, while we change "Vitamin" to "vitamin" since the lemma is regular lowercased.
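With both forms available, the decision rule from the example above is trivial to state (a sketch, not anything transfer currently does):

```python
def keep_capitalisation(surface, lemma):
    """Keep the surface capitalisation only when the lemma itself
    carries it; otherwise the capital came from sentence position and
    should not survive reordering."""
    return surface[:1].isupper() and lemma[:1].isupper()
```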
Other considerations:
The transfer.dtd would of course need a new attribute like part="sform".
By interchunk I guess we can throw away the surface form.
- 640K should be enough for anyone.
apertium-pretransfer changes
^ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$
into
^ombud<n><nt><sg><ind><ep-s>$ ^kvinne<n><f><sg><ind>$
So, should
^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$
become
^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^Ombudskvinne/kvinne<n><f><sg><ind>$
or
^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$
?
Should
^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$
become
^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$
or
^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^OMBUDSKVINNE/kvinne<n><f><sg><ind>$
?
- If you, in transfer, know that they used to be part of the same lexical unit in the source language, this probably doesn't matter too much.
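One possible answer to the questions above, sketched in Python; the lowercasing policy here (lowercase the surface copies for non-first parts, but leave all-caps surfaces alone) is my assumption, not settled behaviour:

```python
def pretransfer_split(lu):
    """Split a '+'-joined analysis into separate lexical units, copying
    the surface form onto each part."""
    surface, analysis = lu.strip("^$").split("/", 1)
    out = []
    for i, part in enumerate(analysis.split("+")):
        s = surface
        if i > 0 and not surface.isupper():
            s = surface.lower()   # only the first part keeps the capital
        out.append("^%s/%s$" % (s, part))
    return " ".join(out)
```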
Allow the chunk tag wherever we allow other "strings"
Implemented by sortiz as of r39106.
<chunk name="foo"><tags><tag><lit-tag v="bar"/></tag></tags><lu><lit v="fie"/></lu></chunk>
just outputs ^foo<bar>{fie}$ -- a simple string. We can have strings from tags, literals and variables inside variables, but not with the chunk tag, leading to this kind of mess:
<let>
  <concat>
    <lit v="^pron"/>
    <lit-tag v="@SUBJ→"/>
    <clip pos="1" part="pers"/>
    <lit-tag v="GD"/>
    <clip pos="1" part="nbr"/>
    <lit-tag v="nom"/>
    <lit v="{^"/>
    <lit v="prpers"/>
    <lit-tag v="prn"/>
    <clip pos="1" part="pers"/>
    <lit-tag v="mf"/>
    <clip pos="1" part="nbr"/>
    <lit-tag v="nom"/>
    <lit v="$}$"/>
  </concat>
</let>
Wish: allow <let><chunk>...</chunk></let> and <concat><chunk>...</chunk></concat> (chunk "returns" a string, variables hold strings).
Allow "postchunking" of chunks in interchunk
When you want to merge chunks in interchunk it would be nice to be able to collapse the tags of non-head chunks.
For example, if we want to do SN PREP SN ("The 10 most popular films in American cinemas"), we get:
t1x: ^Det_num_adj_nom<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$}$ ^í<PREP>{^in<pr>$}$ ^adj_nom<SN><@X><pl>{^American<adj>$ ^cinema<n><3>$}$
t2x: ^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><3>$}$
The <3> is replaced with <pl> for both (merged) chunks in postchunk. In this case 'pl' is the same in both, but if not, it would be nice to be able to do something like
<rule comment="REGLA: SN PREP SN">
  <pattern>
    <pattern-item n="SN"/>
    <pattern-item n="PREP"/>
    <pattern-item n="SN"/>
  </pattern>
  <action>
    <out>
      <chunk>
        <lit v="sn_prep_sn"/>
        <clip pos="1" part="tags"/>
        <lit v="{"/>
        <clip pos="1" part="content"/>
        <b pos="1"/>
        <clip pos="2" part="content"/>
        <b pos="2"/>
        <merge-tags>
          <clip pos="3" part="tags"/>
          <clip pos="3" part="content"/>
        </merge-tags>
        <lit v="}"/>
      </chunk>
    </out>
  </action>
</rule>
so that we get
^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><pl>$}$
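What <merge-tags> would have to do, sketched in Python (the element doesn't exist; this just resolves the positional tag references against the non-head chunk's own tags, which is otherwise postchunk's job):

```python
import re

def merge_tags(chunk_tags, content):
    """chunk_tags: e.g. ['SN', '@X', 'pl'].  Replaces references like
    <3> in the chunk content with the corresponding tag (1-indexed)."""
    def resolve(m):
        return "<" + chunk_tags[int(m.group(1)) - 1] + ">"
    return re.sub(r"<(\d+)>", resolve, content)
```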
A "grouping" tag for bidix
Most of the time when LR-ing and RL-ing in bidix, we have one pair of entries that works in both directions, with possibly lots of LR's that all go to the same <r>, or lots of RL's that all go to the same <l>. Making certain these actually _do_ go to the same one, where they should, means looking through lots of entries manually, since in some cases we _don't_ want it to be like that (i.e. we can't just write a program to check this, since there are general rules and there are exceptions).
What I'd like is just some way of keeping LR's and RL's in bidix together. One possibility would be to represent it this way:
<eg>
  <em><p><l>foo</l><r>bar</r></p></em>
  <LR><p><l>fie</l></p></LR>
  <RL><p><r>bum</r></p></RL>
</eg>
<e r="LR"><p><l>foe</l><r>baz</r></p></e>
This would be equivalent to:
<e>        <p><l>foo</l><r>bar</r></p></e>
<e r="LR"> <p><l>fie</l><r>bar</r></p></e>
<e r="RL"> <p><l>foo</l><r>bum</r></p></e>
<e r="LR"> <p><l>foe</l><r>baz</r></p></e>
The idea is that within the <eg> entries, we know that all LR's have the same <r>, and all RL's have the same <l>, and so an LR can't have an <r> specified.
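The expansion of an <eg> group into plain entries is mechanical; here's a sketch with entries modelled as (l, r, restriction) tuples rather than XML:

```python
def expand_eg(em, lrs, rls):
    """em: the bidirectional (l, r) pair from <em>; lrs: alternative
    <l>-only sides; rls: alternative <r>-only sides."""
    l, r = em
    entries = [(l, r, None)]
    entries += [(alt_l, r, "LR") for alt_l in lrs]  # all LRs share em's <r>
    entries += [(l, alt_r, "RL") for alt_r in rls]  # all RLs share em's <l>
    return entries
```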
option to output pardefs in lt-expand
Sometimes you want to see which pardefs an entry uses; you can do that with a patch like
Index: lttoolbox/expander.cc
===================================================================
--- lttoolbox/expander.cc	(revision 21713)
+++ lttoolbox/expander.cc	(working copy)
@@ -366,6 +366,8 @@
   else if(name == Compiler::COMPILER_PAR_ELEM)
   {
     wstring p = procPar();
+    fputws_unlocked(p.c_str(), output);
+    fputwc_unlocked(L'\t', output);
     // detección del uso de paradigmas no definidos
     if(paradigm.find(p) == paradigm.end() &&
but it'd be cool to have a command-line option to lt-expand that does this. Also, it shouldn't output pardef names if there's no output from the <e>.