From Apertium
Revision as of 06:49, 16 April 2018 by Unhammer (talk | contribs) (→‎corpus-testing makefile)

My wishlist for Apertium features (mostly just useful for language pair developers).

See also Talk:Northern Sámi and Norwegian#Wishlist / Difficulties with the architecture / Ugly_hacks

some error handling in lt-comp

This really ought to fail:

$ lt-comp lr nonexistentfile foo.bin
main@standard 1 0
$ echo $?

This ought to fail with a warning and a line number:

$ lt-comp lr whitespace-beginning.dix foo.bin
main@standard 20 22
$ echo 'meh' | lt-proc foo.bin
Error: Invalid dictionary (hint: entry beginning with whitespace)
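
Until lt-comp does this itself, a standalone pre-check can at least report line numbers. A rough sketch, assuming the problem is an <l> or <i> whose content starts with whitespace or <b/> (that is my reading of the lt-proc hint, not something taken from the lttoolbox source); find_whitespace_entries is a made-up name:

```python
import xml.parsers.expat


def find_whitespace_entries(dix_xml):
    """Return line numbers of <l>/<i> elements starting with whitespace or <b/>."""
    problems = []
    state = {"inside": False, "at_start": False}
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name in ("l", "i"):
            state["inside"] = True
            state["at_start"] = True
        elif state["inside"] and state["at_start"]:
            if name == "b":  # <b/> is how .dix files write a blank
                problems.append(parser.CurrentLineNumber)
            state["at_start"] = False

    def end(name):
        if name in ("l", "i"):
            state["inside"] = False
            state["at_start"] = False

    def chars(data):
        if state["inside"] and state["at_start"]:
            if data[:1].isspace():
                problems.append(parser.CurrentLineNumber)
            state["at_start"] = False

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(dix_xml, True)
    return problems


sample = "\n".join([
    "<dictionary>",
    '<section id="main" type="standard">',
    "<e><p><l>ok</l><r>ok</r></p></e>",
    "<e><p><l> bad</l><r>bad</r></p></e>",
    "<e><i><b/>worse</i></e>",
    "</section>",
    "</dictionary>",
])
for line in find_whitespace_entries(sample):
    print("warning: entry beginning with whitespace, line %d" % line)
```

This doesn't know about pardefs or LR/RL direction, so it may warn about entries that are actually fine.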

lttoolbox section type=samecase

For words where you don't want uppercase input to match lowercase forms.

For the postblank/preblank/inconditional sections, we only check the section type when we've reached a final state; during step application they're considered one and the same transducer (unioned with an epsilon transition from root).

For a "samecase" section we'd have to have two current_state State pointers, e.g. State *current_state_regular and State *current_state_samecase (and similarly two initial states).

Alternatively, we could run "lt-proc --case-sensitive" and preprocess the non-samecase sections so that every a:a transition is turned into [aA]:a. That should make it possible to have samecase not as a section type, but as another attribute on a section, e.g. casefold="false".
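
The preprocessing step could look roughly like this. It's only a sketch: the 4-tuple transition representation and the name casefold_transitions are made up for illustration and have nothing to do with how lttoolbox stores its transducers.

```python
def casefold_transitions(transitions):
    """Add an uppercase-input arc next to every lowercase-letter arc.

    Hypothetical representation: each transition is a tuple
    (source_state, input_symbol, output_symbol, target_state).
    """
    expanded = []
    for src, inp, out, dst in transitions:
        expanded.append((src, inp, out, dst))
        upper = inp.upper()
        # Only single lowercase letters with a one-character uppercase form
        # get an extra arc; tags like "<n>" are left alone.
        if len(inp) == 1 and inp.islower() and len(upper) == 1:
            expanded.append((src, upper, out, dst))
    return expanded


regular_section = [(0, "a", "a", 1), (1, "<n>", "<n>", 2)]
print(casefold_transitions(regular_section))
```

Sections marked casefold="false" would simply skip this step, so with case-sensitive matching they only accept input in the case given in the dictionary.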

lttoolbox section type=aftercompounding

For e.g. proper noun regexes that you want to prioritise lower than compounds. Currently, if you have a proper noun regex, it'll match compounds at the start of sentences.

This could probably be implemented by checking section type only when we've reached a final state.


Having compound-only-L and compound-R is silly; you end up using pardefs anyway.

call-macro with-param var?

Both sme-nob and nno-nob have lots of postchunk rules that match names like "adj_adj_n_n" – this means a four-word chunk, two adjs followed by two nouns (due to compounding in this case). So t1x has to ensure the chunk has the right name. When t1x matches input "adj adj n n", it may add/remove words like "more/most" in t1x if we're translating to/from synthetic adj's, so for that input we can output "adj adj adj n n" (added more/most), or "adj adj n n", or "adj n n" (removed "more/most").

Keeping track of this is a chore (seems like something a computer ought to be able to help with).

One option would be to have a special attribute like <chunk namegen="tags"> that created a chunk name based on the first tag of each lexical unit inside <chunk>. That's a bit non-general though.

Another option is to allow sending variables as arguments to macros, then the language pair could have a macro gen_chunk_name that takes the list of clips and vars and sets the chunk name variable.
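
Either way, the name generation itself is simple. A sketch of what such a macro (or a namegen="tags" attribute) would compute, assuming chunk names are just the first tag of each lexical unit joined with underscores; gen_chunk_name is the hypothetical macro name from above and the input words are made up:

```python
import re

LU = re.compile(r"\^([^$]*)\$")      # one ^...$ lexical unit (escapes not handled)
FIRST_TAG = re.compile(r"<([^>]+)>")


def gen_chunk_name(chunk_content):
    """Build a chunk name from the first tag of each lexical unit."""
    names = []
    for lu in LU.findall(chunk_content):
        match = FIRST_TAG.search(lu)
        names.append(match.group(1) if match else "unknown")
    return "_".join(names)


print(gen_chunk_name("^stor<adj><sint>$ ^raud<adj>$ ^hus<n><nt>$ ^dør<n><f>$"))
```

If t1x adds or removes a word before output, the name follows along automatically, which is the whole point.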

lt-proc mode for adding analyses to already analysed text

(or for combining several analysers at once)

Expected usage:

$ echo '^already/*already$' '^analysed/analyse<vblex><pp>$' | lt-proc --add-analyses en.automorf.bin
^already/already<adv>$ ^analysed/analyse<vblex><pp>/analysed<adj>$

This should be fairly easy to implement in

Whatever for?

  • You can use several CPUs at once by chaining analysers:
    bzcat corpus.bz2 | lt-proc analysis-and-tokenisation.bin | lt-proc --add-analyses huge-slightly-shoddy-lexicon.bin | lt-proc --add-analyses huge-wikipedia-propername-lexicon.bin | …
  • You can compile enormous lexicons in multiple pieces.
  • You can lt-expand those pieces.
    • Or, as suggested, you can make lt-proc combine multiple analysers, as it in a sense already does (but then I guess you wouldn't get to use all your CPUs).


You'd have to be careful about mwe's. If you first have

^take/take<vblex>$ ^out/out<prep>$

and then the second analyser wanted to analyse that as one lexical unit, the best thing would probably be to discard the individual analyses and just keep the mwe one; however, ideally only the first analyser should contain mwe's so you don't get that situation at all.

Probably there are other tokenisation pitfalls too.
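
Ignoring the tokenisation problem entirely, the per-token merge that --add-analyses would do is roughly the following. A sketch only: the lookup function stands in for the second analyser, add_analyses is a made-up name, and escaped slashes in the stream format are not handled.

```python
import re

LU = re.compile(r"\^([^/$]*)/([^$]*)\$")   # ^surface/analysis1/analysis2$


def add_analyses(stream, lookup):
    """Add analyses from lookup(surface) to each already analysed unit."""
    def merge(match):
        surface = match.group(1)
        old = match.group(2).split("/")
        new = lookup(surface)
        if new:
            # The second analyser knows the form, so drop the *unknown marker.
            old = [a for a in old if not a.startswith("*")]
        merged = old + [a for a in new if a not in old]
        return "^" + surface + "/" + "/".join(merged) + "$"
    return LU.sub(merge, stream)


second_lexicon = {
    "already": ["already<adv>"],
    "analysed": ["analysed<adj>", "analyse<vblex><pp>"],
}
print(add_analyses("^already/*already$ ^analysed/analyse<vblex><pp>$",
                   lambda surface: second_lexicon.get(surface, [])))
```

Since it only ever looks at one existing unit at a time, it can never produce a new mwe, which sidesteps the problem above rather than solving it.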

Fallthrough option in transfer

Sometimes you match an input pattern in a rule, e.g. "n vblex", and you check whether the target-language n has some feature, and only if it has that feature do you do something special with it. It would be great if we could specify in the <otherwise> that we want to fall through, ignoring that this rule matched.

There are two options for how to "ignore": the best (but possibly slowest?) would be to go on trying to match the rest of the rules; the other option is to act as if no rules matched. Both would be an improvement.
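
A sketch of the first option (keep trying the remaining rules), under heavy simplification: rules are tried in file order instead of by longest match, words are plain dicts, and FALLTHROUGH and apply_rules are made-up names. This is not how apertium-transfer is structured internally.

```python
FALLTHROUGH = object()   # sentinel an action returns to say "pretend I didn't match"


def apply_rules(rules, words):
    """Try rules in order; return (output, number_of_words_consumed)."""
    for pattern, action in rules:
        window = words[:len(pattern)]
        if len(window) == len(pattern) and all(
                tag in w["tags"] for tag, w in zip(pattern, window)):
            result = action(window)
            if result is not FALLTHROUGH:
                return result, len(pattern)
    # The second option above would jump straight here on fallthrough.
    return [words[0]["lemma"]], 1


def n_vblex(window):
    noun, verb = window
    if "def" not in noun["tl_tags"]:
        return FALLTHROUGH            # the <otherwise> case
    return ["special:" + noun["lemma"], verb["lemma"]]


def n_only(window):
    return ["plain:" + window[0]["lemma"]]


rules = [(["n", "vblex"], n_vblex), (["n"], n_only)]
words = [
    {"lemma": "hus", "tags": ["n"], "tl_tags": ["n", "ind"]},
    {"lemma": "stå", "tags": ["vblex"], "tl_tags": ["vblex"]},
]
print(apply_rules(rules, words))
```

With the indefinite noun, the two-word rule falls through and the one-word rule handles "hus" on its own.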

This has been implemented in jimregan's exception patch.

tl-lemma/tl-tags attributes in t1x def-cat's (pattern-items)

Now that bidix happens before t1x, apertium-transfer ought to be able to match on the full source-and-target-language input.

However, def-cat's are turned into an FST which matches on only the source part of input, so it might be non-trivial (at least if we want to allow restrictions on *both* source and target side in one def-cat, e.g. <def-cat lemma="cheese" tl-lemma="käse"> – it might be easier if we can make do with <def-cat lemma="käse" side="tl">).

Keep surface ("superficial") forms at least until transfer

Right now, all steps of the pipeline up until apertium-tagger support keeping the surface forms along with the lemma:

$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin 
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
$ echo C-vitaminets effekt | lt-proc -w nb-nn.automorf.bin | cg-proc nb-nn.rlx.bin | apertium-tagger -p -g nb-nn.prob 
^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$

(The -w switch to lt-proc makes sure the lemma has the same typographical case as given by the dictionary.)

It would be useful to have surface form and lemma separate in apertium-transfer too; mostly because we would then be able to avoid all those horrible hacks with trying to maintain typographical case.


  • C-vitaminets effekt => Effekten til C-vitaminet
  • Vitaminets effekt => Effekten til vitaminet

The reason for keeping the case on "C-vitaminet" but not "Vitaminet" should be that the lemma is capitalised. However, before transfer, the case from surface form is applied to the lemma, and we don't know whether it was there from before or not. This is the input to the transfer module:

  • ^C-vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$
  • ^Vitamin<n><nt><sg><def><gen>$ ^effekt<n><m><sg><ind>$

So how can you avoid *"Effekten til Vitaminet" or *"Effekten til c-vitaminet"? (At the moment, this is dealt with in nn-nb by using only lowercase lemmata for stuff like "C-vitamin", and RL entries which apply correct capitalisation -- not very pretty, and pardefs don't really help here.)

See how it is done in is-en with gentilics, e.g. "English-speaking", etc. - Francis Tyers 19:56, 11 March 2010 (UTC)
Switched to that method as it's slightly better, but still... <e lm="BCG-vaksine"><par n="Bb"/><par n="Cc"/><par n="Gg"/>-vaksin<par n="r/e__n"/></e> --unhammer 08:26, 12 March 2010 (UTC)


If transfer could read

  • ^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$
  • ^Vitaminets/vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$

then we could keep the capitalisation on C-vitamin because we see that the lemma itself is capitalised, while we change "Vitamin" to "vitamin" since the lemma is plain lowercase.
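
To make the decision concrete, here's a sketch that works on lemmas only (no bidix, no generation, so you get "Effekt" rather than "Effekten"). The function name reorder_keep_case and the hard-coded two-word genitive swap are just for illustration:

```python
import re

# ^surface/lemma<tags>$ (escapes and multiple analyses not handled)
LU = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)*)\$")


def reorder_keep_case(stream):
    """Turn 'X-gen Y' into 'Y til X', deciding case from surface vs. lemma."""
    (surface1, lemma1, _), (_, lemma2, _) = [m.groups() for m in LU.finditer(stream)]
    # A capitalised first surface form means the phrase was sentence-initial.
    sentence_initial = surface1[:1].isupper()
    first = lemma2[:1].upper() + lemma2[1:] if sentence_initial else lemma2
    # lemma1 is in dictionary case, so it is capitalised only if the
    # dictionary says so ("C-vitamin" yes, "vitamin" no).
    return [first, "til", lemma1]


print(reorder_keep_case(
    "^C-vitaminets/C-vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$"))
print(reorder_keep_case(
    "^Vitaminets/vitamin<n><nt><sg><def><gen>$ ^effekt/effekt<n><m><sg><ind>$"))
```

The sentence-initial capital comes from the surface form and the lemma's own capital comes from the dictionary, so the two never get mixed up.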

Other considerations:

The transfer.dtd would of course need a new attribute like part="sform".

By interchunk I guess we can throw away the surface form.

640K should be enough for anyone.

apertium-pretransfer changes ^ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ into ^ombud<n><nt><sg><ind><ep-s>$ ^kvinne<n><f><sg><ind>$.

So, should

  • ^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ become
  • ^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^Ombudskvinne/kvinne<n><f><sg><ind>$ or
  • ^Ombudskvinne/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$?


And should

  • ^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>+kvinne<n><f><sg><ind>$ become
  • ^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^ombudskvinne/kvinne<n><f><sg><ind>$ or
  • ^OMBUDSKVINNE/ombud<n><nt><sg><ind><ep-s>$ ^OMBUDSKVINNE/kvinne<n><f><sg><ind>$?
If you, in transfer, know that they used to be part of the same lexical unit in the source language, this probably doesn't matter too much.

Allow the chunk tag wherever we allow other "strings"

Implemented by sortiz as of r39106.

<chunk name="foo"><tags><tag><lit-tag v="bar"/></tag></tags><lu><lit v="fie"/></lu></chunk> just outputs ^foo<bar>{^fie$}$ -- a simple string. We can build strings from tags, literals and variables inside variables, but not with the chunk tag, leading to this kind of mess:

             <lit v="^pron"/>
             <lit-tag v="@SUBJ→"/>
             <clip pos="1" part="pers"/>
             <lit-tag v="GD"/>
             <clip pos="1" part="nbr"/>
             <lit-tag v="nom"/>
             <lit v="{^"/>
             <lit v="prpers"/>
             <lit-tag v="prn"/>
             <clip pos="1" part="pers"/>
             <lit-tag v="mf"/>
             <clip pos="1" part="nbr"/>
             <lit-tag v="nom"/>
             <lit v="$}$"/>

Wish: allow <let><chunk>...</chunk></let> and <concat><chunk>...</chunk></concat> (chunk "returns" a string, variables hold strings).

Allow "postchunking" of chunks in interchunk

When you want to merge chunks in interchunk it would be nice to be able to collapse the tags of non-head chunks.

For example, if we want to do: SN PREP SN. "The 10 most popular films in American cinemas", we get:

^Det_num_adj_nom<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$}$ 
^adj_nom<SN><@X><pl>{^American<adj>$ ^cinema<n><3>$}$

^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><3>$}$

The <3> is replaced with <pl> for both (merged) chunks in postchunk. In this case 'pl' is the same in both, but if not, it would be nice to be able to do something like

    <rule comment="REGLA: SN PREP SN">
      <pattern>
        <pattern-item n="SN"/>
        <pattern-item n="PREP"/>
        <pattern-item n="SN"/>
      </pattern>
      <action>
        <out>
          <chunk>
            <lit v="sn_prep_sn"/>
            <clip pos="1" part="tags"/>
            <lit v="{"/>
              <clip pos="1" part="content"/>
              <b pos="1"/>
              <clip pos="2" part="content"/>
              <b pos="2"/>
                <clip pos="3" part="tags"/>
                <clip pos="3" part="content"/>
            <lit v="}"/>
          </chunk>
        </out>
      </action>
    </rule>

so that we get

 ^sn_prep_sn<SN><@X><pl>{^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$ ^in<pr>$ ^American<adj>$ ^cinema<n><pl>$}$
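
What interchunk would have to do for the non-head chunks is basically what postchunk already does: replace the numbered references with that chunk's own tags before the content is merged. A sketch, with made-up function names, and with the PREP chunk's tags assumed since they aren't shown above:

```python
import re


def resolve_refs(content, chunk_tags):
    """Replace <1>, <2>, ... in chunk content with the chunk's own tags."""
    def repl(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(chunk_tags):
            return "<" + chunk_tags[index] + ">"
        return match.group(0)   # leave out-of-range references untouched
    return re.sub(r"<(\d+)>", repl, content)


def merge_chunks(name, head, others):
    """Merge chunks; only the head keeps its numbered references."""
    head_tags, head_content = head
    parts = [head_content] + [resolve_refs(c, tags) for tags, c in others]
    tag_string = "".join("<%s>" % t for t in head_tags)
    return "^" + name + tag_string + "{" + " ".join(parts) + "}$"


head = (["SN", "@X", "pl"],
        "^the<det><def><3>$ ^10<num>$ ^most<preadv>$ ^popular<adj>$ ^film<n><3>$")
others = [(["PREP"], "^in<pr>$"),   # assumed tags for the PREP chunk
          (["SN", "@X", "pl"], "^American<adj>$ ^cinema<n><3>$")]
print(merge_chunks("sn_prep_sn", head, others))
```

If the second SN had been <sg>, "cinema" would come out as <sg> while "film" still follows the merged chunk's <pl>.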

A "grouping" tag for bidix

Most of the time when LR-ing and RL-ing in bidix, we have one pair of entries that work in both directions, with possibly lots of LR's that all go to the same <r>, or lots of RL's that all go to the same <l>. Making certain these actually _do_ go to the same side where they should means looking through lots of entries manually, since in some cases we _don't_ want it to be like that (i.e. we can't just write a program to check this, since there are general rules and there are exceptions).

What I'd like is just some way of keeping LR's and RL's in bidix together. One possibility would be to represent it this way:

 <eg>
   <em>       <p><l>foo</l><r>bar</r></p></em>
   <LR>        <p><l>fie</l>                    </p></LR>
   <RL>        <p>                  <r>bum</r></p></RL>
 </eg>
 <e r="LR"><p><l>foe</l><r>baz</r></p></e>

This would be equivalent to:

 <e>           <p><l>foo</l><r>bar</r></p></e>
 <e r="LR"><p><l>fie</l><r>bar</r></p></e>
 <e r="RL"><p><l>foo</l><r>bum</r></p></e>
 <e r="LR"><p><l>foe</l><r>baz</r></p></e>

The idea is that within the <eg> entries, we know that all LR's have the same <r>, and all RL's have the same <l>, and so an LR can't have an <r> specified.
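
Expanding such a group back into ordinary entries is mechanical. A sketch, with made-up function names and with the <s n="..."/> tags left out just like in the examples above:

```python
def expand_group(main, lr_lefts=(), rl_rights=()):
    """Expand one group into (restriction, left, right) entries."""
    left, right = main
    entries = [(None, left, right)]
    # Every LR in the group shares the main entry's <r> ...
    entries += [("LR", lr_left, right) for lr_left in lr_lefts]
    # ... and every RL shares the main entry's <l>.
    entries += [("RL", left, rl_right) for rl_right in rl_rights]
    return entries


def to_bidix(entries):
    lines = []
    for restriction, left, right in entries:
        attr = ' r="%s"' % restriction if restriction else ""
        lines.append("<e%s><p><l>%s</l><r>%s</r></p></e>" % (attr, left, right))
    return "\n".join(lines)


print(to_bidix(expand_group(("foo", "bar"), lr_lefts=["fie"], rl_rights=["bum"])))
```

Because an LR inside the group has nowhere to put its own <r>, the "do these all go to the same place?" check comes for free.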

Or better, selimcan's multidix idea:

       <bak r="NG">bbb</bak>  <!-- NG=no-gen, analyse bbb into ttt, but don't translate ttt into bbb -->
       <bak r="NA">aaa</bak>  <!-- NA=no-ana, don't translate aaa into ttt, but do generate aaa when translating ttt -->

 <e r="LR"><l>ttt</><r>aaa</> 
 <e r="RL"><l>ttt</><r>bbb</> 
 <e r="LR"><l>uuu</><r>aaa</> 
 <e r="RL"><l>uuu</><r>bbb</> 

 and you get cartesian products the expected way, e.g.

       <bak r="NG">bbb</bak>
       <bak r="NA">aaa</bak>

 <e r="LR"><l>ttt</><r>aaa</> 
 <e r="RL"><l>ttt</><r>bbb</> 
 <e r="LR"><l>ttt2</><r>aaa</> 
 <e r="RL"><l>ttt2</><r>bbb</> 
 <e r="LR"><l>uuu</><r>aaa</> 
 <e r="RL"><l>uuu</><r>bbb</> 

option to output pardefs in lt-expand

Sometimes you want to see what pardefs an entry uses. You can do that with a patch like this:

Index: lttoolbox/
--- lttoolbox/       (revision 21713)
+++ lttoolbox/       (working copy)
@@ -366,6 +366,8 @@
     else if(name == Compiler::COMPILER_PAR_ELEM)
       wstring p = procPar();
+      fputws_unlocked(p.c_str(), output);
+      fputwc_unlocked(L'\t', output);
       // detección del uso de paradigmas no definidos
       if(paradigm.find(p) == paradigm.end() &&

but it'd be cool to have a command line option to lt-expand to do this. Also, it shouldn't output pardef names if there's nothing output from the <e>.

lt-expand query tool

Oftentimes, late at night, I will wonder: how many lemmas in nno.dix are verb-noun ambiguous for more than two forms?

It'd be cool to do

 lt-expand-select "l1,l2 if l1.pos=='vblex' and l2.pos=='n' and len(intersection(l1.forms, l2.forms)) > 2" nno.dix
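
Until there's a real query language, that particular question can be answered with a small script over lt-expand output. A sketch, assuming the usual "form:lemma<tags>" lines (with ":>:" or ":<:" for direction-restricted entries); select_ambiguous is a made-up name, the demo forms are abstract placeholders, and lines that don't end in tags are simply skipped:

```python
import re
from collections import defaultdict

# form:lemma<tags>, form:>:lemma<tags> or form:<:lemma<tags>
LINE = re.compile(r"^(.*?):(?:[<>]:)?([^<:]*)((?:<[^>]+>)+)$")


def select_ambiguous(lines, pos1="vblex", pos2="n", more_than=2):
    """Map (lemma1, lemma2) to shared forms, for pairs sharing > more_than forms."""
    by_form = defaultdict(lambda: (set(), set()))
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        form, lemma, tags = match.groups()
        pos = tags[1:tags.index(">")]          # first tag is the part of speech
        if pos == pos1:
            by_form[form][0].add(lemma)
        elif pos == pos2:
            by_form[form][1].add(lemma)
    shared = defaultdict(set)
    for form, (lemmas1, lemmas2) in by_form.items():
        for l1 in lemmas1:
            for l2 in lemmas2:
                shared[(l1, l2)].add(form)
    return {pair: forms for pair, forms in shared.items() if len(forms) > more_than}


demo = [
    "a:x<vblex><inf>", "b:x<vblex><pres>", "c:>:x<vblex><pret>",
    "a:y<n><sg>", "b:y<n><pl>", "c:y<n><pl><def>", "d:y<n><sg><def>",
    "a:z<n><sg>",
]
print(select_ambiguous(demo))
```

You'd run it as something like lt-expand nno.dix piped into a script that passes sys.stdin to select_ambiguous.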

corpus-testing makefile

Say you're doing before-after word diffs on a corpus when testing some new thing in t1x, so to save time you translate the corpus up to bidix, then when messing with t1x you just have to translate from t1x and on to dgen. Faster debug cycle. But then you suddenly have to change something in bidix. Or analyser. So you have to go back to the command that translated up to that point. Really, this seems like something a Makefile would be suited for, e.g.

edit *t1x
make -f corptest.make # first run runs full translation, but saves intermediate output
edit *t1x
make -f corptest.make # now it just re-runs from t1x onwards
edit *bi.dix 
make -f corptest.make # now it re-runs from bidix onwards

(This is actually implemented for wiki-based regression tests in apertium-sme-nob if you export AP_LAZY=true. See t/lazytranslate and t/translate.make. Not yet done for corpus tests, but it should be easy to build on that.)
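
The rule make would apply here is just "re-run from the first stage whose output is older than what it depends on". A sketch of that decision in Python, with hypothetical stage and file names (this is not what t/translate.make contains):

```python
def first_stale_stage(stages, mtimes):
    """Return the index of the first stage that must be re-run, or None.

    stages: list of (name, source_file, output_file) in pipeline order.
    mtimes: dict mapping file name to modification time; a file that is
            missing from the dict counts as not built yet.
    """
    previous_output = None
    for index, (name, source, output) in enumerate(stages):
        out_time = mtimes.get(output)
        deps = [source] + ([previous_output] if previous_output else [])
        if out_time is None or any(
                mtimes.get(dep, float("inf")) > out_time for dep in deps):
            return index        # this stage and everything after it re-runs
        previous_output = output
    return None


# Hypothetical file names, for illustration only.
stages = [
    ("analyse", "x.automorf.bin", "corpus.morf"),
    ("bidix", "x-y.autobil.bin", "corpus.biltrans"),
    ("t1x", "x-y.t1x", "corpus.chunked"),
]
mtimes = {"x.automorf.bin": 1, "corpus.morf": 10,
          "x-y.autobil.bin": 2, "corpus.biltrans": 11,
          "x-y.t1x": 3, "corpus.chunked": 12}
mtimes["x-y.t1x"] = 20          # "edit *t1x"
print(stages[first_stale_stage(stages, mtimes)][0])
```

In a real Makefile each stage would just be a target depending on its source file and the previous stage's output, and make does this bookkeeping for you.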


(only half-joking … The Document Foundation actually sends out a congratulations email when you get your first patch merged, with a little PDF certificate)