Constraint Grammar/Optimisation

From Apertium
Jump to navigation Jump to search

Ideas and discussions on how to optimize VISL CG-3. When it says "I", that means Tino Didriksen.


VISL CG-3 is too slow for some users. It is clearly faster than the old vislcg, but it is also clearly slower than Pasi Tapanainen's CG-2 (up to 10 times slower for compatible features).

Current Performance[edit]

For the reference test set:

  • As of 2014-07-14 (r10044) CG-3 performance is at 378 cohorts per second. Total run time 2.08 seconds.
  • vislcg is at 32 cohorts per second. Total run time 24.63 seconds.
  • CG-2 cannot handle the grammar size.

For all 192042 lines of input:

  • CG-3 (r6898) takes 231.97s
  • vislcg takes 2521.59, or 980% slower than CG-3
  • CG-2 cannot handle the grammar size.

If the grammar is split into smaller morf and syn parts:

  • CG-3 (r6898): morf 53.30s + syn 56.54s
  • vislcg: morf 85.15s + syn 98.69s, or 60-70% slower than CG-3
  • CG-2: morf 6.41s + syn 7.01s, or 80-90% faster than CG-3

Algorithm Pseudocode[edit]

foreach (input_sentence)
    for (max_section=0 ; max_section < num_sections ; )
        changed = false;
        for (current_section=0 ; current_section <= max_section ; current_section++)
            foreach (rule in current_section)
                foreach (cohort in input_sentence)
                    foreach (reading in cohort)
                        if ( matches reading && rule.context is valid)
                            execute rule on reading;
                            changed = true;
        if (changed == false)
  • Sections are re-run if any rule actually did anything, meaning worst case runtime of O(num_rules * factorial(num_sections)). On average, this happens twice so expected runtime is Θ(num_rules * num_sections * 2).
  • Rule order is fully deterministic.
  • Section order is fully deterministic.
  • In fact, the whole program is deterministic.

On the ToDo[edit]

  • Make all tag containers have Tag* instead of uint32_t, to eliminate frequent hash table lookups.
  • Make all set containers have Set* instead of uint32_t, for same reasons.
  • Make all other storages be sequential, so instead of doing a hash table lookup it can just do a vector access.
  • Split runRulesOnSingleWindow() into smaller functions or functors...tried this a few times, but the smaller functions always seem to require 15 input/output arguments.

Wild Ideas[edit]


I have considered and opted not to play with threads (or OpenMP / OpenCL / Intel TBB) myself, for three reasons:

  • CG-3 is never run as a single program; it is always part of a chain of 10 or more programs, usually with 2-4 other CG-3 processes in the chain, so the CPU is always fully utilized anyway; adding threads to the scheduler pool would not benefit normal runs and might indeed slow the overall processing down. This is my primary reason for not getting serious with threading.
  • The strict rule ordering makes it very hard to split into threads. One would have to identify which clusters of rules can be run independently from other clusters. Or throw rule ordering out the window (possibly cmdline flag qualified so people don't get surprises).
  • The most trivial piece to split into a thread is the I/O - reading in and writing out while applying rules elsewhere; but I/O takes 0.01% runtime...

It might be possible to split each grammar section runlevel into threads and run them on the input windows in parallel...but then again, each later window may need the previous window's results thanks to window-spanning contexts.

Finite State Transducer[edit]

  • We believe Pasi Tapanainen's CG-2 is so fast because it uses finite state technology.
  • I personally do not have much experience with finite state machines, so I haven't looked deeply into that.
  • It would seem that getting the max speed gains of FST would also affect the deterministic nature of rule ordering. This may be ok for some users. CG-2 had a non-deterministic rule order, but a deterministic section order.