User:Asfrent/Application
Contents
Project description
The project aims to speed up the VM for transfer code (the transfer step being the slowest of the translation pipeline at the moment).
Preparation
Before starting to actually implement the optimisations I will make sure I have a good testing framework that covers as much of the code as possible. This would also involve fixing the VM-for-transfer code in order to produce the same output as the XML tree-walking transfer code.
High level optimisations
Algorithms and data structures
Look for better data structures and algorithms that will result in a lower complexity of the system. One thing that we could do to the current design is to compress the trie that stores rules (trie compression involves finding path in which the nodes have only one child and compress that path into a single node). The lookup table in each node will be string indexed, allowing the trie to have better search times because less nodes will be explored (although the same number of characters).
We could also introduce supplementary fast links in the trie (in the lookup tables present at each node) by making an analysis of the most common matches that are needed in practice.
Interpreter redesign
A profile of a slightly optimised code shows that a lot of time is spent in the interpreter code in order to convert strings to integers. This is because the interpreter stack has been designed to work with a a single data type (strings), so every time a VM instruction needs integral data types we need to convert the strings.
Instruction set
There are some instructions that take much longer time than others. By analysing the instruction set we should be able to further fragment these time consuming instructions into smaller ones that will perform faster and could be easier to optimise.
Port code to LLVM
Redesign the compiler to produce LLVM code which would enable us to take advantage of the optimisations built into LLVM. Most of the stack / branching / logical / math instructions will be translated directly to LLVM instructions, while complex instructions would be translated to calls to C++ code (compiled to LLVM as well).
By compiling to LLVM we get rid of most of the VM implementation (the parts that handle stack / branching / logical / math operations), we gain a better, simplified architecture and speed. Hopefully, a lot of speed. This can be done in conjunction with the other optimisations, which would then only apply to the complex instructions of the transfer VM (such as clip) and working with the matching data structures.
Low level optimisations
Memory allocator
We could design a memory allocator for the VM data structures that will keep related objects (like nodes in a tree) as close as possible one to the other in order to increase the cache hits).
Branch prediction
Some parts can be restructured in order to make use of the branch prediction at the CPU level. I worked a little on some parts of the code that had this issue and gained some speedup using this technique.
Constant pool
The current implementation of the virtual machine spends a lot of time in copying strings from one method to the other. This could be greatly improved by using a string pool.
Caching
Caching can really improve the behavior of the VM - methods that are called often with (more or less) the same parameters can maintain a cache of the results if they do not have any side effects or depend on the global state. In the proof of concept, caching itself made a difference of about 10%.
Reduce copying of the data structures
As you will see in the code profiling report below, a lot of time was spent in moving vectors around. This can be avoided by a stricter coding style (marking returned vectors with const& or, when not possible, just using plain references).
Proof of concept
The tools I used to produce the proof of concept were:
- vim - the best editor there is
- valgrind - check for memory leaks
- callgrind (a valgrind tool) - profiling the code
- kcachegrind - GUI for callgrind reports
First I compiled and installed apertium on my machine (using the es-ro pair). I started with a fresh version of VM-for-transfer, by checking out the master branch of the repository. The test files I am using are part of the Parliament Proceedings Parallel Corpus 1996-2011.
In order to properly test the VM, I created two input files, pre-xfervm-10000 and pre-xfervm-1000. Both of them are created by running the first part of the pipeline (until just before the transfer), the only difference is the number of lines that were processed out of the original text (which is given by their suffix). The purpose of the pre-xfervm-10000 file is time evaluation, while the profiling will be done on the smaller one, pre-xfervm-1000, because valgrind takes a lot of time to produce the profiling report.
The fresh version (master branch) of the VM-for-transfer takes about 40s to do the transfer:
andrei@andrei-xps bin $ ./_tm real 0m40.275s user 0m40.183s sys 0m0.052s
Let's take a look at the profile (remember, times are measured on the -10000 file, while the profile is done on the -1000 input file):
The odd thing here is the wtolower static method that seems to consume a lot of time. So most of the time spent in selecting a matching rule is actually spent making strings lowercase. This was because a locale variable that was always initialized in the respective function. In this branch I fixed the wtolower method. Now the code runs approx 60% faster:
andrei@andrei-xps bin $ ./_tm real 0m25.168s user 0m25.078s sys 0m0.072s
We can see that we got a huge improvement just by fixing one function. I ran the profiler again, looking for the next bottleneck:
As we can see, the SystemTrie::getPatternNodes method is still quite slow. Since the method returns all nodes that correspond to a certain pattern starting from a node, we could cache the results at node level. Another thing that makes this method work slower than it should is vector copying. I decided to do the caching and restructure the code a bit in order to use constant references instead of copying things around (see this commit). The new time we get is, again, smaller:
andrei@andrei-xps bin $ ./_tm real 0m15.677s user 0m15.589s sys 0m0.080s
The next step I tried is to do some small, low level optimisations (like improving the code for branch prediction hits, see here). The result is almost unnoticeable, but these kind of optimisations could also improve the efficiency of the code:
andrei@andrei-xps bin $ ./_tm real 0m15.447s user 0m15.377s sys 0m0.064s
This is the first time when we can look at the profiler report and see that one of the slowest parts is the actual execution of the VM instructions:
TODO more.
Application
TODO