User:Asfrent/Application
Revision as of 14:05, 17 March 2014

Project description

The project aims to speed up the VM for transfer code (the transfer step is currently the slowest in the translation pipeline).


High level optimisations

Algorithms and data structures

Look for better data structures and algorithms that will result in a lower complexity of the system. One thing we could do to the current design is to compress the trie that stores the rules (trie compression finds paths in which every node has only one child and collapses each such path into a single node). The lookup table in each node will then be string-indexed, giving the trie better search times because fewer nodes are explored (although the same number of characters).
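To make the idea concrete, here is a minimal sketch of a path-compressed trie node whose lookup table is string-indexed. All names (PatTrieNode, trieInsert, trieContains) are illustrative, not the actual SystemTrie API, and the sketch assumes inserted patterns only share whole edges (a full implementation would also split edges when patterns share partial prefixes):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Path-compressed trie node: chains of single-child nodes collapse into one
// edge labelled with a whole substring; the child table is string-indexed.
struct PatTrieNode {
    std::map<std::wstring, std::unique_ptr<PatTrieNode>> children;
    bool isFinal = false;
};

// Follow an edge only when its label matches a full prefix of the remaining
// pattern; otherwise store the whole remaining suffix on a single new edge.
inline void trieInsert(PatTrieNode &root, const std::wstring &pattern) {
    PatTrieNode *node = &root;
    std::wstring rest = pattern;
    while (!rest.empty()) {
        bool advanced = false;
        for (auto &edge : node->children) {
            const std::wstring &label = edge.first;
            if (rest.size() >= label.size() &&
                rest.compare(0, label.size(), label) == 0) {
                node = edge.second.get();
                rest.erase(0, label.size());
                advanced = true;
                break;
            }
        }
        if (!advanced) {
            auto &child = node->children[rest];
            child = std::make_unique<PatTrieNode>();
            node = child.get();
            rest.clear();
        }
    }
    node->isFinal = true;
}

inline bool trieContains(const PatTrieNode &root, const std::wstring &pattern) {
    const PatTrieNode *node = &root;
    std::wstring rest = pattern;
    while (!rest.empty()) {
        bool advanced = false;
        for (const auto &edge : node->children) {
            const std::wstring &label = edge.first;
            if (rest.size() >= label.size() &&
                rest.compare(0, label.size(), label) == 0) {
                node = edge.second.get();
                rest.erase(0, label.size());
                advanced = true;
                break;
            }
        }
        if (!advanced) return false;
    }
    return node->isFinal;
}
```

Matching a pattern now walks one multi-character edge per step instead of one node per character, which is where the "fewer nodes, same number of characters" gain comes from.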

We could also introduce supplementary fast links in the trie (in the lookup tables present at each node) by analysing the most common matches that are needed in practice.

Interpreter redesign

A profile of slightly optimised code shows that a lot of time is spent in the interpreter converting strings to integers. This is because the interpreter stack was designed to work with a single data type (strings), so every time a VM instruction needs an integral data type we need to convert the strings.
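One way the redesign could avoid the repeated conversions is a dual-typed stack slot, sketched below. The StackValue name and layout are assumptions for illustration, not the current VM's types:

```cpp
#include <cassert>
#include <string>

// A stack slot that can hold either a string or an integer, so integral
// results are pushed as integers instead of round-tripping through strings.
struct StackValue {
    enum class Kind { Str, Int } kind;
    std::wstring str;
    int num = 0;

    static StackValue fromInt(int v) {
        StackValue s; s.kind = Kind::Int; s.num = v; return s;
    }
    static StackValue fromStr(const std::wstring &v) {
        StackValue s; s.kind = Kind::Str; s.str = v; return s;
    }

    // Conversion happens only on demand, not on every arithmetic instruction.
    int asInt() const { return kind == Kind::Int ? num : std::stoi(str); }
};
```

An instruction such as integer addition would then pop two slots, call asInt() on each (a no-op for values that are already integers), and push the sum back as an integer.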

Instruction set

There are some instructions that take much longer than others. By analysing the instruction set we should be able to fragment these time-consuming instructions into smaller ones that perform faster and are easier to optimise.
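A minimal sketch of what such fragmentation could look like, using a made-up compound instruction (ConcatN, concatenating N stack values) rewritten as a sequence of two-operand micro-ops — the opcode names are hypothetical, not the transfer VM's real instruction set:

```cpp
#include <cassert>
#include <vector>

enum class Op { ConcatN, Concat2 };

struct Instr {
    Op op;
    int arg; // for ConcatN: how many values to concatenate
};

// Expand every ConcatN into (N - 1) Concat2 micro-instructions; each micro-op
// is simpler to execute and can be optimised individually.
inline std::vector<Instr> fragment(const std::vector<Instr> &code) {
    std::vector<Instr> out;
    for (const Instr &in : code) {
        if (in.op == Op::ConcatN) {
            for (int i = 1; i < in.arg; ++i)
                out.push_back({Op::Concat2, 0});
        } else {
            out.push_back(in);
        }
    }
    return out;
}
```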

Port code to LLVM

Redesign the compiler to produce LLVM code which would enable us to take advantage of the optimisations built into LLVM. Most of the stack / branching / logical / math instructions will be translated directly to LLVM instructions, while complex instructions would be translated to calls to C++ code (compiled to LLVM as well).

By compiling to LLVM we get rid of most of the VM implementation (the parts that handle stack / branching / logical / math operations), and we gain a better, simplified architecture and speed. Hopefully, a lot of speed. This can be done in conjunction with the other optimisations, which would then only apply to the complex instructions of the transfer VM (such as clip) and to the code that works with the matching data structures.
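The split described above can be sketched in C++ terms: a complex instruction stays a C++ routine exposed with an unmangled symbol so that generated code can call it, while a straight-line bytecode sequence reduces to native arithmetic. The helper's name, signature, and body are placeholders, not the real clip implementation:

```cpp
#include <cassert>

// A complex VM operation (think clip) kept as a C++ helper with C linkage,
// callable from LLVM-generated code. Placeholder body for illustration.
extern "C" int vm_complex_helper(int x) { return 2 * x; }

// What generated code for "push a; push b; add; call helper" might reduce to
// once LLVM has optimised the stack traffic away.
inline int compiledRuleSketch(int a, int b) {
    return vm_complex_helper(a + b);
}
```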

Low level optimisations

Memory allocator

We could design a memory allocator for the VM data structures that keeps related objects (like the nodes of a tree) as close to each other as possible in order to increase cache hits.
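A minimal sketch of the idea, assuming a simple bump-pointer arena (the Arena class is illustrative, not an existing allocator in the codebase): nodes allocated back to back stay contiguous in memory, which helps cache locality during tree and trie traversals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bump-pointer arena: allocations come from one contiguous buffer, so
// objects allocated together end up next to each other in memory.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buffer_(capacity), used_(0) {}

    void *allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (used_ + align - 1) / align * align;
        if (start + size > buffer_.size()) return nullptr; // out of space
        used_ = start + size;
        return buffer_.data() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};
```

A real allocator would also handle growth and object lifetimes; the point here is only the contiguity of related allocations.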

Branch prediction

Some parts can be restructured to make better use of branch prediction at the CPU level. I worked a little on some parts of the code that had this issue and gained some speedup with this technique.
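The technique can be sketched as follows: order a hot-path test so the common case is the straight-line fall-through, and (on GCC/Clang) hint the predictor with __builtin_expect. The function below is an illustrative example, not code from the VM:

```cpp
#include <cassert>
#include <vector>

// With GCC/Clang, __builtin_expect tells the compiler which branch is hot;
// elsewhere the macro is a no-op.
#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Count printable characters: in real text the printable case dominates,
// so it is tested first and marked likely.
inline int countPrintable(const std::vector<int> &codepoints) {
    int count = 0;
    for (int c : codepoints)
        if (LIKELY(c >= 0x20)) ++count; // common case, predicted taken
    return count;
}
```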

Constant pool

The current implementation of the virtual machine spends a lot of time copying strings from one method to another. This could be greatly improved by using a string pool.
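A minimal sketch of string interning, assuming a pool like the one below (the StringPool class is illustrative): each distinct string is stored once, and methods pass around a cheap pointer instead of copying the string.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// String pool: intern() returns a stable pointer to the single stored copy
// of a string, so equal strings share storage and compare by pointer.
class StringPool {
public:
    const std::wstring *intern(const std::wstring &s) {
        // unordered_set keeps element addresses stable across rehashing.
        return &*pool_.insert(s).first;
    }
    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_set<std::wstring> pool_;
};
```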

Caching

Caching can really improve the behaviour of the VM: methods that are called often with (more or less) the same parameters can maintain a cache of their results, provided they have no side effects and do not depend on global state. In the proof of concept, caching alone made a difference of about 10%.
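A minimal sketch of the pattern, with an invented stand-in for the expensive computation (the names and the work done are illustrative, not the VM's real methods): the first call computes, later calls with the same argument hit the cache.

```cpp
#include <cassert>
#include <map>
#include <string>

int expensiveLookupCalls = 0; // counts real computations, for demonstration

// Stand-in for a side-effect-free but costly method.
inline int expensiveLookup(const std::wstring &key) {
    ++expensiveLookupCalls;
    return static_cast<int>(key.size());
}

// Memoising wrapper: safe only because expensiveLookup is deterministic
// and has no side effects.
inline int cachedLookup(const std::wstring &key) {
    static std::map<std::wstring, int> cache;
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;
    int result = expensiveLookup(key);
    cache.emplace(key, result);
    return result;
}
```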

Reduce copying of the data structures

As you will see in the code profiling report below, a lot of time was spent moving vectors around. This can be avoided with a stricter coding style (returning vectors by const& or, when that is not possible, using plain references).
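The coding-style rule can be sketched in two lines (the RuleStore class is an invented example, not the actual code): accessors return a const reference to the stored container, so call sites stop copying vectors.

```cpp
#include <cassert>
#include <string>
#include <vector>

class RuleStore {
public:
    void add(const std::wstring &rule) { rules_.push_back(rule); }

    // const& return: the caller binds a reference to the stored vector
    // instead of receiving a copy on every call.
    const std::vector<std::wstring> &rules() const { return rules_; }

private:
    std::vector<std::wstring> rules_;
};
```

When the caller genuinely needs a mutable copy it can still make one explicitly, which keeps the copies visible in the code instead of hidden in return statements.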

Proof of concept

The tools I used to produce the proof of concept were:

  • vim - the best editor there is
  • valgrind - check for memory leaks
  • callgrind (a valgrind tool) - profiling the code
  • kcachegrind - GUI for callgrind reports

First I compiled and installed Apertium on my machine (using the es-ro pair). I started with a fresh version of VM-for-transfer, by checking out the master branch of the repository. The test files I am using are part of the Parliament Proceedings Parallel Corpus 1996-2011.

In order to properly test the VM, I created two input files, pre-xfervm-10000 and pre-xfervm-1000. Both are created by running the first part of the pipeline (until just before the transfer); the only difference is the number of lines processed from the original text (given by their suffix). The purpose of the pre-xfervm-10000 file is time evaluation, while the profiling will be done on the smaller one, pre-xfervm-1000, because valgrind takes a lot of time to produce the profiling report.

The fresh version (master branch) of the VM-for-transfer takes about 40s to do the transfer:

andrei@andrei-xps bin $ ./_tm
real	0m40.275s
user	0m40.183s
sys	0m0.052s

Let's take a look at the profile (remember, times are measured on the -10000 file, while the profile is done on the -1000 input file):

Xfervm-master-profile.png

The odd thing here is the wtolower static method, which seems to consume a lot of time. So most of the time spent in selecting a matching rule is actually spent making strings lowercase. This was caused by a locale variable that was initialised on every call to that function. In this branch I fixed the wtolower method; the code now runs approximately 60% faster:

andrei@andrei-xps bin $ ./_tm
real	0m25.168s
user	0m25.078s
sys	0m0.072s
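The shape of the fix can be sketched as follows (names and details are illustrative, not the exact code in the branch): constructing a std::locale is expensive, so it is built once as a function-local static and reused across calls.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Lowercase a wide string. The locale is initialised exactly once, on the
// first call, instead of being constructed on every invocation.
inline std::wstring wtolower(const std::wstring &input) {
    static const std::locale loc; // the expensive part, now done once
    std::wstring out;
    out.reserve(input.size());
    for (wchar_t c : input)
        out.push_back(std::tolower(c, loc));
    return out;
}
```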

We can see that we got a huge improvement just by fixing one function. I ran the profiler again, looking for the next bottleneck:

Xfervm-after-wtolower.png

As we can see, the SystemTrie::getPatternNodes method is still quite slow. Since the method returns all nodes that correspond to a certain pattern starting from a given node, we can cache the results at node level. Another thing that makes this method slower than it should be is vector copying. I decided to add the caching and restructure the code a bit to use constant references instead of copying things around (see this commit). The new time is, again, smaller:

andrei@andrei-xps bin $ ./_tm
real	0m15.677s
user	0m15.589s
sys	0m0.080s

The next step was some small, low-level optimisations (such as restructuring code for better branch prediction, see here). The result is almost unnoticeable, but this kind of optimisation can still improve the efficiency of the code:

andrei@andrei-xps bin $ ./_tm
real	0m15.447s
user	0m15.377s
sys	0m0.064s

This is the first time we can look at the profiler report and see that one of the slowest parts is the actual execution of the VM instructions:

Xfervm-after-caching-and-branch.png

TODO more.

Application

TODO