User:Asfrent/Application
Project description
The project aims to speed up the VM-for-transfer code (the transfer step is currently the slowest stage of the translation pipeline).
Preparation
Before implementing the optimisations themselves, I will make sure I have a good testing framework that covers as much of the code as possible. This also involves fixing the VM-for-transfer code so that it produces the same output as the XML tree-walking transfer code.
High level optimisations
Algorithms and data structures
Look for better data structures and algorithms that lower the overall complexity of the system. One thing we could do to the current design is compress the trie that stores the rules (trie compression means finding paths in which every node has only one child and collapsing each such path into a single node). The lookup table in each node will then be string-indexed, giving the trie better search times because fewer nodes are explored (although the same number of characters).
We could also introduce supplementary fast links in the trie (in the lookup tables present at each node), based on an analysis of the most common matches needed in practice.
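As a rough illustration of the idea (the node layout and function names here are mine, not the actual SystemTrie code), path compression over a string-indexed child table might look like:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Toy trie node: the child table is string-indexed so an edge can carry a
// whole compressed segment instead of a single character.
struct TrieNode {
    std::map<std::string, std::unique_ptr<TrieNode>> children;
    bool isRule = false;  // marks the end of a stored rule pattern
};

// Insert a pattern one character per edge (the uncompressed form).
void insert(TrieNode* root, const std::string& word) {
    TrieNode* cur = root;
    for (char c : word) {
        auto& slot = cur->children[std::string(1, c)];
        if (!slot) slot = std::make_unique<TrieNode>();
        cur = slot.get();
    }
    cur->isRule = true;
}

// Compress bottom-up: merge every chain of single-child nodes that are not
// themselves rule ends into one node whose edge label is the whole path.
void compress(TrieNode* node) {
    for (auto& kv : node->children) compress(kv.second.get());
    std::map<std::string, std::unique_ptr<TrieNode>> merged;
    for (auto& kv : node->children) {
        std::string label = kv.first;
        std::unique_ptr<TrieNode> child = std::move(kv.second);
        while (child->children.size() == 1 && !child->isRule) {
            auto inner = child->children.begin();
            label += inner->first;
            child = std::move(inner->second);
        }
        merged[label] = std::move(child);
    }
    node->children = std::move(merged);
}
```

After compressing, a lookup walks one node per shared segment rather than one node per character, which is where the reduced node count comes from.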
Interpreter redesign
A profile of slightly optimised code shows that a lot of time is spent in the interpreter converting strings to integers. This is because the interpreter stack was designed to work with a single data type (strings), so every time a VM instruction needs an integral value we have to convert the string.
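A hedged sketch of what a tagged stack slot could look like (the types and names are illustrative, not the current VM's): integers stay integers on the stack, and conversion happens only when an instruction genuinely needs the other representation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Tagged stack value: avoids the string<->int round trip for arithmetic.
struct Value {
    enum class Tag { Int, Str } tag;
    long i = 0;
    std::string s;

    static Value makeInt(long v) { Value x; x.tag = Tag::Int; x.i = v; return x; }
    static Value makeStr(std::string v) { Value x; x.tag = Tag::Str; x.s = std::move(v); return x; }

    // Convert lazily, only when the other type is actually required.
    long asInt() const { return tag == Tag::Int ? i : std::stol(s); }
    std::string asStr() const { return tag == Tag::Str ? s : std::to_string(i); }
};

// A tiny ADD instruction over the operand stack: when both operands are
// already integers, no conversion happens at all.
long runAdd(std::vector<Value>& stack) {
    long b = stack.back().asInt(); stack.pop_back();
    long a = stack.back().asInt(); stack.pop_back();
    stack.push_back(Value::makeInt(a + b));
    return stack.back().i;
}
```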
Instruction set
Some instructions take much longer than others. By analysing the instruction set we should be able to split these time-consuming instructions into smaller ones that run faster and are easier to optimise.
Port code to LLVM
Redesign the compiler to produce LLVM code, which would let us take advantage of the optimisations built into LLVM. Most of the stack / branching / logical / math instructions would be translated directly to LLVM instructions, while complex instructions would be translated into calls to C++ code (compiled to LLVM as well).
By compiling to LLVM we get rid of most of the VM implementation (the parts that handle stack / branching / logical / math operations) and gain a simpler architecture and speed. Hopefully, a lot of speed. This can be done in conjunction with the other optimisations, which would then apply only to the complex instructions of the transfer VM (such as clip) and to the matching data structures.
Low level optimisations
Memory allocator
We could design a memory allocator for the VM data structures that keeps related objects (like nodes in a tree) as close to each other as possible in order to increase cache hits.
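A minimal bump ("arena") allocator sketch of the idea (the API and chunk size are hypothetical, and it only suits trivially destructible node types): objects created together land contiguously in one chunk, which helps cache locality when the structure is traversed in roughly creation order.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Bump allocator: hands out consecutive slices of large chunks. Individual
// objects are never freed; all memory is released when the arena dies.
// Assumes each allocation fits in one chunk.
class Arena {
public:
    explicit Arena(std::size_t chunkSize = 64 * 1024) : chunkSize_(chunkSize) {}

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        offset_ = (offset_ + align - 1) & ~(align - 1);  // round up to alignment
        if (chunks_.empty() || offset_ + n > chunkSize_) {
            chunks_.emplace_back(new char[chunkSize_]);
            offset_ = 0;
        }
        void* p = chunks_.back().get() + offset_;
        offset_ += n;
        return p;
    }

    // Placement-new a T inside the arena; no per-object delete needed.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        return new (allocate(sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
    }

private:
    std::size_t chunkSize_;
    std::size_t offset_ = 0;
    std::vector<std::unique_ptr<char[]>> chunks_;
};
```

Tree nodes allocated back-to-back this way end up a few bytes apart instead of wherever the general-purpose heap scatters them.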
Branch prediction
Some parts can be restructured to make better use of CPU branch prediction. I already worked on a few parts of the code that had this issue and gained some speedup with this technique.
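As a small illustration of the technique (toy functions, not VM code): a data-dependent branch in a hot loop mispredicts often on irregular input, while an equivalent branchless form usually lets the compiler emit a conditional move instead.

```cpp
#include <cassert>
#include <vector>

// Branchy version: the `if` is taken or not depending on the data, so on
// random input the CPU's branch predictor is frequently wrong.
long sumIfGreaterBranchy(const std::vector<int>& v, int threshold) {
    long s = 0;
    for (int x : v)
        if (x > threshold) s += x;
    return s;
}

// Branchless version: same result, but compilers typically turn the
// ternary into a conditional move, removing the unpredictable branch.
long sumIfGreaterBranchless(const std::vector<int>& v, int threshold) {
    long s = 0;
    for (int x : v)
        s += (x > threshold) ? x : 0;
    return s;
}
```

Both functions are observationally identical; only the generated code differs, which is why this class of change shows up as "almost free" speedup.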
Constant pool
The current implementation of the virtual machine spends a lot of time copying strings from one method to another. This could be greatly improved by using a string pool.
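A minimal string-interning sketch (the class and names are mine; the real pool would live inside the VM): each distinct string is stored once, and methods then pass around cheap pointers to the pooled copy instead of copying whole strings.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// String pool: intern() returns a stable pointer to the single stored copy
// of a string. Pointers to elements of an unordered_set stay valid across
// insertions, so they can be passed around freely.
class StringPool {
public:
    const std::string* intern(const std::string& s) {
        return &*pool_.insert(s).first;  // stored once, reused afterwards
    }

private:
    std::unordered_set<std::string> pool_;
};
```

Equality of interned strings also degenerates to a pointer comparison, which is a nice side benefit in rule matching.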
Caching
Caching can really improve the behaviour of the VM: methods that are called often with (more or less) the same parameters can keep a cache of their results, provided they have no side effects and do not depend on global state. In the proof of concept, caching alone made a difference of about 10%.
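A sketch of the technique on a toy function (the call counter exists only to demonstrate that the underlying computation runs once per distinct argument; none of this is the actual VM code):

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

int gComputeCalls = 0;  // instrumentation for the demo only

// Stands in for an expensive, side-effect-free computation.
std::string expensiveLowercase(const std::string& s) {
    ++gComputeCalls;
    std::string out = s;
    for (char& c : out)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return out;
}

// Memoized wrapper: repeated calls with the same argument hit the cache.
const std::string& cachedLowercase(const std::string& s) {
    static std::map<std::string, std::string> cache;
    auto it = cache.find(s);
    if (it == cache.end())
        it = cache.emplace(s, expensiveLowercase(s)).first;
    return it->second;
}
```

The same shape applies to any VM method whose result depends only on its arguments; the cache can live at whatever level (node, instruction, VM) makes invalidation trivial.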
Reduce copying of the data structures
As the code profiling report below shows, a lot of time is spent moving vectors around. This can be avoided with a stricter coding style (returning vectors by const reference or, when that is not possible, passing plain references).
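A small illustration of the style change (the class and names are invented for the example): return an internal vector by const reference, and when the result must be built per call, let the caller own the storage via an out-parameter.

```cpp
#include <cassert>
#include <string>
#include <vector>

class RuleSet {
public:
    RuleSet() : rules_{"rule1", "rule2"} {}

    // Before: std::vector<std::string> rules() const { return rules_; }  // copies
    const std::vector<std::string>& rules() const { return rules_; }      // no copy

    // For results built per call, fill a caller-provided vector instead of
    // returning a fresh one by value each time.
    void matchingRules(const std::string& prefix, std::vector<std::string>& out) const {
        out.clear();
        for (const auto& r : rules_)
            if (r.compare(0, prefix.size(), prefix) == 0) out.push_back(r);
    }

private:
    std::vector<std::string> rules_;
};
```

The out-parameter pattern also lets a hot loop reuse one vector's capacity across iterations instead of reallocating every call.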
Proof of concept (coding challenge and one step further)
The tools I used to produce the proof of concept were:
- vim - the best editor there is
- valgrind - check for memory leaks
- callgrind (a valgrind tool) - profiling the code
- kcachegrind - GUI for callgrind reports
First I compiled and installed apertium on my machine (using the es-ro pair). I started with a fresh version of VM-for-transfer, by checking out the master branch of the repository. The test files I am using are part of the Parliament Proceedings Parallel Corpus 1996-2011.
In order to properly test the VM, I created two input files, pre-xfervm-10000 and pre-xfervm-1000. Both are created by running the first part of the pipeline (up to just before the transfer); the only difference is the number of lines processed from the original text (given by their suffix). The purpose of the pre-xfervm-10000 file is time evaluation, while profiling is done on the smaller one, pre-xfervm-1000, because valgrind takes a long time to produce the profiling report.
The fresh version (master branch) of the VM-for-transfer takes about 40s to do the transfer:
andrei@andrei-xps bin $ ./_tm
real    0m40.275s
user    0m40.183s
sys     0m0.052s
Let's take a look at the profile (remember, times are measured on the -10000 file, while the profile is done on the -1000 input file):
The odd thing here is the wtolower static method, which seems to consume a lot of time: most of the time spent selecting a matching rule is actually spent lowercasing strings. This was because a locale variable was initialized on every call of that function. In this branch I fixed the wtolower method; now the code runs about 60% faster:
andrei@andrei-xps bin $ ./_tm
real    0m25.168s
user    0m25.078s
sys     0m0.072s
We can see that we got a huge improvement just by fixing one function. I ran the profiler again, looking for the next bottleneck:
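The shape of that fix, sketched on a toy wtolower (this is illustrative code, not the actual VM-for-transfer source): the slow version constructed a locale object on every call, so making it a function-local static constructs it exactly once.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Before (slow): std::locale loc; was built inside the function body on
// every call, dominating rule matching. After: the static is built once.
std::wstring wtolower(std::wstring s) {
    static const std::locale loc;  // constructed on first call only
    for (wchar_t& c : s)
        c = std::tolower(c, loc);
    return s;
}
```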
As we can see, the SystemTrie::getPatternNodes method is still quite slow. Since the method returns all nodes that correspond to a certain pattern starting from a given node, we can cache the results at node level. Another thing that slows this method down is vector copying. I decided to add the caching and restructure the code a bit to use constant references instead of copying things around (see this commit). The new time is, again, smaller:
andrei@andrei-xps bin $ ./_tm
real    0m15.677s
user    0m15.589s
sys     0m0.080s
The next step was to try some small, low-level optimisations (such as improving the code for branch prediction hits, see here). The result is almost unnoticeable, but this kind of optimisation can still improve the efficiency of the code:
andrei@andrei-xps bin $ ./_tm
real    0m15.447s
user    0m15.377s
sys     0m0.064s
This is the first time we can look at the profiler report and see that one of the slowest parts is the actual execution of the VM instructions:
TODO more.
Application
General information
Name: Andrei Sfrent
E-mail address: gsoc a asfrent d net
Other information that may be useful to contact you:
IRC - asfrent on #apertium
Why is it that you are interested in the Apertium project?
I find this organisation's projects interesting and closely related to theory I studied in school (context-free grammars, state machines, compilers). I like getting involved in projects like this one, where I can learn new things while improving speed and reliability by applying my experience (so a good starting point for me would be VM-for-transfer). Also, I met some really cool people on IRC, so I think this project has a great team!
Which of the published tasks are you interested in? What do you plan to do?
The VM-for-transfer project, along with any other optimisation tasks. I plan to make VM-for-transfer at least somewhat faster than the XML tree-walking code and then focus on other bottlenecks in the project.
About me (short summary)
My name is Andrei Sfrent. I am studying for a Master's degree in Machine Learning at Imperial College and I am interested in working on one of the Apertium projects in GSoC this summer.
I am proficient in C++ and have a good background in software optimisation (last year I took part in an optimisation contest organised by Intel and finished first in Europe / fourth worldwide [1] [2]).
I previously did two internships at Google, where I worked with MapReduce and the Knowledge Graph on YouTube data, strengthening my engineering and design skills.
You can also find out more about me from my resume.
[1] http://software.intel.com/fr-fr/articles/contest-winners-are-announced
[2] http://intel-software-academic-program.com/tmp/certificates/4.pdf
About the project
TODO