- 1 VM for the transfer module application
- 1.1 About me
- 1.2 Introduction
- 1.3 Why is it you are interested in machine translation?
- 1.4 Why is it that you are interested in the Apertium project?
- 1.5 Which of the published tasks are you interested in? What do you plan to do?
- 1.6 Work plan
- 1.7 Why should Google and Apertium sponsor this project?
- 1.8 Other commitments for the summer
- 1.9 References
VM for the transfer module application
About me
Name: Gabriel Gregori Manzano
Email/Google chat: Email me
IRC nick: ggregori
Introduction
My name is Gabriel Gregori Manzano, and I am a Computer Science student at the University of Alicante, Spain. I am in my fifth year and will finish all my required courses this year; I plan to complete my degree next year by developing my final degree project and taking my remaining elective credits. Being at the end of my degree, I have been exposed to many programming languages, technologies and techniques, but the most relevant to this project are my focus on C++ and Java at the university and the Python I have learned on my own.
Why is it you are interested in machine translation?
My interest in this field began with my first Compilers course (taught by some of the researchers involved in the Apertium project) where, although we didn't study machine translation, we were reminded that many of the techniques and algorithms we learned are used in the machine translation field. That was my first introduction to the field. From there I tried to satisfy my curiosity on my own until I could take the relevant course my university offers, but unfortunately it wasn't available as an elective this year. For me, the most amazing thing about machine translation is trying to make a computer understand a language that is, by definition, ambiguous, as well as the relations between different languages; hence my interest in the transfer rules.
Why is it that you are interested in the Apertium project?
Apertium is the union of two things I am interested in: machine translation and free software. Being free software makes it the best candidate for people with programming knowledge to learn how a real, complete machine translation system works. It also means that all kinds of projects and uses can be discovered or developed with it, and that many people can use it without having to spend a lot of money on proprietary software.
Which of the published tasks are you interested in? What do you plan to do?
I have been discussing the development of the “VM for the transfer module” project with my possible mentor, Sergio Ortiz. This project was started by another student last year, but he wasn't able to complete it, so I intend to reuse as much of the existing design work as I can, in particular the instruction set and the foundation of the compiler's architecture, while making the architecture more flexible by decoupling its components and making them easier to change. My tasks will be to first develop a prototype in Python, which will be useful in the future for testing and developing more complex things, and later to port it to C++.
This project aims to solve one of Apertium's major current bottlenecks: the processing of the transfer rules. This is a known issue documented in the wiki; several solutions have been proposed, and some of them have been implemented. The problem lies in the processing of the XML files used by the three-level transfer system.
The proposed solution is to compile the transfer rule files to a pseudo-assembly defined for this task and to write a lightweight interpreter for the generated pseudo-assembly. The scope of the project is therefore to build three main components:
1. An instruction set for the pseudo-assembly.
2. A compiler for the transfer rules files to the pseudo-assembly.
3. An interpreter of the final pseudo-assembly generated.
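To make the idea of a light interpreter more concrete, here is a minimal sketch of a stack-based interpreter in Python. Everything in it is an assumption made for illustration: the opcode names (`push`, `clip`, `concat`, `cmp`, `jz`, `jmp`, `out`), the `run` function and the representation of words as dictionaries are hypothetical, and they are not the instruction set defined on the wiki.

```python
# Hypothetical pseudo-assembly: the mnemonics below are illustrative only and
# are NOT the instruction set defined on the Apertium wiki.

def run(program, words):
    """Interpret a list of (opcode, operand) pairs over a list of word dicts."""
    stack = []    # operand stack
    output = []   # fragments emitted so far
    pc = 0        # program counter
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":        # push a literal value
            stack.append(arg)
        elif op == "clip":      # pop a 1-based word position, push that word's attribute
            pos = stack.pop()
            stack.append(words[pos - 1][arg])
        elif op == "concat":    # pop `arg` values, push their concatenation in push order
            parts = [stack.pop() for _ in range(arg)]
            stack.append("".join(reversed(parts)))
        elif op == "cmp":       # pop two values, push whether they are equal
            b, a = stack.pop(), stack.pop()
            stack.append(a == b)
        elif op == "jz":        # pop a value, jump to `arg` if it is false
            if not stack.pop():
                pc = arg
        elif op == "jmp":       # unconditional jump
            pc = arg
        elif op == "out":       # pop a value and emit it
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {op}")
    return "".join(output)


# A toy "adjective noun -> noun adjective" reordering, written by hand.
words = [{"lem": "red", "tags": "<adj>"}, {"lem": "car", "tags": "<n><sg>"}]
program = [
    ("push", 2), ("clip", "lem"), ("push", 2), ("clip", "tags"),
    ("concat", 2), ("out", None),
    ("push", " "), ("out", None),
    ("push", 1), ("clip", "lem"), ("push", 1), ("clip", "tags"),
    ("concat", 2), ("out", None),
]
print(run(program, words))  # car<n><sg> red<adj>
```

The point of the sketch is only that, once the rules are compiled, applying them reduces to a small dispatch loop over flat instructions, with no XML handling at run time.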
As I said before, after discussing the project with my possible mentor Sergio Ortiz, the first prototype will be implemented in Python. The main objective of this prototype is to use the expressiveness of the language to ease the development of the system and to make it possible, in the future, to develop more complex systems, such as a full one-pass parser of the transfer rules with ambiguity resolved by statistical methods.
Once the prototype is complete, I will be able to port it to C++ more easily. The main reason behind this choice is that implementing it in C++ will minimize the number of external libraries required, given that Apertium is already built in C++. In addition, performance and memory usage will be easier to keep under control than with other languages or platforms. In short, the goal is to keep Apertium as light, fast and easy to run and manage as possible.
Work plan
My intention is to first define the pseudo-assembly as an instruction set, then build the compiler of the transfer rules to this pseudo-assembly and finally build the interpreter or VM.
Although I am going to define the pseudo-assembly at the start of the project, that doesn't mean it won't change: I will refine it as necessary during the development of the compiler.
Then I will start working on the compiler incrementally, defining a set of tests for a small subset of the instructions and then implementing those instructions in the compiler.
Once I have the compiler for the full instruction set, I will start work on the interpreter in a similar way, defining a set of tests and then implementing the tested instructions in the interpreter. Finally, when the system is complete, I will start the port to C++.
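The incremental, test-first approach described above could look like the following sketch. It is an assumption-laden illustration, not the planned design: `compile_out`, `compile_value` and the emitted opcodes (`push`, `clip`, `concat`, `out`) are hypothetical names. It handles only a first subset of an `<out>` block (`<lu>`, `<lit>`, `<clip>` and `<b/>`) and explicitly rejects everything else until a later iteration adds it.

```python
import unittest
import xml.etree.ElementTree as ET


def compile_value(node):
    """Compile one value-producing element into hypothetical instructions."""
    if node.tag == "lit":
        return [("push", node.get("v"))]
    if node.tag == "clip":
        return [("push", int(node.get("pos"))), ("clip", node.get("part"))]
    # Elements outside the current subset are reported, not silently ignored.
    raise NotImplementedError(f"<{node.tag}> is not supported yet")


def compile_out(xml_text):
    """Compile an <out> element (given as a string) to a list of instructions."""
    code = []
    for child in ET.fromstring(xml_text):
        if child.tag == "lu":
            for value in child:
                code.extend(compile_value(value))
            code.append(("concat", len(child)))  # join the parts of this lexical unit
            code.append(("out", None))
        elif child.tag == "b":
            code.extend([("push", " "), ("out", None)])
        else:
            raise NotImplementedError(f"<{child.tag}> is not supported yet")
    return code


class FirstSubsetTest(unittest.TestCase):
    """Tests written before (and driving) the implementation of the first subset."""

    def test_literal(self):
        self.assertEqual(
            compile_out('<out><lu><lit v="the"/></lu></out>'),
            [("push", "the"), ("concat", 1), ("out", None)],
        )

    def test_clip_and_blank(self):
        self.assertEqual(
            compile_out('<out><lu><clip pos="2" side="tl" part="lem"/></lu><b/></out>'),
            [("push", 2), ("clip", "lem"), ("concat", 1), ("out", None),
             ("push", " "), ("out", None)],
        )

    def test_unsupported_element_is_reported(self):
        with self.assertRaises(NotImplementedError):
            compile_out('<out><lu><var n="x"/></lu></out>')
```

Saved as a module, the tests can be run with `python -m unittest`. Each later batch of instructions would add new test methods first and then extend `compile_value` and `compile_out` until they pass.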
Detailed work plan:
Community bonding period
In this period I intend to become more involved in the Apertium community, to explore in more depth the parts of the Apertium code which I haven't looked at much so far (i.e. the parts not directly involved with my project), and to make sure I know and understand all the community processes which I have already started to look at. Another important use of this time will be to build some prototypes and try different ideas and architectures, in order to learn what can and cannot work so that I lose as little time as possible when the first week of the program starts.
Week 1: Start implementing the foundation of the compiler's architecture. Create a set of tests for the first subset of instructions which are going to be implemented.
Week 2: Continue implementing the architecture, focusing on the system's trie, and start implementing the tests for the first subset of instructions.
Week 3: Design another batch of tests for the next part of the instruction set and start implementing those instructions in the compiler.
Week 4: Implement the last part of the instruction set and write the additional tests and documentation required.
Deliverable #1: A compiler written in Python which converts the XML transfer files to the defined pseudo-assembly.
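As an illustration of the trie mentioned for week 2, the sketch below shows one way such a structure could select a rule. It assumes that the trie maps sequences of pattern-item category names to rule identifiers and that the longest matching pattern wins; the class `PatternTrie`, its methods and the category names are hypothetical, and the real design may differ.

```python
class PatternTrie:
    """Hypothetical trie from sequences of category names to rule identifiers."""

    def __init__(self):
        self.children = {}   # category name -> child node
        self.rule = None     # rule identifier if a pattern ends at this node

    def insert(self, categories, rule_id):
        node = self
        for category in categories:
            node = node.children.setdefault(category, PatternTrie())
        if node.rule is None:   # assumption: the first rule defined for a pattern wins
            node.rule = rule_id

    def longest_match(self, categories):
        """Return (rule_id, length) for the longest pattern matching a prefix."""
        node, best = self, (None, 0)
        for length, category in enumerate(categories, 1):
            node = node.children.get(category)
            if node is None:
                break
            if node.rule is not None:
                best = (node.rule, length)
        return best


trie = PatternTrie()
trie.insert(["det", "nom"], 1)
trie.insert(["det", "adj", "nom"], 2)
trie.insert(["nom"], 3)
print(trie.longest_match(["det", "adj", "nom", "verb"]))  # (2, 3)
```

With a structure like this, choosing the rule to apply at a given position costs one dictionary lookup per input word, independently of how many rules are defined.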
Week 5: Use the compiler already built to design some test cases for the VM, possibly reusing the results of the compiler's tests. Start implementing the foundation of the VM and all the methods related to the VM's stack control.
Week 6: Continue implementing the VM focusing on the first subset of instructions to be interpreted by it. Develop the next set of tests.
Week 7: Complete the implementation of the instruction set in the interpreter and add some integration tests if needed.
Week 8: Finish the remaining work on the VM, documenting as needed, refactoring and organizing the code to facilitate the port to C++.
Deliverable #2: A complete VM with the ability to process the entire instruction set.
Week 9: Start adapting the tests to the C++ environment and begin porting the VM structure to C++, possibly using some stubs until the port of all the code is complete.
Week 10: Continue porting the code until all the stubs are filled in and every test passes, so that the VM behaves exactly like its Python counterpart.
Deliverable #3: The VM fully ported to C++ with the same features as its Python counterpart.
Week 11: Repeat the process for the compiler, starting by adapting the compiler's tests to the C++ environment and then replicating the structure of the Python implementation.
Week 12: Port the remaining code to C++ until I am completely sure that all the tests pass. Finish all the remaining tasks: complete the documentation, clean and organize all the code, update installation/configuration notes as needed...
Final deliverable: The complete system will consist of the two compilers (the Python one and its C++ port), the two interpreters or VMs (the Python one and its C++ port), a set of tests for every part of the system and all the documentation needed (focusing on the design part, e.g. instruction set, data structures, details of the VM's stack...).
Why should Google and Apertium sponsor this project?
I think I have the knowledge and skills required for this project. Its objective is to remove one of Apertium's current bottlenecks and thereby improve the performance of the system, which will benefit every current user of Apertium as well as its developers. I would be thankful to Google and Apertium for the opportunity to contribute to a well-established and important project; in exchange, I commit to completing this project, and I hope to remain part of this great project for more than just a summer.
Other commitments for the summer
I finish all my final exams in the first week of the timeline (23-30 May), and from then on I have no other commitments, so I will be able to focus all my energy on my Apertium project and its completion. Because of this, I will be able to work on the project for all the weekly hours required.
References
- Last year's instruction set, http://wiki.apertium.org/wiki/VM_for_transfer#Instruction_Sets