From Apertium
Jump to navigation Jump to search

About me[edit]

Name: Gabriel Gregori Manzano

Email/Google chat: Email me

IRC nick: ggregori

GSoC 2011[edit]

VM for the transfer module - Application

Github repository (python): [1]

Github repository (c++): [2]

TODO list[edit]

  • Port modules: debugger.

Weekly reports[edit]

Community Bonding Period[edit]

Week 1 - (25/04 - 01/05): Basically this week has been dedicated to research/review some topics (some of them suggested by my mentor)

  • I have been reviewing NLP and Python using 'Natural Language Processing with Python' book.
  • I have been looking for a way to represent morphological labels in UCS/UTF and my mentor suggested using negative numbers as in Apertium internals. Anyway, I can worry about this later.
  • Using UTF with Python: 'codecs' and 'unicodedata' can be some useful modules.
  • Testing the option 'lt-proc' -b which is going to be the input of my compiler.

Week 2 - (02/05 - 08/05): This week I ended all the review/research needed, although I couldn't do all I wanted because I had to travel.

  • Ended with the introductory book reviewing NLP and Python.
  • Started designing and redefining the compiler's architecture following last year work and selected and did some tests with some modules. Some of the changes or improvements:
    • Use of pipes/command-line arguments for the input of the compiler (like the rest of Apertium).
    • Configurable logging module for info and debugging purposes (module: logging).
    • Refactoring some methods in the expatparser class (e.g. extracting common code of the callback method).
    • Create some additional classes in order to add some flexibility (e.g. parent class parser with the common code).

Week 3 - (09/05 - 15/05): This week I had to redo some work because of the Python3 switch, so didn't accomplish want I wanted. Anyway, two weeks of university classes remaining until I can focus exclusively in this project.

  • Switched to Python 3, reasons:
    • I hope to get better UTF-8 support among other things.
    • Had to test if the modules I use were fully available/compatible in Python3.
    • Had to read and research (again...) about str/bytes and std{in,out}.buffer and, in general, everything related to Unicode, UTF-8...
  • Started implementing the really basics of the compiler’s architecture:
    • Command-line arguments and help, input and output, logging...
  • Another think I realized this week is that a lot of the thinking done last week about trying to make a flexible prototype so it is easy to modify in the future doesn’t really apply to Python. For example, my design involved creating interfaces/abstract classes in order to be able to easily change components, but that in Python isn’t needed. In conclusion: duck-typing, although I will need my design in the C++ version.

Coding Period[edit]

Week 1 - (16/05 - 29/05): This last days have been impossible with university work, just this week I had like 4 class projects and 2 exams... Tuesday next week I will finish everything and will be able to focus completely on my project.

Week 2 - (30/05 - 05/06): Finally I can focus completely on my project and this week I have developed a lot the compiler:

  • Finished the structure of the project, now I am ready to start generating code from the transfer rules.
  • Created the Github repository where I will submit my work (link is at the top).
  • Implemented all the handling of the sections: def-cats, def-attrs, def-vars, def-lists and def-macros.
  • Created some test macros with the desired output in pseudo-assembly.
  • Implemented the generation of code for some elements: <not>, <equal>, b, <lit>, ...
  • Improved some of the code, creating a SymbolTable, separating debugging output and actual output etc.

Week 3 - (06/06 - 12/06): This week I finished my compiler which is able to generate pseudo-assembly for every element of the transfer rules files.

  • Added the ability to store some attributes like the number of children of an event, its parent etc.
  • Created more tests of macro's code generation and added new for rules and t2x/t3x.
  • Added some necessary instruction and change some other to maintain coherence.
  • Added code generation for all the remaining elements and its attributes: <when>, <test>, var, <let>, <lit-tag>, <clip>, <choose>, <otherwise>, <equal caseless=yes>, <and>, <or>, <in>, <list>, <get-case-from>, <concat>, <append>, <modify-case>, <case-of>, <begins-with> <begins-with-list>, <end-with>, <end-with-list>, <contains-substring>, <rule>, <pattern>, <pattern-item>, <action>, <lu>, <mlu>, <tags>, <chunk>, <call-macro>, <with-param>, <interchunk>, <postchunk>, <lu count>.
  • Created error detection and reporting for the input transfer rules files.

Week 4 - (13/06 - 19/06): I've spent most of the week thinking and designing the vm's architecture and started implementing it:

  • Updated the vm-for-transfer wiki page with the current implemented instruction set.
  • Created the initial architecture for the vm: dynamic instruction loader which converts instructions to a vm representation, and then an interpreter executes every instruction.
  • Implemented some of it, for example the assemblyloader reads a file, converts some of its contents and fills the appropriate data structures.

Week 5 - (20/06 - 26/06): This week I have implemented almost all the vm, just a little but important detail remaining. Now the only thing left is the implementation of every instruction.

  • Added more error checking to the compiler: check every call to a macro without doing a second full pass.
  • Added proper handling of labels in the vm with backpatching for all the rules, macros and instructions needed (jmp, jz, jnz, addtrie and call).
  • Implemented a simple system trie.
  • Added a code-to-preload section to only add patterns to the trie once.
  • Created the interpreter which initializes dynamically a dictionary with opCode : processingMethod pairs.
  • Added preprocessing and execution capabilities including the structure for the creation of all the instruction processing methods and the vm's main loop.
  • Implemented a callstack to handle rules calling macros because we need to store the last PC and its code section.

Week 6 - (27/06 - 03/07): This week's focus was on the reading of patterns which turned out to be harder than I thought, thanks to Sergio I now know how to do it (or at least I think so!).

  • Improved/corrected some code generation like the variables problem or the modify-case hell.
  • Implemented instructions: and, or, not, cmp, cmpi, cmp-substr, cmpi-substr, push, append, jz, jnz, lu, mlu, begins-with, begins-with-ig, ends-with, ends-with-ig, modifycase.
  • Implemented LRLM of patterns to select the rules to execute, although this still needs work.
  • Implemented more feature on the system trie, like the insertion of the '|' symbol for pattern's options.

Week 7 - (04/07 - 10/07): After a week of really hard work and long hours the vm is finished. There are still two things that I need to fix (blanks, and link-to) and I need to test it thoroughly too.

  • Added more things to the compiler like the case attribute of a chunk.
  • Implemented instructions: case-of, clipsl, cliptl, storesl, storetl, pushsb, pushbl, out, getCaseFrom, clip, storecl, lu-count.
  • Implemented and added tests for proper handling of patterns in the trie:
    • Patterns which start directly with tags (should accept any lemma, e.g. <n><pl> -> should accept student<n><pl>).
    • Patterns which contain *, e.g. <n><*><sg><*><gen>.
    • Patterns starting with a lemma should accept any case variation of that lemma.
  • Implemented support for shallow transfer or advanced transfer.
  • Added support for the interchunk and postchunk stages in the vm: parsing its input, implementing its specific instructions, specific tag values like “chcontent”, etc.
  • Fixed some bugs in the vm, there is still a need to test it more though.

Week 8 - (11/07 - 17/07): This week has been focused in polishing the vm and preparing it for the evaluation.

  • Fixed all the bugs that I have found.
  • Polished it a bit more, adding some things like clip and store clip using the longest match not the first one and rewriting entire modules like the transferword.
  • Added a gdb-like debug mode for the vm (run, continue, step, breakpoints, print, info...).
  • Compiled and generated the code for the full en-ca transfer system and made some tests.
  • Done some tests/experiments with some of the things my mentor told me.
  • Updated the wiki with more explanation and examples.

Week 9 - (18/07 - 24/07): After evaluation I had to finish some things Sergio told me and experimented some other. The bad news are that the experiments weren't successful, on the other hand I implemented blanks and fixed what was different from Apertium (I spent a lot of hours debugging some corner cases).

  • Finally, implemented blanks and superblanks, although it isn't the exact same behaviour as Apertium, it has some advantages and maybe some drawbacks. We'll have to look more into it, when gsoc finishes.
  • Tried to dump the trie to a file with pickle using every protocol available and the c extension but is still too slow and too much memory. Sergio wanted to expend a weekend improving the trie’s data structure, so I’ll continue with my project for now.
  • Did some testing with full articles, fixing all discrepancies I found (mainly blanks and some corner cases like passing the pos=0 as a macro parameter, in the postchunk stage, and then using an storecl instruction).
  • Fixed the trie for patterns like "to<pr>" and "*<pr>" with input "in order to<pr>" which has to be accepted only by the second one.
  • Added the remaining features to the debugger, only remaining linking to the transfer file right now.
  • Added messages “Path to rule blocked” and tested them with existing transfer files.

Week 10 - (25/07 - 31/07): This week I ported the compiler to C++, although there are some small things to refine, it works exactly the same as the Python one and passes all the tests.

  • Fixed and refactored some small things in the Python compiler.
  • Ported the entire compiler to C++, including adapting some things to the xml parser used in Apertium which worked slightly different than the Python one.
  • Adapted the test script, test data and test expected output and passed all tests.

Week 11 - (01/08 - 07/08): This week I started porting the vm to c++, as with the compiler the start was slow but at the end I have ported several classes and more importantly have reimplemented and plan to change several components to improve their performance.

  • Solved the problem with the case policy in the postchunk.
  • Ported several classes:
    • Main program, VM, scope, loader, exceptions...
  • Added lazy-loading to the loader, so that rules and macros are only preloaded. This means that the first time, and not until then, a rule or a macro is called, the loader will convert them to the internal vm representation and process them properly.
  • I have also been thinking a lot about the design and how to improve some components like the transferword module.

Week 12 - (08/08 - 14/08): A lot of time went on redesigning the transfer words module, but as I need time to measure things and make decisions based on data and there isn’t much time left, I am going to just port the left code and when everything is finished I’ll start improving it. Only the interpreter and some vm methods left to have a fully working vm in C++.

  • Implemented the entire transferwords module changing it to only parse the content of words when strictly necessary, therefore improving performance.
  • Implemented all the systemtrie module as in the Python version.
  • Implemented the callstack module extracting a method to better suit C++.

Week 13 - (15/08 - 22/08): This week I finished porting the VM to C++, meeting the project deadline and its four deliverables requirement.

  • Implemented all the interpreter module.
  • Fixed the locale setup memory leak.
  • Some refactoring: created a class hierarchy for lexical units with a common base.
  • Ported the remaining methods in the involving pattern matching and such.
  • Fixed everything until all vm tests passed exactly as in the Python version.
  • As it was expected the C++ version is faster than the Python one, not only due to the language choice but because the Python version has been programmed more to be an easy to understand reference than a fast one. In addition, there is a lot of room for performance improvement, for example implementing the binary format for the vm.

Just to show some numbers:

Test Python version C++ version Improvement 5.372s 0.968s 5.5x 37.337s 11.702s 3.2x

Final report[edit]

This project aimed at creating a transfer system with two main components: a compiler and a virtual machine. The compiler generates a pseudo-assembly representation of a transfer rule file (.t1x, .t2x or .t3x) and, then, the stack-based virtual machine is able to read and execute the instructions generated.

The project also aimed to create two versions of this new transfer system. One which would become the foundation of an easy to modify and experiment with system and would be implemented in Python. And another one which implementation would be more focused on performance and would be implemented in C++. Also, a lot of last year's work was intended to be reused and refined to some extent, for example, I ended up making the instruction set simpler and more consistent[1].

After three months of hard work, both versions are completed and can be downloaded from their respective repositories[2][3]. In addition, a small new feature, not initially planed, was added during the development of the Python version and consists of a gdb-like debugger for the virtual machine. In hindsight, most of the difficulties of the project weren't found on the programming side but on discovering exactly how each part of the transfer system works and, although a bit frustrating at some points, it has been a rewarding experience.

Finally, regarding performance the C++ version is faster than the Python one, as was expected[4]. Unfortunately, it is too soon to compare the vm with Apertium's transfer for several reasons but, mainly, because Apertium's one has a lot of optimizations which are lacking, for now, in the virtual machine (preprocessed and binarised trie, stringCache...). Therefore, currently, Apertium's transfer is faster but I plan to continue improving and following the path this project has started to see if better performance can be achieved.