Difference between revisions of "User:Ggregori"

From Apertium
Revision as of 11:29, 2 August 2011

About me

Name: Gabriel Gregori Manzano

Email/Google chat: Email me

IRC nick: ggregori

GSoC 2011

VM for the transfer module - Application

Github repository (python): [1]

Github repository (c++): [2]

TODO list

  • Find out if there are rules interfering with how the case policy is applied in the postchunk.
  • Start porting the VM to C++.
  • Pass all the tests of the Python version to make sure it behaves likewise.

Weekly reports

Community Bonding Period

Week 1 - (25/04 - 01/05): This week was mostly dedicated to researching and reviewing some topics (some of them suggested by my mentor).

  • I have been reviewing NLP and Python using the book 'Natural Language Processing with Python'.
  • I have been looking for a way to represent morphological labels in UCS/UTF and my mentor suggested using negative numbers as in Apertium internals. Anyway, I can worry about this later.
  • Using UTF with Python: 'codecs' and 'unicodedata' can be some useful modules.
  • Testing the option 'lt-proc' -b which is going to be the input of my compiler.

Week 2 - (02/05 - 08/05): This week I finished all the review/research needed, although I couldn't do everything I wanted because I had to travel.

  • Finished the introductory book on NLP and Python.
  • Started designing and redefining the compiler's architecture following last year's work, and selected and tested some modules. Some of the changes or improvements:
    • Use of pipes/command-line arguments for the input of the compiler (like the rest of Apertium).
    • Configurable logging module for info and debugging purposes (module: logging).
    • Refactoring some methods in the expatparser class (e.g. extracting common code of the callback method).
    • Create some additional classes in order to add some flexibility (e.g. parent class parser with the common code).
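
A minimal sketch of how command-line arguments and the logging module could be combined, as described in the list above. The option names (`--input`, `--debug`), the logger name and the helper functions are hypothetical and not taken from the project; only `argparse` and `logging` from the standard library are used.

```python
import argparse
import logging
import sys


def parse_args(argv):
    """Parse the (hypothetical) command-line options of the sketch."""
    parser = argparse.ArgumentParser(description="Toy compiler front end")
    parser.add_argument("-i", "--input", default="-",
                        help="input file, '-' means standard input")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="enable debug logging")
    return parser.parse_args(argv)


def configure_logging(debug):
    """Return a logger writing to stderr so stdout stays free for pipes."""
    logger = logging.getLogger("compiler-sketch")
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    if not logger.handlers:  # avoid adding the handler twice
        logger.addHandler(logging.StreamHandler(sys.stderr))
    return logger


args = parse_args(["--debug"])
logger = configure_logging(args.debug)
logger.debug("reading rules from %s", args.input)
```

Sending log messages to stderr keeps the standard output clean, so the program can still be chained with pipes.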

Week 3 - (09/05 - 15/05): This week I had to redo some work because of the switch to Python 3, so I didn't accomplish what I wanted. Anyway, there are two weeks of university classes remaining until I can focus exclusively on this project.

  • Switched to Python 3, reasons:
    • I hope to get better UTF-8 support among other things.
    • Had to test if the modules I use were fully available/compatible in Python 3.
    • Had to read and research (again...) about str/bytes and std{in,out}.buffer and, in general, everything related to Unicode, UTF-8...
  • Started implementing the very basics of the compiler’s architecture:
    • Command-line arguments and help, input and output, logging...
  • Another thing I realized this week is that a lot of last week's thinking about making a flexible prototype that is easy to modify in the future doesn’t really apply to Python. For example, my design involved creating interfaces/abstract classes in order to be able to easily change components, but that isn’t needed in Python. In conclusion: duck-typing, although I will need my design in the C++ version.
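
To illustrate the str/bytes point from this week, here is a minimal sketch of decoding a binary stream as UTF-8 in Python 3. The function name `read_utf8` is hypothetical and not taken from the project, and `io.BytesIO` stands in for `sys.stdin.buffer` so the example is self-contained.

```python
import io


def read_utf8(binary_stream):
    """Decode a binary stream (e.g. sys.stdin.buffer) as UTF-8 text."""
    # TextIOWrapper turns a stream of bytes into a stream of str.
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    return text_stream.read()


# io.BytesIO plays the role of sys.stdin.buffer in this sketch.
sample = io.BytesIO("^niño<n><m><sg>$".encode("utf-8"))
text = read_utf8(sample)
```

In a real program the call would be `read_utf8(sys.stdin.buffer)`, which avoids depending on the locale's default encoding.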

Coding Period

Week 1 - (16/05 - 29/05): These last days have been impossible with university work: just this week I had about 4 class projects and 2 exams... On Tuesday next week I will finish everything and will be able to focus completely on my project.

Week 2 - (30/05 - 05/06): Finally I can focus completely on my project, and this week I have made a lot of progress on the compiler:

  • Finished the structure of the project, now I am ready to start generating code from the transfer rules.
  • Created the Github repository where I will submit my work (link is at the top).
  • Implemented all the handling of the sections: def-cats, def-attrs, def-vars, def-lists and def-macros.
  • Created some test macros with the desired output in pseudo-assembly.
  • Implemented the generation of code for some elements: <not>, <equal>, <b>, <lit>, ...
  • Improved some of the code, creating a SymbolTable, separating debugging output and actual output etc.

Week 3 - (06/06 - 12/06): This week I finished my compiler, which is now able to generate pseudo-assembly for every element of the transfer rules files.

  • Added the ability to store some attributes like the number of children of an event, its parent etc.
  • Created more tests for macro code generation and added new ones for rules and t2x/t3x files.
  • Added some necessary instructions and changed some others to maintain coherence.
  • Added code generation for all the remaining elements and their attributes: <when>, <test>, <var>, <let>, <lit-tag>, <clip>, <choose>, <otherwise>, <equal caseless=yes>, <and>, <or>, <in>, <list>, <get-case-from>, <concat>, <append>, <modify-case>, <case-of>, <begins-with>, <begins-with-list>, <ends-with>, <ends-with-list>, <contains-substring>, <rule>, <pattern>, <pattern-item>, <action>, <lu>, <mlu>, <tags>, <chunk>, <call-macro>, <with-param>, <interchunk>, <postchunk>, <lu-count>.
  • Created error detection and reporting for the input transfer rules files.

Week 4 - (13/06 - 19/06): I've spent most of the week thinking about and designing the vm's architecture, and I have started implementing it:

  • Updated the vm-for-transfer wiki page with the current implemented instruction set.
  • Created the initial architecture for the vm: dynamic instruction loader which converts instructions to a vm representation, and then an interpreter executes every instruction.
  • Implemented some of it; for example, the assemblyloader reads a file, converts some of its contents and fills the appropriate data structures.

Week 5 - (20/06 - 26/06): This week I have implemented almost all of the vm, with just one small but important detail remaining. Now the only thing left is the implementation of every instruction.

  • Added more error checking to the compiler: check every call to a macro without doing a second full pass.
  • Added proper handling of labels in the vm with backpatching for all the rules, macros and instructions needed (jmp, jz, jnz, addtrie and call).
  • Implemented a simple system trie.
  • Added a code-to-preload section to only add patterns to the trie once.
  • Created the interpreter, which dynamically initializes a dictionary with opCode : processingMethod pairs.
  • Added preprocessing and execution capabilities including the structure for the creation of all the instruction processing methods and the vm's main loop.
  • Implemented a callstack to handle rules calling macros because we need to store the last PC and its code section.
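
The dispatch dictionary and the label backpatching mentioned above can be sketched as follows. This is a toy illustration, not the project's code: the class `TinyVM`, the `op_` naming convention, the text format of the program and the exact semantics of `jz` (pop a value and jump if it is zero) are all assumptions made for the example.

```python
class TinyVM:
    """Toy interpreter: opcode -> method dispatch plus label backpatching."""

    def __init__(self):
        self.stack = []
        self.pc = 0
        # Build the dispatch table dynamically from methods named op_<opcode>.
        self.dispatch = {
            name[3:]: getattr(self, name)
            for name in dir(self) if name.startswith("op_")
        }

    def load(self, lines):
        """Convert text lines to [opcode, operand] pairs and resolve labels."""
        code, labels, pending = [], {}, []
        for line in lines:
            line = line.strip()
            if line.endswith(":"):
                labels[line[:-1]] = len(code)  # label -> next instruction
            else:
                op, _, arg = line.partition(" ")
                if op in ("jmp", "jz"):
                    pending.append(len(code))  # operand to patch later
                code.append([op, arg])
        # Backpatching: replace label names with instruction addresses.
        for addr in pending:
            code[addr][1] = labels[code[addr][1]]
        return code

    def run(self, code):
        """Main loop: fetch, advance the PC, then dispatch on the opcode."""
        self.pc = 0
        while self.pc < len(code):
            op, arg = code[self.pc]
            self.pc += 1
            self.dispatch[op](arg)
        return self.stack

    def op_push(self, arg):
        self.stack.append(int(arg))

    def op_jmp(self, arg):
        self.pc = arg

    def op_jz(self, arg):
        # Assumed semantics: pop the top of the stack, jump if it is zero.
        if self.stack.pop() == 0:
            self.pc = arg


vm = TinyVM()
program = vm.load(["push 0", "jz skip", "push 1", "skip:", "push 2"])
result = vm.run(program)
```

Because labels are resolved after the whole program is read, a jump can refer to a label defined later in the file without a second pass over the source.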

Week 6 - (27/06 - 03/07): This week's focus was on the reading of patterns, which turned out to be harder than I thought. Thanks to Sergio I now know how to do it (or at least I think so!).

  • Improved/corrected some code generation like the variables problem or the modify-case hell.
  • Implemented instructions: and, or, not, cmp, cmpi, cmp-substr, cmpi-substr, push, append, jz, jnz, lu, mlu, begins-with, begins-with-ig, ends-with, ends-with-ig, modifycase.
  • Implemented LRLM of patterns to select the rules to execute, although this still needs work.
  • Implemented more features in the system trie, like the insertion of the '|' symbol for pattern options.
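
As a rough illustration of LRLM (left-to-right, longest match) over a trie, the following sketch selects rules for a sequence of category names. It is a simplification made for this page: the nested-dict trie, the `#rule` marker and the rule names are hypothetical, and the real vm matches full lexical forms and also handles options with '|'.

```python
def build_trie(patterns):
    """Build a nested-dict trie; '#rule' marks the rule ending at a node."""
    root = {}
    for rule, pattern in patterns.items():
        node = root
        for item in pattern:
            node = node.setdefault(item, {})
        node["#rule"] = rule
    return root


def lrlm(trie, words):
    """Scan left to right, always applying the longest matching pattern."""
    matches, i = [], 0
    while i < len(words):
        node, best = trie, None
        for j in range(i, len(words)):
            node = node.get(words[j])
            if node is None:
                break
            if "#rule" in node:
                best = (node["#rule"], j + 1)  # longest match so far
        if best is None:
            i += 1  # no pattern starts here: skip one word
        else:
            matches.append((best[0], words[i:best[1]]))
            i = best[1]  # continue right after the matched words
    return matches


patterns = {
    "det-noun": ["det", "n"],
    "det-adj-noun": ["det", "adj", "n"],
    "noun": ["n"],
}
trie = build_trie(patterns)
result = lrlm(trie, ["det", "adj", "n", "vb", "n"])
```

Here the first three words are consumed by the longest pattern (`det-adj-noun`), the unmatched `vb` is skipped, and the final `n` is matched by the `noun` rule.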

Week 7 - (04/07 - 10/07): After a week of really hard work and long hours, the vm is finished. There are still two things that I need to fix (blanks and link-to), and I need to test it thoroughly too.

  • Added more things to the compiler like the case attribute of a chunk.
  • Implemented instructions: case-of, clipsl, cliptl, storesl, storetl, pushsb, pushbl, out, getCaseFrom, clip, storecl, lu-count.
  • Implemented and added tests for proper handling of patterns in the trie:
    • Patterns which start directly with tags (should accept any lemma, e.g. <n><pl> -> should accept student<n><pl>).
    • Patterns which contain *, e.g. <n><*><sg><*><gen>.
    • Patterns starting with a lemma should accept any case variation of that lemma.
  • Implemented support for shallow transfer or advanced transfer.
  • Added support for the interchunk and postchunk stages in the vm: parsing their input, implementing their specific instructions, specific tag values like “chcontent”, etc.
  • Fixed some bugs in the vm, although it still needs more testing.

Week 8 - (11/07 - 17/07): This week has been focused on polishing the vm and preparing it for the evaluation.

  • Fixed all the bugs that I have found.
  • Polished it a bit more, for example making clip and store clip use the longest match instead of the first one, and rewriting entire modules like the transferword.
  • Added a gdb-like debug mode for the vm (run, continue, step, breakpoints, print, info...).
  • Compiled and generated the code for the full en-ca transfer system and made some tests.
  • Did some tests/experiments with some of the things my mentor told me about.
  • Updated the wiki with more explanation and examples.

Week 9 - (18/07 - 24/07): After the evaluation I had to finish some things Sergio told me about and experimented with some others. The bad news is that the experiments weren't successful; on the other hand, I implemented blanks and fixed what was different from Apertium (I spent a lot of hours debugging some corner cases).

  • Finally implemented blanks and superblanks. The behaviour isn't exactly the same as Apertium's; it has some advantages and maybe some drawbacks. We'll have to look into it further when GSoC finishes.
  • Tried to dump the trie to a file with pickle, using every protocol available and the C extension, but it is still too slow and uses too much memory. Sergio wanted to spend a weekend improving the trie’s data structure, so I’ll continue with my project for now.
  • Did some testing with full articles, fixing all the discrepancies I found (mainly blanks and some corner cases like passing pos=0 as a macro parameter in the postchunk stage and then using a storecl instruction).
  • Fixed the trie for patterns like "to<pr>" and "*<pr>" with input "in order to<pr>" which has to be accepted only by the second one.
  • Added the remaining features to the debugger; the only thing remaining right now is linking to the transfer file.
  • Added messages “Path to rule blocked” and tested them with existing transfer files.

Week 10 - (25/07 - 31/07): This week I ported the compiler to C++. Although there are some small things to refine, it works exactly the same as the Python one and passes all the tests.

  • Fixed and refactored some small things in the Python compiler.
  • Ported the entire compiler to C++, including adapting some things to the XML parser used in Apertium, which worked slightly differently from the Python one.
  • Adapted the test script, test data and test expected output and passed all tests.

Week 11 - (01/08 - 07/08):