User:Asfrent/GSoC Log

From Apertium

GSoC Log

06.06.2014

  • wrote a new test suite under new_tests folder.
  • added language pairs es-ro, ro-es, en-es, es-en.
  • three types of tests:
    • normal - compare the output of xfervm with that of the XML tree-walking transfer.
    • memory - tests using valgrind --tool=memcheck.
    • performance - tests using valgrind --tool=callgrind.
  • fixed valgrind error complaining about uninitialized raw pointers.
  • initial running time of the full test suite is 8m28.204s (508 seconds).

07.06.2014

  • fixed memory leak in SystemTrie, most of the memcheck tests pass now.
  • fixed invalid memory access in ChunkWord.
  • analysed the code for other bugs, discovered issues because of rule ambiguity in XML files.
  • sped up methods of VMWstringUtils. Tests run twice as fast.

10.06.2014

  • bug hunting all day.
  • discovered issues with <modify-case> in XML rules of es-ro.
  • implemented instruction line number for debugging purposes.
  • fixed the "numbers are considered uppercase" issue.
  • after a discussion with spectie on IRC we decided to move to the next phase of the project: implementing a compressed trie data structure. The tests have to be redone; I will change the code so that it keeps the current behavior (output), buggy or not. The rationale is that most of the bugs we analysed so far were due to wrong XML rules rather than code bugs.
  • implemented the new testing strategy. All tests pass. There is still one memcheck test not passing because of wrong XML rules.
  • ran the es-ro es_1000 stage1 test under the callgrind tool to analyse performance and decide what to optimize next before merging the testing branch. The total running time under callgrind was 307 seconds.
  • ran the normal tests, took 4m12.871s (252 seconds).
  • replaced vectors with lists, used list::splice and std::move. Tests pass, the full test suite takes 3m19.230s (199 seconds).
  • merged the testing branch into master. Remaining todos in the testing framework will be addressed in a separate branch.
  • created a new branch for SystemTrie optimizations, system-trie-opt.
  • it seems that adding a string conversion function (template<typename T> T stringTo(const string&)) lowers the time by about 20 more seconds. Two tests fail because the rules.xml file is wrong; test generation gives some warnings about it. I will take this to spectie tomorrow. Changes undone.
  • minor readability improvements.
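
The move from vectors to lists noted in this entry helps when whole sequences are handed from one container to another: splicing an entire std::list relinks nodes in constant time instead of copying elements. A minimal sketch of the idea follows; `Words`, `appendAll` and `appendWord` are hypothetical names for illustration, not the actual xfervm code.

```cpp
#include <list>
#include <string>
#include <utility>

// Hypothetical alias; the real xfervm containers may hold other types.
using Words = std::list<std::wstring>;

// Moves every element of `src` to the end of `dst` without copying any
// string: splice only relinks list nodes, so `src` ends up empty.
void appendAll(Words& dst, Words& src) {
    dst.splice(dst.end(), src);
}

// Appends a single word, transferring its buffer instead of copying it.
void appendWord(Words& dst, std::wstring word) {
    dst.push_back(std::move(word));
}
```

With a vector, the equivalent of `appendAll` would have to copy or move each element individually, which is presumably where part of the measured gain came from.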

11.06.2014

  • fixed rules.xml in stage1 of language pair es-en. Regenerated new tests for the language pair, ran memcheck, all memory tests pass.
  • added stringTo method to VMWstringUtils and replaced the use of string streams everywhere. All tests pass, new time is 3m1.693s (181 seconds).
  • added clean action to tests script.
  • further analysis and readability improvements.
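
The log only gives the signature of stringTo, so the following is a sketch of one way such a helper could avoid string streams, assuming the standard std::stoi and std::stod conversions are acceptable. The set of specializations is an assumption; the real method lives in VMWstringUtils and may differ.

```cpp
#include <string>

// Primary template is only declared; supported target types are
// provided as explicit specializations below (an assumed design).
template <typename T>
T stringTo(const std::string& s);

template <>
inline int stringTo<int>(const std::string& s) {
    // std::stoi throws std::invalid_argument on non-numeric input.
    return std::stoi(s);
}

template <>
inline double stringTo<double>(const std::string& s) {
    return std::stod(s);
}
```

A call such as `stringTo<int>("42")` then converts directly, without constructing a stream object for every conversion.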

12.06.2014

  • analysed the code of SystemTrie.
  • designed a new data structure for matching.
  • set a goal to explore alternate matching using regex.

13.06.2014

  • started to write the new data structure that will be used for matching instead of SystemTrie.
  • optimised toLower / toUpper methods.
  • finished writing, tests do not pass. Sunday is the day!

15.06.2014

  • found and fixed the bug that was causing the NSystemTrie not to work properly.
  • new SystemTrie in place, further optimisations to come.
  • all tests pass, current time is 1m59.198s (119 seconds).
  • small tweak to executePush method of interpreter, time is 1m49.773s (109 seconds).
  • compiled code with clang++, same performance as g++.
  • compiled interpreter code to LLVM IR with clang++. Looks cool.
  • analysed code with valgrind, slowest parts are cliptl, clipsl and rule matching.
  • implemented test tags.
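
The log does not describe the internals of the new SystemTrie, so the following is only a generic sketch of trie-based rule matching, assuming patterns are sequences of symbols and that the longest matching pattern wins. `PatternTrie`, `insert` and `longestMatch` are hypothetical names, not the project's API.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical pattern trie: keys are sequences of symbols, values are
// rule numbers. A rule of -1 means "no pattern ends at this node".
class PatternTrie {
public:
    void insert(const std::vector<std::string>& pattern, int rule) {
        Node* node = &root_;
        for (const std::string& symbol : pattern) {
            std::unique_ptr<Node>& child = node->children[symbol];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->rule = rule;
    }

    // Returns the rule of the longest pattern matching `input` from
    // position `start`, or -1 if no pattern matches there.
    int longestMatch(const std::vector<std::string>& input,
                     std::size_t start = 0) const {
        const Node* node = &root_;
        int best = -1;
        for (std::size_t i = start; i < input.size(); ++i) {
            auto it = node->children.find(input[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->rule != -1) best = node->rule;
        }
        return best;
    }

private:
    struct Node {
        std::map<std::string, std::unique_ptr<Node>> children;
        int rule = -1;
    };
    Node root_;
};
```

Each input symbol costs one child lookup, so matching time depends on the length of the matched prefix rather than on the number of rules.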

16-17.06.2014

  • started new branch, compilation-opt
  • analyzed the interpreter and compiler code.
  • started the transition towards better string management and fewer stringTo<int> calls.
  • added more instructions, replacing some old ones with new, specialized versions of themselves.
  • work in progress, looking for the next thing to optimize.

18.06.2014

  • spent some time on string matching algorithms
  • implemented and tested KMP and Rabin-Karp; both are less efficient than the O(N^2) version in the C++ standard template library.
  • decided to go on with implementing a string pool.
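
For reference, a minimal Knuth-Morris-Pratt search of the kind compared in this entry could look like the sketch below. `kmpFind` is a hypothetical name, and this is not the code that was benchmarked. KMP pays for building a failure table on every search, which is one plausible reason a simple library search wins on short patterns.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Knuth-Morris-Pratt search: returns the index of the first occurrence
// of `pattern` in `text`, or std::string::npos if there is none.
std::size_t kmpFind(const std::string& text, const std::string& pattern) {
    if (pattern.empty()) return 0;

    // failure[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    std::vector<std::size_t> failure(pattern.size(), 0);
    for (std::size_t i = 1, k = 0; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        failure[i] = k;
    }

    // Scan the text once, never moving backwards in it.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = failure[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == pattern.size()) return i + 1 - k;
    }
    return std::string::npos;
}
```

The result agrees with `std::string::find` on the same arguments, which makes the two easy to swap when timing them against each other.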

19.06.2014

  • continued the transition to fewer stringTo<int> calls, implemented for jump instructions (JMP, JNZ, JZ).
  • started to rewrite input loading methods.
  • after a discussion with TinoDidrinksen on IRC, decided to implement a string pool and replace tags with int values everywhere.
  • current time is 1m36.319s (96 seconds).
  • removed calls to getOperands, time is 1m33.355s (93 seconds).
  • redesigned SystemStack for more flexibility and speed. Big step towards removing stringTo<int> calls. The idea is to keep int data as int on the stack whenever possible.
  • transition to true integers on SystemStack. Reimplemented all logical, jump and compare operations. Current time is 1m29.094s (89 seconds).
  • transition to true integers done on most of the code. The only remaining instruction that still operates on strings, even though its result is an integer, is lu-count. The transition is not possible there without major compiler refactoring, which is not desirable at the moment, and the speed gain would be insignificant.
  • ran tests, all pass, time is 1m24.082s (84 seconds).
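
The idea of keeping int data as int on the stack can be sketched with a tagged cell type. The version below uses std::variant (C++17) for brevity; `Cell`, `TypedStack` and `executeJz` are hypothetical names, and the real SystemStack and instruction handlers may be organised quite differently.

```cpp
#include <stack>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stack cell: either a real integer or a string.
using Cell = std::variant<int, std::wstring>;

class TypedStack {
public:
    void pushInt(int v) { cells_.push(Cell(v)); }
    void pushString(std::wstring s) { cells_.push(Cell(std::move(s))); }

    // Pops the top cell as an int (the stack must not be empty).
    // Integers come back without any conversion; a string is parsed
    // only if a string was actually pushed.
    int popInt() {
        Cell top = std::move(cells_.top());
        cells_.pop();
        if (const int* v = std::get_if<int>(&top)) return *v;
        return std::stoi(std::get<std::wstring>(top));
    }

    bool empty() const { return cells_.empty(); }

private:
    std::stack<Cell> cells_;
};

// Sketch of a jump-if-zero instruction: returns the next program counter.
int executeJz(TypedStack& stack, int pc, int target) {
    return stack.popInt() == 0 ? target : pc + 1;
}
```

When comparison results are pushed with `pushInt`, a following jump never touches a string, which is the saving the entry describes.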

22-23.06.2014

  • slight optimisation of tokenizeInput.
  • implemented a StringPool.
  • implemented a ListPool.
  • refactored the interpreter and compiler code: optimisation using StringPool and ListPool.
  • StringPool optimisations WIP.
  • recompiled rules.vm, ran tests, all pass, time is 1m4.778s (64 seconds).
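
A string pool of the kind mentioned in this entry is commonly built as an interning table: each distinct string is stored once and referred to by a small integer afterwards. The sketch below illustrates that general technique under the name `StringPoolSketch`; it is an assumption about the approach, not the project's StringPool class.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interning pool: maps each distinct string to a stable id.
class StringPoolSketch {
public:
    // Returns the id of `s`, adding it to the pool the first time it is seen.
    int intern(const std::wstring& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(strings_.size());
        strings_.push_back(s);
        ids_.emplace(s, id);
        return id;
    }

    // Looks up the string for an id; throws std::out_of_range if unknown.
    const std::wstring& get(int id) const {
        return strings_.at(static_cast<std::size_t>(id));
    }

    std::size_t size() const { return strings_.size(); }

private:
    std::unordered_map<std::wstring, int> ids_;
    std::vector<std::wstring> strings_;
};
```

Once tags are interned, comparing two tags is an integer comparison rather than a string comparison, and the string itself is only needed again when producing output.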

24-25.06.2014

  • met with Francis in London, woohoo.
  • small changes
  • mid term target achieved.