User:Mikel/Embeddable lttoolbox-java: Progress

Revision as of 14:19, 22 June 2012 by Mikel

This is the place where I will try to summarize the progress that I am making on my project. I have divided it into two sections: in the first one I explain the development work that I am doing and, in the second one, I record the work that I have done week by week, which can be compared with the original plan. I will try to keep all of this updated.

Development work

I have been working on quite different things, so I have organized this section into several points:

lttoolbox-java embeddability

This is the central part of the whole project. The aim is to make lttoolbox-java easy to integrate into larger Java projects, as well as usable as a standalone Java program for end users (i.e. as self-contained Jar files that can run without any external dependency or requirement other than a JVM).

All of this development revolves around a single API class, org.apertium.Translator, which acts as an abstraction layer over the whole of lttoolbox-java and offers a simple way to interact with it. At the same time, lttoolbox-java has been extended so that it can work with embedded files (that is, resources contained in a Zip file or in the Jar itself), and this new API class can deal with the new concept of language pair packages that arises from it. A language pair package is basically a Zip file that contains all the files/resources needed for translating one or several language pairs. A Jar itself can act as a language pair package and include all the resources required for translation as well as lttoolbox-java itself, which constitutes the above-mentioned self-contained Jar aimed at end users.
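As a rough illustration of the package idea, the sketch below lists the modes found in a Zip file using only the standard java.util.zip API. It is not lttoolbox-java code: the class PackageListing, the entry names and the assumption that each mode is stored as an entry named "<mode>.mode" are all hypothetical and chosen only for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of lttoolbox-java.
class PackageListing {

    // Assumption: every mode is stored as an entry whose name ends in "<mode>.mode".
    static List<String> listModes(Path zipPackage) {
        List<String> modes = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPackage.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".mode")) {
                    // Drop any directory prefix and the ".mode" suffix.
                    String file = name.substring(name.lastIndexOf('/') + 1);
                    modes.add(file.substring(0, file.length() - ".mode".length()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(modes);
        return modes;
    }

    // Builds a tiny throwaway package so that the sketch can be run on its own.
    static Path createDemoPackage() {
        String[] names = {"modes/eo-en.mode", "modes/es-en.mode", "data/eo-en.automorf.bin"};
        try {
            Path zip = Files.createTempFile("demo-pair", ".zip");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (String name : names) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write("placeholder".getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listModes(createDemoPackage()));
    }
}
```

Because a Jar is also a Zip archive, the same listing would work on a self-contained Jar, under the same layout assumption.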

How does it actually work?

To answer this question, these are the basic steps to use the Translator API class:

  • The setBase method of Translator is first used to specify the "base file" for translation. A base file can be either a path to a language pair package (a Zip or Jar file that contains all the resources required for translation), a path to a .mode file, or a ClassLoader that can load all the resources required for translation. In the case of self-contained Jar files, it is possible to set the Jar itself as the base file by calling the setJarAsBase method.
  • Once the base file has been established, the available modes can be obtained by calling the getAvailableModes method, which returns a String array of the mode names (for instance, ["eo-en", "eu-en", "es-en"]). To obtain the full name of a mode (for instance, "Esperanto-English" instead of "eo-en"), the getTitleForMode method can be used, which takes a mode name as argument and returns the corresponding title.
  • Once we know the available modes, the setMode method of Translator has to be used to specify the mode to use for translation. For instance, setMode("eo-en") tells lttoolbox-java that we want to translate from Esperanto to English.
  • Once both the base and the mode have been established, the translate method of Translator can be used to finally translate a text. This method takes the source text as a String argument and returns its translation as a String.

Of course, it is not necessary to repeat all those steps for subsequent translations. It is possible (and logical) to perform several translations one after the other without changing the base or the mode, or changing the mode without changing the base, for instance.
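The call order described in the steps above can be sketched as follows. To keep the example runnable without lttoolbox-java on the classpath, it uses a stand-in class that only mimics the method names mentioned in the text. The stand-in is hypothetical: it is not org.apertium.Translator, its hard-coded modes and placeholder "translation" are invented, the package path is made up, and whether the real methods are static or instance methods is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in that only mimics the method names described above.
// It is NOT org.apertium.Translator; everything inside is a placeholder.
class TranslatorStandIn {
    private final Map<String, String> titles = new LinkedHashMap<>();
    private String base;
    private String mode;

    void setBase(String pathToPackage) {
        base = pathToPackage;
        // A real implementation would read the modes from the package.
        titles.clear();
        titles.put("eo-en", "Esperanto-English");
        titles.put("es-en", "Spanish-English");
    }

    String[] getAvailableModes() {
        requireBase();
        return titles.keySet().toArray(new String[0]);
    }

    String getTitleForMode(String modeName) {
        requireBase();
        return titles.get(modeName);
    }

    void setMode(String modeName) {
        requireBase();
        if (!titles.containsKey(modeName)) {
            throw new IllegalArgumentException("Unknown mode: " + modeName);
        }
        mode = modeName;
    }

    String translate(String text) {
        if (base == null || mode == null) {
            throw new IllegalStateException("Set the base and the mode first");
        }
        return "[" + mode + "] " + text; // placeholder, not a real translation
    }

    private void requireBase() {
        if (base == null) {
            throw new IllegalStateException("Set the base first");
        }
    }

    public static void main(String[] args) {
        TranslatorStandIn translator = new TranslatorStandIn();
        translator.setBase("apertium-eo-en.zip"); // hypothetical package path
        for (String m : translator.getAvailableModes()) {
            System.out.println(m + " = " + translator.getTitleForMode(m));
        }
        translator.setMode("eo-en");
        System.out.println(translator.translate("Saluton"));
        System.out.println(translator.translate("Dankon")); // base and mode are reused
    }
}
```

The second translate call reuses the base and mode that were already set, which is the usual case when several texts are translated in a row.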

Some more features

Apart from that, I have also worked on some features that extend and improve this basic behaviour:

  • lttoolbox-java itself can deal with language pair packages when used from the command line by simply indicating the path to the package after the -d flag. If we want to use the resources included in the Jar itself, it is not necessary to indicate any path.
  • A simple user-oriented GUI class, org.apertium.ApertiumGUI, has been developed on top of the Translator API class. It is loaded by default if no argument is provided to lttoolbox-java and it uses the resources in the Jar itself. This makes sense for the above-mentioned self-contained Jars aimed at end users, since all they have to do is double-click them and use the intuitive GUI, with no dependency or requirement other than a JVM.
  • All this (and, in particular, this sample GUI) has been tested to work under Java Web Start as long as all-permissions is set to true.
  • The API class can be used directly in Android as well, as long as a DexClassLoader is set as the base. For this, I developed a sample Android app as an example for other possible apps. Arink has also adopted it for his project and is working on a much more complete app.

Caching

This feature deserves a subsection of its own:

Caching has been implemented as an optional feature to improve performance. Basically, each time a resource (typically, a .bin file) is loaded, it can be kept in memory so that it does not need to be reloaded in subsequent translations. The Translator API class manages all this by itself: you simply have to enable or disable caching by calling its setCacheEnabled(boolean enabled) method. What's more, if caching is enabled, the API class loads all those resources in the background so that even the first translation uses the cached resources (and, if it is still loading the resources when asked for a translation, it naturally waits for the loading to finish first).
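One way to get this behaviour (keep loaded resources in memory, preload them in the background, and wait if a load is still running) is sketched below with standard java.util.concurrent classes. This is only an assumption about how such a cache could be built, not the actual lttoolbox-java implementation: the ResourceCache class, the loader function and the resource name are hypothetical, and clearing the cache when caching is disabled is a design choice of the sketch.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache, not the actual lttoolbox-java implementation.
class ResourceCache {
    private final Map<String, CompletableFuture<byte[]>> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> loader;
    private volatile boolean enabled = true;

    ResourceCache(Function<String, byte[]> loader) {
        this.loader = loader;
    }

    void setCacheEnabled(boolean enabled) {
        this.enabled = enabled;
        if (!enabled) {
            cache.clear(); // assumption of this sketch: free the extra memory
        }
    }

    // Starts loading in the background so that the first request can already hit the cache.
    void preload(String... names) {
        if (!enabled) {
            return;
        }
        for (String name : names) {
            cache.computeIfAbsent(name, n -> CompletableFuture.supplyAsync(() -> loader.apply(n)));
        }
    }

    // Returns the resource, waiting if a background load is still in progress.
    byte[] get(String name) {
        if (!enabled) {
            return loader.apply(name); // no caching: load on every request
        }
        return cache
            .computeIfAbsent(name, n -> CompletableFuture.supplyAsync(() -> loader.apply(n)))
            .join();
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        ResourceCache cache = new ResourceCache(name -> {
            loads.incrementAndGet(); // stands in for reading a .bin file from disk
            return name.getBytes();
        });
        cache.preload("eo-en.automorf.bin"); // hypothetical resource name
        cache.get("eo-en.automorf.bin");
        cache.get("eo-en.automorf.bin");
        System.out.println("loads with caching: " + loads.get());
    }
}
```

Storing a future rather than the loaded bytes is what lets a request that arrives during preloading wait for the same load instead of starting a second one.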

In terms of real performance, the improvement is notable. Thanks to caching, we save all the I/O operations required to load the resources for translation, which makes a really big difference for translating small texts, making them practically instantaneous. In my tests, translating a single sentence took about 500-1000 ms without caching (the times strongly depend on the language pair), and less than 20 ms with caching.

The only drawback that I have found is that caching consumes some extra memory, about 10-20 MB in my tests.

Language pair package maintenance

What makes lttoolbox-java really embeddable, as described in the previous point, are the language pair packages. We should therefore provide ready-to-use packages to end users, since none of this would make much sense if they had to create the packages themselves. It is thus necessary (or, at least, desirable) for us (us = the Apertium community) to have an easy way to create these packages and keep them updated somewhere. For this purpose, I have worked on two bash scripts:

  • apertium-pack takes the path to one or more mode files as argument and automatically creates ready-to-use packages for them. Currently, three packages are generated: a self-contained one that includes lttoolbox-java, another one that doesn't include lttoolbox-java, and a last one that uses Dalvik bytecode (for Android). The script requires lttoolbox-java (the one in my branch) and android-sdk to be installed, and their locations have to be specified by setting the LTTOOLBOX_JAVA_PATH and ANDROID_SDK_PATH environment variables or by editing the script to make the changes permanent (instructions inside).
  • apertium-upload takes the path to one or more language pair packages as argument and uploads them to the correct directory in SVN (right now, inside my branch).

As an idea, both scripts could be integrated into the makefiles of each language pair so that a simple "make-upload" would automatically create the appropriate packages and upload them to SVN.

NOTE: We are still discussing some details; this is definitely still under development and major changes can be expected.

Integrating apertium-viewer with lttoolbox-java

I have worked on integrating lttoolbox-java with apertium-viewer. You might be asking: what for? Doesn't apertium-viewer work correctly without it? Well, it does, but the integration brings some advantages. First of all, we no longer depend on external programs like the C++ version of Apertium. At the same time, thanks to the caching feature described previously, performance improves considerably. Apart from that, it opens the door to using language pair packages with apertium-viewer as well, and it would even be possible to use online packages directly with it.

External processing is still optional in the current implementation (which can be found, as usual, at http://javabog.dk:8080/apertium-viewer/launch.jnlp): it is disabled by default, but it can be activated in the app preferences. Also, internal processing requires a JDK and lttoolbox-java to be installed on the user's machine (apertium-viewer will try to switch to external processing if they are not found). I am currently working on direct generation of bytecode for transfer, which would solve these two dependency problems (see the next point). So, once again, this is still under development, and major changes can be expected.

Direct generation of bytecode for transfer

As you probably know, lttoolbox-java uses bytecode for transfer. To generate the appropriate classes, lttoolbox-java currently generates Java source code for each transfer file, which is then compiled into bytecode. For this to work, a Java compiler (and, presumably, a JDK) must be installed on the user's machine, and lttoolbox-java must also be installed in a well-known directory (because the compiler needs some of its classes to carry out the compilation). End users are unlikely to satisfy these requirements (in particular, I guess that very few will have a JDK installed), which is problematic if they need to generate the bytecode for transfer. This is, for instance, the problem that we currently have with apertium-viewer, but we could have similar problems in other contexts as well.
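The dependency on a compiler can be seen in a minimal sketch of the generate-source-then-compile approach, written here with the standard javax.tools API. It is only an illustration under stated assumptions: the CompileSketch class and the toy GeneratedTransfer class are hypothetical and have nothing to do with real transfer code, and the source does not say which compiler interface lttoolbox-java actually uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical illustration; the generated class is a toy, not real transfer code.
class CompileSketch {

    // Returns an empty Optional when no Java compiler is available at run time.
    static Optional<String> generateCompileAndRun(String input) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            return Optional.empty(); // JRE without a compiler: nothing can be generated
        }
        try {
            // 1. Generate Java source code.
            Path dir = Files.createTempDirectory("transfer");
            Path source = dir.resolve("GeneratedTransfer.java");
            String code =
                "public class GeneratedTransfer {\n"
              + "    public static String apply(String s) { return \"transferred: \" + s; }\n"
              + "}\n";
            Files.write(source, code.getBytes(StandardCharsets.UTF_8));

            // 2. Compile it into bytecode (this is the step that needs a JDK).
            int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
            if (status != 0) {
                throw new IllegalStateException("Compilation failed");
            }

            // 3. Load the compiled class and call it.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Class<?> generated = loader.loadClass("GeneratedTransfer");
                Object result = generated.getMethod("apply", String.class).invoke(null, input);
                return Optional.of((String) result);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(generateCompileAndRun("hola").orElse("no Java compiler available"));
    }
}
```

On a machine with only a JRE, ToolProvider.getSystemJavaCompiler() returns null and step 2 cannot run, which is the situation end users are likely to be in.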

As a solution, I am currently working on the direct generation of bytecode for transfer. The basic idea is that lttoolbox-java would directly create the bytecode by itself without relying on an external compiler for this task. For this purpose, I am using the BCEL API. Things are looking good so far, but I am having quite a lot of problems and it will take me some more time to finish it.

Timeline

This is, more or less, the timeline that I have followed. You can compare it with the original plan.

- Week 1-2: General work on lttoolbox-java embeddability, which includes adapting lttoolbox-java so that it can work directly with embedded files, making an API class to easily interact with it, and developing a sample Android app and a simple Swing GUI class on top of it. At the same time, develop a bash script to automatically create ready-to-use language pair packages.

- Week 2-3: Implement caching to improve performance.

- Week 3-4: Integrate lttoolbox-java with apertium-viewer. Develop a bash script to automatically upload language pair packages to SVN.

- Week 4-5: Work (still in progress) on adopting the BCEL API to directly generate the bytecode for transfer.