Difference between revisions of "Bytecode for transfer"


Revision as of 00:24, 28 February 2010

Currently, transfer is the bottleneck in Apertium: it accounts for 95% of the CPU time. This is because the transfer file is interpreted (by walking the XML tree of the t1x transfer file) instead of being compiled into machine code.

The Java transfer bytecode compiler converts arbitrarily complex transfer files into Java source code, which is then compiled into platform-independent bytecode.

During transfer, the Java Virtual Machine converts the most frequently used parts (the 'hot spots') into machine code.

This enables

  • Faster transfer of a corpus (currently by a factor of about 5)
  • Debuggable transfer (using a Java development tool, for example NetBeans, you can step through the transfer code to see exactly what is happening)
  • Validation of transfer files
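
To illustrate the difference between interpreting a transfer rule and compiling it, here is a minimal sketch. It is not the code produced by the bytecode compiler and does not use the lttoolbox-java classes: the class TransferSketch, the Node interface and the tag value "pl" are hypothetical, and a word is reduced to a map from attribute names (such as a_nbr) to values. The first variant walks a small tree of nodes, much like an interpreter walks the XML of a t1x file; the second expresses the same test directly in Java, which is the kind of code the JIT compiler can turn into machine code.

```java
import java.util.Map;

// Hypothetical, simplified illustration; not the actual lttoolbox-java API.
class TransferSketch {
    // One node of a toy rule tree, standing in for an XML element of a t1x file.
    interface Node {
        String eval(Map<String, String> word);
    }

    // Toy counterparts of <clip>, a literal tag and <equal>.
    static Node clip(String part) {
        return word -> word.getOrDefault(part, "");
    }

    static Node lit(String value) {
        return word -> value;
    }

    static Node equal(Node a, Node b) {
        return word -> a.eval(word).equals(b.eval(word)) ? "true" : "false";
    }

    // Interpreted style: the test is a data structure that is walked for every word.
    static boolean interpretedIsPlural(Map<String, String> word) {
        Node test = equal(clip("a_nbr"), lit("pl"));
        return test.eval(word).equals("true");
    }

    // Compiled style: the same test written as plain Java code.
    static boolean compiledIsPlural(Map<String, String> word) {
        return "pl".equals(word.get("a_nbr"));
    }
}
```

Both methods give the same answer for a given word; the compiled variant simply avoids the indirection of the node tree. Because the plain Java form is ordinary source code, it can also be stepped through in a debugger.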

A concrete example: Esperanto-English

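Take a look at apertium-eo-en.eo-en.t1x (http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox-java/testdata/transfer/apertium-eo-en.eo-en.t1x?view=markup) and apertium_eo_en_eo_en_t1x.java (http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox-java/src/org/apertium/transfer/generated/apertium_eo_en_eo_en_t1x.java?view=markup), the same file converted into Java. The Java version is compiled into bytecode and executed by the JVM, whose JIT (just-in-time) compiler converts it into machine code at run time.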

Parsing /home/j/esperanto/apertium-svn/apertium/trunk/lttoolbox-java/testdata/transfer/apertium-eo-en.eo-en.t1x
// WARNING: Attribute a_np_acr is not defined. Valid attributes are: [a_nom, a_prp, a_adv, a_adj, a_vrb, a_vrb2, a_det, a_ord, a_prn, a_tns, a_nepersonaj_tempoj, a_gen, a_prs, a_nbr, a_cas, lem, lemq, lemh, whole, tags, chname, chcontent, content]
// Replacing with error_UNKNOWN_ATTR - for <transfer default="chunk">/<section-def-macros>/<def-macro n="firstWord" npar="1">/<choose>/<when>/<test>/<equal>/<clip part="a_np_acr" pos="1" side="sl">
Compiling: javac -cp dist/lttoolbox.jar transfertest/res/lttoolbox-java/testdata/transfer/apertium_eo_en_eo_en_t1x.java

Here is a speed comparison:

Interpreted transfer took 91.59 secs
bytecode compiled transfer took 15.88 secs
Speedup factor: 5.76
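
The speedup factor is simply the ratio of the two running times. As a small worked example (the class name SpeedupSketch is made up for illustration), the rounded times shown above give a ratio of roughly 5.77. The reported 5.76 presumably comes from unrounded timings or from truncation; that is an assumption, as the source does not say.

```java
// Hypothetical helper for illustration; not part of lttoolbox-java.
class SpeedupSketch {
    // Ratio of interpreted running time to compiled running time.
    static double speedup(double interpretedSecs, double compiledSecs) {
        return interpretedSecs / compiledSecs;
    }

    public static void main(String[] args) {
        // Times taken from the comparison above.
        System.out.printf("Speedup factor: %.2f%n", speedup(91.59, 15.88));
    }
}
```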

Using it in a language pair

Add an entry to modes.xml in which you replace "apertium-transfer" with "apertium-transfer-j" and use the .class file instead of the .t1x file.

For example, replace

     <program name="apertium-transfer">
       <file name="apertium-eo-en.eo-en.t1x"/>
       <file name="eo-en.t1x.bin"/>
       <file name="eo-en.autobil.bin"/>
     </program>

with

     <program name="apertium-transfer-j">
       <file name="apertium_eo_en_eo_en_t1x.class"/>
       <file name="eo-en.t1x.bin"/>
       <file name="eo-en.autobil.bin"/>
     </program>

Also add apertium-preprocess-transfer-bytecode-j to Makefile.am, or run it manually:

$ apertium-preprocess-transfer-bytecode-j apertium-eo-en.eo-en.t1x apertium_eo_en_eo_en_t1x.class
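
In the example above, the name of the .class file looks like the name of the .t1x file with every '-' and '.' replaced by '_'. The sketch below reproduces that pattern. It is inferred from this single example only; the rule the tool actually applies may differ, and the class ClassNameSketch is hypothetical.

```java
// Hypothetical helper; the naming rule is inferred from one example, not from the tool.
class ClassNameSketch {
    // Replace every character that is not a letter, digit or underscore with '_',
    // then append the .class extension.
    static String classFileFor(String t1xFileName) {
        return t1xFileName.replaceAll("[^A-Za-z0-9_]", "_") + ".class";
    }

    public static void main(String[] args) {
        System.out.println(classFileFor("apertium-eo-en.eo-en.t1x"));
    }
}
```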


Further work

  • The Java code has not been optimized for speed, so the real potential speedup is perhaps a factor of 6-8, or even higher if a mixed mode is used (mixing C and Java code instead of pure Java).
  • Memory usage is also higher than really needed. I.a.
  • The underlying library, lttoolbox-java, uses 50% of the CPU time, and it has some well-known performance issues that are fixable.
  • The bytecode should be run through an optimizer such as Soot.
  • There are many open-source Java bytecode interpreters to choose from, most prominently Sun's own and Kaffe (http://kaffe.org). Only Sun's has been tested. At least GCJ should be tried out.
  • A step for post-compiling to native code should be tried out.
  • With XMLVM (http://xmlvm.org/) there could be a way to run on iPhones as well.
  • Considering that we have a full port of lttoolbox, Apertium could be made to run purely on Java, enabling a wide range of platforms, among others Windows, phones (J2ME or Android), web pages and server systems. Only the tagger is missing for a full system.