Bytecode for transfer

From Apertium
Revision as of 10:47, 9 December 2014 by Unhammer (talk | contribs) (links)

Currently transfer is the bottleneck in Apertium: this stage takes about 95% of the CPU time. This is because the transfer file is interpreted (tree-walking the XML of the .t1x transfer file) instead of being compiled into machine code.

The Java transfer bytecode compiler converts arbitrarily complex transfer files into Java source code, which is then compiled into platform-independent bytecode. Thanks to the use of BCEL, the bytecode is now generated directly, so the intermediate step of generating Java source code and compiling it to bytecode is no longer required. This means a JDK is no longer necessary to use lttoolbox-java; a JRE (which is what most users would have installed) is enough.

During transfer, the Java Virtual Machine converts the most heavily used parts (the 'hot spots') into machine code.

This enables

  • Faster transfer (currently a factor of about 5). Since startup takes roughly 0.33 seconds longer, this only pays off when processing more than about 100 sentences (2,000 words).
  • Debuggable transfer. Using a Java development tool such as NetBeans, you can step through the transfer code line by line, inspecting variables and seeing exactly what is happening.
  • Validating transfer files
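The break-even figure above can be sanity-checked against the benchmark below (91.59 s interpreted vs. 15.88 s compiled for 20,000 sentences). A minimal sketch of the arithmetic, assuming the 0.33 s figure is pure extra JVM startup overhead:

```java
// Back-of-the-envelope check of the break-even point quoted above.
// All numbers come from the benchmark in this article.
public class BreakEven {
    public static void main(String[] args) {
        double interpretedTotal = 91.59;   // seconds for the whole corpus
        double compiledTotal = 15.88;      // seconds for the whole corpus
        int sentences = 20000;
        double startupOverhead = 0.33;     // extra seconds of JVM startup

        double speedup = interpretedTotal / compiledTotal;
        System.out.printf("Speedup factor: %.2f%n", speedup);   // ~5.77

        // Per-sentence cost when interpreting (~4.6 ms):
        double perSentence = interpretedTotal / sentences;

        // Break-even n where: n * t = n * t / speedup + startupOverhead
        double breakEven = startupOverhead / (perSentence * (1 - 1 / speedup));
        System.out.printf("Break-even at ~%.0f sentences%n", breakEven);
    }
}
```

With these numbers the break-even comes out at roughly 87 sentences, consistent with the "100 sentences" rule of thumb quoted above.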

A concrete example: Esperanto-English

Take a look at apertium-eo-en.eo-en.t1x and apertium_eo_en_eo_en_t1x.java (the same file converted into Java). The Java version is compiled into bytecode (nowadays an equivalent bytecode class is generated directly) and executed on the JVM, whose JIT (just-in-time) compiler converts it into machine code at run time.

Here is a speed comparison on a corpus (testdata/transfer/transferinput-en-eo.t1x.txt: 20,000 sentences, 423,215 words, 7,527,866 bytes).

Interpreted transfer took 91.59 secs
bytecode compiled transfer took 15.88 secs
Speedup factor: 5.76

Using it

First, compile the t1x to bytecode:

$ apertium-preprocess-transfer-bytecode-j file.t1x file.class

Then, in your pipeline, replace 'apertium-transfer file.t1x' with 'apertium-transfer-j file.class'.

Using it in a language pair

Add an entry to modes.xml where you replace "apertium-transfer" with "apertium-transfer-j" and use the .class file instead of the .t1x file.

For example, replace

     <program name="apertium-transfer">
       <file name="apertium-eo-en.eo-en.t1x"/>
       <file name="eo-en.t1x.bin"/>
       <file name="eo-en.autobil.bin"/>
     </program>

with

     <program name="apertium-transfer-j">
       <file name="eo-en.t1x.class"/>
       <file name="eo-en.t1x.bin"/>
       <file name="eo-en.autobil.bin"/>
     </program>

Now you can compile manually with

$ apertium-preprocess-transfer-bytecode-j apertium-eo-en.eo-en.t1x eo-en.t1x.class

Adding it to your Makefile

You can also add optional support for bytecode compilation to Makefile.am:

Under the lines

$(PREFIX1).t1x.bin: $(BASENAME).$(PREFIX1).t1x
       apertium-validate-transfer $(BASENAME).$(PREFIX1).t1x
       apertium-preprocess-transfer $(BASENAME).$(PREFIX1).t1x $@

Add

       @if [ "`which apertium-preprocess-transfer-bytecode-j`" = "" ]; then echo && echo "NOTE: lttoolbox-java (used for bytecode accelerated transfer) is missing" && echo "      Therefore the following will fail (but it's OK)" && echo; fi
       -apertium-preprocess-transfer-bytecode-j $(BASENAME).$(PREFIX1).t1x $(PREFIX1).t1x.class

(The `=` comparison is used instead of bash's `==` because make runs recipes with /bin/sh.) If lttoolbox-java isn't installed, a warning is emitted and compilation continues (so things still work).

Remember to do the same for $(PREFIX2).

See https://sourceforge.net/p/apertium/svn/21146/log/?path=/trunk/lttoolbox-java for a complete example of the changes.

Further work

  • The Java code has not been optimized for speed, so the real potential speedup may be a factor of 6-8, or even higher if using a mixed mode (mixing C and Java code instead of pure Java).
  • Memory usage is also higher than necessary.
  • The underlying library, lttoolbox-java, uses 50% of the CPU, and there are some well-known, fixable performance issues.
  • There are plenty of open-source Java bytecode interpreters to choose from, most prominently Sun's own and Kaffe (http://kaffe.org). Only Sun's has been tested; at least GCJ should be tried out.
  • A step for post-compiling to native code should be tried out.
  • With http://xmlvm.org/ there could be a way to run on iPhones as well.
  • Considering that we have a full port of lttoolbox, Apertium could be made to run purely on Java, enabling a wide range of platforms: Windows, phones (J2ME or Android), web pages, server systems, etc. Only the tagger is missing for a full system.

Bytecode optimisations

Although the achieved performance is good and certainly represents a great improvement over interpreted transfer, the generated bytecode is far from optimal. With this in mind, we reasoned that a Java bytecode optimizer could further improve both transfer speed and the size of the resulting bytecode classes, and ProGuard was chosen for an experimental test. Unfortunately, the results after trying several different configurations fell far short of expectations.

The following configuration was used for testing purposes with the English ⇆ Spanish language pair package:

java -jar proguard.jar \
  -injars apertium-en-es.jar \
  -outjars apertium-en-es-optimized.jar \
  -libraryjars /home/mikel/developer/java/jdk1.6.0_33/jre/lib/rt.jar \
  -target 6 \
  -keepattributes InnerClasses \
  -optimizationpasses 5 \
  -overloadaggressively \
  -allowaccessmodification \
  -keepclasseswithmembers "class * {public static void main(java.lang.String[]);}" \
  -keep "class org.apertium.interchunk.InterchunkWord" \
  -keep "class org.apertium.transfer.TransferWord" \
  -keep "class transfer_classes.* {public *;}" \
  -assumenosideeffects "class org.apertium.transfer.TransferWord {public java.lang.String tl(org.apertium.transfer.ApertiumRE); public java.lang.String sl(org.apertium.transfer.ApertiumRE);}" \
  -verbose

Using this configuration, ProGuard claims to perform over 10,000 optimisations of different types. For instance, this is the output of the first pass, which is the most significant one (a total of 5 passes were performed):

ProGuard, version 4.8
Reading input...
Reading program jar [/home/mikel/apertium/temp/apertium-en-es.jar]
Reading library jar [/home/mikel/developer/java/jdk1.6.0_33/jre/lib/rt.jar]
Initializing...
Ignoring unused library classes...
  Original number of library classes: 17441
  Final number of library classes:    526
Setting target versions...
Shrinking...
Removing unused program classes and class elements...
  Original number of program classes: 127
  Final number of program classes:    124
Inlining subroutines...
Optimizing...
  Number of finalized classes:                 73
  Number of vertically merged classes:         0
  Number of horizontally merged classes:       5
  Number of removed write-only fields:         83
  Number of privatized fields:                 321
  Number of inlined constant fields:           17
  Number of privatized methods:                101
  Number of staticized methods:                30
  Number of finalized methods:                 316
  Number of removed method parameters:         55
  Number of inlined constant parameters:       9
  Number of inlined constant return values:    0
  Number of inlined short method calls:        178
  Number of inlined unique method calls:       130
  Number of inlined tail recursion calls:      7
  Number of merged code blocks:                37
  Number of variable peephole optimisations:   2179
  Number of arithmetic peephole optimisations: 21
  Number of cast peephole optimisations:       0
  Number of field peephole optimisations:      9
  Number of branch peephole optimisations:     712
  Number of string peephole optimisations:     385
  Number of simplified instructions:           207
  Number of removed instructions:              7720
  Number of removed local variables:           34
  Number of removed exception blocks:          2
  Number of optimized local variable frames:   584
Shrinking...
Removing unused program classes and class elements...
  Original number of program classes: 124
  Final number of program classes:    118

As for the actual results, the file size is certainly reduced:

[mikel@fedora temp]$ ls -l apertium-en-es.jar apertium-en-es-optimized.jar
-rw-rw-r--. 1 mikel mikel 3470823 abu  4 15:48 apertium-en-es.jar
-rw-rw-r--. 1 mikel mikel 3343968 abu  4 16:05 apertium-en-es-optimized.jar

Not a big difference (about 130 KB, less than 5% of the total file size), but not bad... Curiously, the improvement corresponds entirely to lttoolbox-java, since the transfer class files are actually bigger than before! Once extracted, the transfer classes took 379.9 KB before optimizing and 407.8 KB after.

What about translation speed? Let's see how long it takes to translate a single paragraph before and after the optimisation:

[mikel@fedora temp]$ time echo $SOME_TEXT | java -jar apertium-en-es.jar apertium en-es > output

real    0m1.573s
user    0m1.513s
sys     0m0.092s
[mikel@fedora temp]$ time echo $SOME_TEXT | java -jar apertium-en-es-optimized.jar apertium en-es > output

real    0m1.398s
user    0m1.273s
sys     0m0.103s

So it has taken 175 ms less after the optimisation; it seems to have been about 10% faster, then! Really? What happens if we try a longer input? Let's test with a big text file (the same paragraph as before repeated 1,000 times, 378 KB):

[mikel@fedora temp]$ time java -jar apertium-en-es.jar apertium en-es input output
real    0m31.139s
user    0m30.373s
sys     0m0.266s
[mikel@fedora temp]$ time java -jar apertium-en-es-optimized.jar apertium en-es input output

real    0m30.840s
user    0m30.022s
sys     0m0.197s

So it has taken about 300 ms less after the optimisation, that is, it has been barely 1% faster: a practically unnoticeable difference!

All in all, it seems that the time saved by the optimisation is practically constant and not really related to the transfer classes. I suspect it largely corresponds to the preverification step that ProGuard carries out because we target Java 6, and not to the optimisations themselves. In fact, comparing a bytecode fragment before and after the optimisation, there is no considerable change. The following is the non-optimized version:

 public void rule99__prep__probj(Writer arg0, TransferWord arg1, String arg2, TransferWord arg3)
   throws IOException
 {
   if (this.debug)
     logCall("rule99__prep__probj", new Object[] { arg1, arg2, arg3 });
   macro_firstWord(arg0, arg1);
   if (((arg3.sl(this.attr_pers).equals("<p3>")) && (arg3.sl(this.attr_nbr).equals("<pl>"))) || ((arg3.sl(this.attr_pers).equals("<p1>")) && (arg3.sl(this.attr_nbr).equals("<pl>"))))
   {
     arg3.tlSet(this.attr_gen, "<m>");
   }
   else if ((arg3.sl(this.attr_pers).equals("<p3>")) && (arg3.sl(this.attr_nbr).equals("<sg>")) && (!arg3.sl(this.attr_gen).equals("<nt>")))
   {
     arg3.tlSet(this.attr_gen, arg3.sl(this.attr_gen));
   }
   else if ((arg3.sl(this.attr_pers).equals("<p1>")) && (arg3.sl(this.attr_nbr).equals("<sg>")))
   {
     arg3.tlSet(this.attr_lem, "mí");
   }
   else if ((arg3.sl(this.attr_pers).equals("<p2>")) && (arg3.sl(this.attr_nbr).equals("<sp>")))
   {
     arg3.tlSet(this.attr_lem, "ti");
   }
   else if ((arg3.sl(this.attr_pers).equals("<p2>")) && (arg3.sl(this.attr_nbr).equals("<pl>")))
   {
     arg3.tlSet(this.attr_lem, "prpers");
     arg3.tlSet(this.attr_gen, "<m>");
   }
   String str = arg1.tl(this.attr_whole);
   if (str.length() > 0)
     arg0.append('^').append(str).append('$');
   arg0.append('^').append(TransferWord.copycase(this.var_caseFirstWord, "pr")).append("<PREP>").append('{').append("}$");
   arg0.append(arg2);
   arg0.append('^').append(arg3.tl(this.attr_lem)).append("<prn>").append("<2>").append(arg3.tl(this.attr_pers)).append(arg3.tl(this.attr_gen).isEmpty() ? "" : "<4>").append(arg3.tl(this.attr_nbr).isEmpty() ? "" : "<5>").append('$');
   arg0.append('^').append("probj").append("<SN>").append("<tn>").append(arg3.tl(this.attr_pers)).append(arg3.tl(this.attr_gen)).append(arg3.tl(this.attr_nbr)).append('{').append("}$");
 }

In this particular case, the obvious optimisation would be to store arg3.sl(this.attr_pers) in a variable, but the "optimized" version does not do this (there it is called paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA)). In fact, nothing seems to change apart from some variable and method names, which, at the same time, makes the code less readable:

 public void rule99__prep__probj(Writer paramWriter, TransferWord paramTransferWord1, String paramString, TransferWord paramTransferWord2)
 {
   if (this.debug)
     a("rule99__prep__probj", new Object[] { paramTransferWord1, paramString, paramTransferWord2 });
   h(paramTransferWord1);
   if (((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p3>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>"))) || ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p1>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>"))))
   {
     paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA, "<m>");
   }
   else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p3>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sg>")) && (!paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA).equals("<nt>")))
   {
     paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA, paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA));
   }
   else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p1>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sg>")))
   {
     paramTransferWord2.a(this.w, "mí");
   }
   else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p2>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sp>")))
   {
     paramTransferWord2.a(this.w, "ti");
   }
   else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p2>")) && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>")))
   {
     paramTransferWord2.a(this.w, "prpers");
     paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA, "<m>");
   }
   if ((paramTransferWord1 = paramTransferWord1.b(this.z)).length() > 0)
     paramWriter.append('^').append(paramTransferWord1).append('$');
   paramWriter.append('^').append(TransferWord.a(this.jdField_f_of_type_JavaLangString, "pr")).append("<PREP>").append('{').append("}$");
   paramWriter.append(paramString);
   paramWriter.append('^').append(paramTransferWord2.b(this.w)).append("<prn>").append("<2>").append(paramTransferWord2.b(this.jdField_o_of_type_OrgApertiumTransferA)).append(paramTransferWord2.b(this.jdField_p_of_type_OrgApertiumTransferA).isEmpty() ? "" : "<4>").append(paramTransferWord2.b(this.jdField_s_of_type_OrgApertiumTransferA).isEmpty() ? "" : "<5>").append('$');
   paramWriter.append('^').append("probj").append("<SN>").append("<tn>").append(paramTransferWord2.b(this.jdField_o_of_type_OrgApertiumTransferA)).append(paramTransferWord2.b(this.jdField_p_of_type_OrgApertiumTransferA)).append(paramTransferWord2.b(this.jdField_s_of_type_OrgApertiumTransferA)).append('{').append("}$");
 }
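For contrast, here is a sketch of the common-subexpression hoisting that ProGuard did not perform. The Word class below is a hypothetical stand-in for TransferWord (not the real org.apertium.transfer API), instrumented to count attribute lookups:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for TransferWord, just enough to illustrate hoisting.
// (Hypothetical stub -- not the real org.apertium.transfer.TransferWord.)
class Word {
    private final Map<String, String> attrs = new HashMap<>();
    int slCalls = 0;                 // count lookups to show the saving
    Word set(String attr, String v) { attrs.put(attr, v); return this; }
    String sl(String attr) { slCalls++; return attrs.getOrDefault(attr, ""); }
}

public class HoistDemo {
    public static void main(String[] args) {
        Word w = new Word().set("pers", "<p3>").set("nbr", "<sg>");

        // As generated: every comparison re-calls sl()
        boolean naiveResult =
               (w.sl("pers").equals("<p3>") && w.sl("nbr").equals("<pl>"))
            || (w.sl("pers").equals("<p1>") && w.sl("nbr").equals("<pl>"));
        int naiveLookups = w.slCalls;

        // Hoisted: evaluate each attribute once into a local
        w.slCalls = 0;
        String pers = w.sl("pers");
        String nbr = w.sl("nbr");
        boolean hoistedResult =
               (pers.equals("<p3>") && nbr.equals("<pl>"))
            || (pers.equals("<p1>") && nbr.equals("<pl>"));
        int hoistedLookups = w.slCalls;

        System.out.println("naive lookups: " + naiveLookups
                + ", hoisted: " + hoistedLookups);   // naive: 3, hoisted: 2
        System.out.println("same result: " + (naiveResult == hoistedResult));
    }
}
```

Applied by hand to the condition chain of rule99__prep__probj, this would collapse the repeated sl() calls per rule invocation into a couple of locals; note it is only safe if sl() is side-effect free for the same attribute. Doing this during bytecode generation is one possible "optimisation ourselves".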

In a nutshell, although ProGuard certainly makes things slightly better, the difference is so small that it is arguably not worth using. Later on, we could consider trying another bytecode optimizer, or implementing some optimisations ourselves during bytecode generation.