Bytecode for transfer
[[Category:Development]]
[[Category:Documentation in English]]
[[Category:Transfer]]
Latest revision as of 10:47, 9 December 2014
Transfer is currently the bottleneck in Apertium: processing here takes about 95% of the CPU time. This is because the transfer file is interpreted (tree-walking the XML of the .t1x transfer file) instead of being compiled into machine code.
The Java transfer bytecode compiler converts arbitrarily complex transfer files into Java source code, which is then compiled into platform-independent bytecode. Thanks to the use of BCEL, the bytecode is now generated directly, so the intermediate step of generating Java source code and compiling it to bytecode is no longer required. This way, a JDK is no longer necessary in order to use lttoolbox-java; a JRE (which is what most users have installed) is enough.
During transfer the Java Virtual Machine will convert the most used part (the 'hot spots') into machine code.
This enables:
- Faster transfer (currently a factor of 5). As startup time is about 0.33 seconds higher, this only pays off when processing more than 100 sentences (2000 words).
- Debuggable transfer. Using a Java development tool, for example NetBeans, you can step through the transfer code line by line, inspecting variables and seeing exactly what is happening.
- Validating transfer files.
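The gap between the two modes comes from how the rule logic is executed. Here is a minimal, purely illustrative sketch (the attribute names and node classes are made up, not Apertium code): the same transfer condition evaluated by walking a tree of condition nodes, the way the interpreter walks the `<and>`/`<equal>` elements of a t1x file, versus as straight-line Java of the kind the bytecode compiler generates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Apertium code: attribute names ("pers", "nbr")
// and tag values are invented for illustration.
public class TransferStyles {
    // Interpreted style: walk a tiny tree of condition nodes.
    interface Node { boolean eval(Map<String, String> word); }

    static class Equal implements Node {
        final String attr, value;
        Equal(String attr, String value) { this.attr = attr; this.value = value; }
        public boolean eval(Map<String, String> w) { return value.equals(w.get(attr)); }
    }

    static class And implements Node {
        final List<Node> kids;
        And(Node... kids) { this.kids = Arrays.asList(kids); }
        public boolean eval(Map<String, String> w) {
            for (Node n : kids) if (!n.eval(w)) return false;
            return true;
        }
    }

    // Compiled style: the same test as straight-line Java, roughly what a
    // generated transfer class contains after bytecode compilation.
    static boolean compiled(Map<String, String> w) {
        return "<p3>".equals(w.get("pers")) && "<pl>".equals(w.get("nbr"));
    }

    public static void main(String[] args) {
        Map<String, String> word = Map.of("pers", "<p3>", "nbr", "<pl>");
        Node tree = new And(new Equal("pers", "<p3>"), new Equal("nbr", "<pl>"));
        System.out.println(tree.eval(word));  // true
        System.out.println(compiled(word));   // true
    }
}
```

In the interpreted version every evaluation re-dispatches through node objects allocated from the XML; in the compiled version the JIT can inline and optimise the whole test, which is where the speedup comes from.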
== A concrete example: Esperanto-English ==
Take a look at apertium-eo-en.eo-en.t1x and apertium_eo_en_eo_en_t1x.java (the same file converted into Java).
The Java version is compiled into bytecode (an equivalent bytecode class is now generated directly) and executed by the JVM, whose JIT (just-in-time) compiler converts it into machine code at run time.
Here is a speed comparison on a corpus (testdata/transfer/transferinput-en-eo.t1x.txt: 20000 sentences, 423215 words, 7527866 bytes).
Interpreted transfer took 91.59 secs
Bytecode compiled transfer took 15.88 secs
Speedup factor: 5.76
== Using it ==
First, compile the t1x to bytecode:
$ apertium-preprocess-transfer-bytecode-j file.t1x file.class
Then replace 'apertium-transfer file.t1x' with 'apertium-transfer-j file.class'.
== Using it in a language pair ==
Add an entry to modes.xml where you replace "apertium-transfer" with "apertium-transfer-j" and use the .class file instead of the .t1x file.
For example, replace
<program name="apertium-transfer">
  <file name="apertium-eo-en.eo-en.t1x"/>
  <file name="eo-en.t1x.bin"/>
  <file name="eo-en.autobil.bin"/>
</program>
with
<program name="apertium-transfer-j">
  <file name="eo-en.t1x.class"/>
  <file name="eo-en.t1x.bin"/>
  <file name="eo-en.autobil.bin"/>
</program>
Now you can compile manually with
$ apertium-preprocess-transfer-bytecode-j apertium-eo-en.eo-en.t1x eo-en.t1x.class
== Adding it to your Makefile ==
You can also add optional support for bytecode compilation to Makefile.am:
Under the lines
$(PREFIX1).t1x.bin: $(BASENAME).$(PREFIX1).t1x
	apertium-validate-transfer $(BASENAME).$(PREFIX1).t1x
	apertium-preprocess-transfer $(BASENAME).$(PREFIX1).t1x $@
Add
	@if [ "`which apertium-preprocess-transfer-bytecode-j`" == "" ]; then echo && echo "NOTE: lttoolbox-java (used for bytecode accelerated transfer) is missing" && echo "      Therefore the following will fail (but it's OK)" && echo; fi
	-apertium-preprocess-transfer-bytecode-j $(BASENAME).$(PREFIX1).t1x $(PREFIX1).t1x.class
If lttoolbox-java isn't installed, a warning is emitted and compilation continues (so things still work).
Remember to do the same for $(PREFIX2).
See https://sourceforge.net/p/apertium/svn/21146/log/?path=/trunk/lttoolbox-java for a complete example of the changes.
== Further work ==
- The Java code has not been optimized for speed, so the real potential speedup might be a factor of 6-8 or even higher, for instance when using a mixed mode (mixing C and Java code instead of doing pure Java).
- Memory usage is also higher than strictly necessary.
- The underlying library, lttoolbox-java, uses 50% of the CPU, and there are some well-known performance issues which are fixable.
- There are plenty of open-source Java bytecode interpreters to choose from, most prominently Sun's own and http://kaffe.org. Only Sun's has been tested; at least GCJ should be tried out.
- Post-compiling to native code should also be tried out.
- With http://xmlvm.org/ there could be a way to target iPhones as well.
- Considering that we have a full port of lttoolbox, Apertium could be made to run purely on Java, enabling a wide range of platforms: Windows, phones (J2ME or Android), web pages, server systems. Only the tagger is missing for a full system.
=== Bytecode optimisations ===
Although the achieved performance is good and certainly represents a great improvement over interpreted transfer, the generated bytecode is far from being optimal. Having this in mind, it was considered that a Java bytecode optimizer could provide a great improvement in both transfer speed and the size of the resulting bytecode classes, and ProGuard was chosen to experimentally test it. Unfortunately, the results achieved after trying with several different configurations were far from the expected ones.
The following configuration was used for testing purposes with the English ⇆ Spanish language pair package:
java -jar proguard.jar -injars apertium-en-es.jar -outjars apertium-en-es-optimized.jar \
    -libraryjars /home/mikel/developer/java/jdk1.6.0_33/jre/lib/rt.jar \
    -target 6 -keepattributes InnerClasses \
    -optimizationpasses 5 -overloadaggressively -allowaccessmodification \
    -keepclasseswithmembers "class * {public static void main(java.lang.String[]);}" \
    -keep "class org.apertium.interchunk.InterchunkWord" \
    -keep "class org.apertium.transfer.TransferWord" \
    -keep "class transfer_classes.* {public *;}" \
    -assumenosideeffects "class org.apertium.transfer.TransferWord {public java.lang.String tl(org.apertium.transfer.ApertiumRE); public java.lang.String sl(org.apertium.transfer.ApertiumRE);}" \
    -verbose
Using this configuration, ProGuard claims to be performing over 10,000 optimisations of different types. For instance, this is the output of the first iteration, which is the most significant one (a total of 5 iterations were performed):
ProGuard, version 4.8
Reading input...
Reading program jar [/home/mikel/apertium/temp/apertium-en-es.jar]
Reading library jar [/home/mikel/developer/java/jdk1.6.0_33/jre/lib/rt.jar]
Initializing...
Ignoring unused library classes...
  Original number of library classes: 17441
  Final number of library classes:    526
Setting target versions...
Shrinking...
Removing unused program classes and class elements...
  Original number of program classes: 127
  Final number of program classes:    124
Inlining subroutines...
Optimizing...
  Number of finalized classes:                 73
  Number of vertically merged classes:         0
  Number of horizontally merged classes:       5
  Number of removed write-only fields:         83
  Number of privatized fields:                 321
  Number of inlined constant fields:           17
  Number of privatized methods:                101
  Number of staticized methods:                30
  Number of finalized methods:                 316
  Number of removed method parameters:         55
  Number of inlined constant parameters:       9
  Number of inlined constant return values:    0
  Number of inlined short method calls:        178
  Number of inlined unique method calls:       130
  Number of inlined tail recursion calls:      7
  Number of merged code blocks:                37
  Number of variable peephole optimisations:   2179
  Number of arithmetic peephole optimisations: 21
  Number of cast peephole optimisations:       0
  Number of field peephole optimisations:      9
  Number of branch peephole optimisations:     712
  Number of string peephole optimisations:     385
  Number of simplified instructions:           207
  Number of removed instructions:              7720
  Number of removed local variables:           34
  Number of removed exception blocks:          2
  Number of optimized local variable frames:   584
Shrinking...
Removing unused program classes and class elements...
  Original number of program classes: 124
  Final number of program classes:    118
As for the actual results, the file size is certainly reduced after it:
[mikel@fedora temp]$ ls -l apertium-en-es.jar apertium-en-es-optimized.jar
-rw-rw-r--. 1 mikel mikel 3470823 abu  4 15:48 apertium-en-es.jar
-rw-rw-r--. 1 mikel mikel 3343968 abu  4 16:05 apertium-en-es-optimized.jar
Not a big difference (about 124 KB, less than 5% of the total file size), but not bad. The funny thing is that the improvement corresponds entirely to lttoolbox-java, since the transfer class files are actually bigger than before! Once extracted, the transfer classes took 379.9 KB before optimizing, and 407.8 KB after it.
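The size difference can be checked directly from the byte counts in the ls listing:

```java
// Quick arithmetic check of the jar size difference, using the byte
// counts from the ls output above.
public class SizeDiff {
    public static void main(String[] args) {
        long before = 3470823;  // apertium-en-es.jar (bytes)
        long after  = 3343968;  // apertium-en-es-optimized.jar (bytes)
        long diff = before - after;
        System.out.println(diff + " bytes saved");            // 126855 bytes saved
        System.out.printf("%.1f KB%n", diff / 1024.0);        // 123.9 KB
        System.out.printf("%.2f%%%n", 100.0 * diff / before); // 3.65%
    }
}
```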
What about translation speed? Let's see how long it takes to translate a single paragraph before and after the optimisation:
[mikel@fedora temp]$ time echo $SOME_TEXT | java -jar apertium-en-es.jar apertium en-es > output

real    0m1.573s
user    0m1.513s
sys     0m0.092s

[mikel@fedora temp]$ time echo $SOME_TEXT | java -jar apertium-en-es-optimized.jar apertium en-es > output

real    0m1.398s
user    0m1.273s
sys     0m0.103s
So it took 175 ms less after the optimisation... about 10% faster, it seems! Really? What happens if we try a longer input? Let's test it with a big text file (the file consists of the same paragraph as before repeated 1000 times, and takes 378 KB):
[mikel@fedora temp]$ time java -jar apertium-en-es.jar apertium en-es input output

real    0m31.139s
user    0m30.373s
sys     0m0.266s

[mikel@fedora temp]$ time java -jar apertium-en-es-optimized.jar apertium en-es input output

real    0m30.840s
user    0m30.022s
sys     0m0.197s
So it took about 300 ms less after the optimisation; that is, it was barely 1% faster, which is an unnoticeable difference!
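A quick calculation from the two timing runs makes the pattern clear: the absolute saving is roughly constant (a couple of hundred milliseconds), so the relative improvement shrinks as the input grows.

```java
// Relative savings computed from the `real` times measured above.
public class SavingCheck {
    public static void main(String[] args) {
        double shortBefore = 1.573, shortAfter = 1.398;   // single paragraph
        double longBefore = 31.139, longAfter = 30.840;   // paragraph x 1000
        System.out.printf("short input: %.0f ms saved (%.1f%%)%n",
                (shortBefore - shortAfter) * 1000,
                100 * (shortBefore - shortAfter) / shortBefore);  // 175 ms saved (11.1%)
        System.out.printf("long input:  %.0f ms saved (%.1f%%)%n",
                (longBefore - longAfter) * 1000,
                100 * (longBefore - longAfter) / longBefore);     // 299 ms saved (1.0%)
    }
}
```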
All in all, the time saved by the optimisation seems to be practically constant, and it does not appear to be related to the transfer classes. I suspect it largely corresponds to the preverification step that ProGuard carries out because we target Java 6, and not to the optimisations themselves. In fact, comparing a bytecode fragment before and after the optimisation, there isn't any considerable change. The following corresponds to the non-optimized version:
public void rule99__prep__probj(Writer arg0, TransferWord arg1, String arg2, TransferWord arg3) throws IOException {
    if (this.debug)
        logCall("rule99__prep__probj", new Object[] { arg1, arg2, arg3 });
    macro_firstWord(arg0, arg1);
    if (((arg3.sl(this.attr_pers).equals("<p3>")) && (arg3.sl(this.attr_nbr).equals("<pl>")))
            || ((arg3.sl(this.attr_pers).equals("<p1>")) && (arg3.sl(this.attr_nbr).equals("<pl>")))) {
        arg3.tlSet(this.attr_gen, "<m>");
    } else if ((arg3.sl(this.attr_pers).equals("<p3>")) && (arg3.sl(this.attr_nbr).equals("<sg>"))
            && (!arg3.sl(this.attr_gen).equals("<nt>"))) {
        arg3.tlSet(this.attr_gen, arg3.sl(this.attr_gen));
    } else if ((arg3.sl(this.attr_pers).equals("<p1>")) && (arg3.sl(this.attr_nbr).equals("<sg>"))) {
        arg3.tlSet(this.attr_lem, "mí");
    } else if ((arg3.sl(this.attr_pers).equals("<p2>")) && (arg3.sl(this.attr_nbr).equals("<sp>"))) {
        arg3.tlSet(this.attr_lem, "ti");
    } else if ((arg3.sl(this.attr_pers).equals("<p2>")) && (arg3.sl(this.attr_nbr).equals("<pl>"))) {
        arg3.tlSet(this.attr_lem, "prpers");
        arg3.tlSet(this.attr_gen, "<m>");
    }
    String str = arg1.tl(this.attr_whole);
    if (str.length() > 0)
        arg0.append('^').append(str).append('$');
    arg0.append('^').append(TransferWord.copycase(this.var_caseFirstWord, "pr")).append("<PREP>").append('{').append("}$");
    arg0.append(arg2);
    arg0.append('^').append(arg3.tl(this.attr_lem)).append("<prn>").append("<2>").append(arg3.tl(this.attr_pers))
            .append(arg3.tl(this.attr_gen).isEmpty() ? "" : "<4>")
            .append(arg3.tl(this.attr_nbr).isEmpty() ? "" : "<5>").append('$');
    arg0.append('^').append("probj").append("<SN>").append("<tn>").append(arg3.tl(this.attr_pers))
            .append(arg3.tl(this.attr_gen)).append(arg3.tl(this.attr_nbr)).append('{').append("}$");
}
In this particular case, it would be obvious to store arg3.sl(this.attr_pers) in a variable, but the "optimized" version does not do this (there it is called paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA)). In fact, it seems that nothing changes apart from some variable and method names, which, at the same time, makes the code less readable:
public void rule99__prep__probj(Writer paramWriter, TransferWord paramTransferWord1, String paramString, TransferWord paramTransferWord2) {
    if (this.debug)
        a("rule99__prep__probj", new Object[] { paramTransferWord1, paramString, paramTransferWord2 });
    h(paramTransferWord1);
    if (((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p3>"))
                && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>")))
            || ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p1>"))
                && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>")))) {
        paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA, "<m>");
    } else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p3>"))
            && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sg>"))
            && (!paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA).equals("<nt>"))) {
        paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA,
                paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA));
    } else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p1>"))
            && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sg>"))) {
        paramTransferWord2.a(this.w, "mí");
    } else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p2>"))
            && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<sp>"))) {
        paramTransferWord2.a(this.w, "ti");
    } else if ((paramTransferWord2.a(this.jdField_o_of_type_OrgApertiumTransferA).equals("<p2>"))
            && (paramTransferWord2.a(this.jdField_s_of_type_OrgApertiumTransferA).equals("<pl>"))) {
        paramTransferWord2.a(this.w, "prpers");
        paramTransferWord2.a(this.jdField_p_of_type_OrgApertiumTransferA, "<m>");
    }
    if ((paramTransferWord1 = paramTransferWord1.b(this.z)).length() > 0)
        paramWriter.append('^').append(paramTransferWord1).append('$');
    paramWriter.append('^').append(TransferWord.a(this.jdField_f_of_type_JavaLangString, "pr")).append("<PREP>").append('{').append("}$");
    paramWriter.append(paramString);
    paramWriter.append('^').append(paramTransferWord2.b(this.w)).append("<prn>").append("<2>")
            .append(paramTransferWord2.b(this.jdField_o_of_type_OrgApertiumTransferA))
            .append(paramTransferWord2.b(this.jdField_p_of_type_OrgApertiumTransferA).isEmpty() ? "" : "<4>")
            .append(paramTransferWord2.b(this.jdField_s_of_type_OrgApertiumTransferA).isEmpty() ? "" : "<5>").append('$');
    paramWriter.append('^').append("probj").append("<SN>").append("<tn>")
            .append(paramTransferWord2.b(this.jdField_o_of_type_OrgApertiumTransferA))
            .append(paramTransferWord2.b(this.jdField_p_of_type_OrgApertiumTransferA))
            .append(paramTransferWord2.b(this.jdField_s_of_type_OrgApertiumTransferA)).append('{').append("}$");
}
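The hand optimisation suggested above, hoisting the repeated attribute lookup into a local variable, can be sketched in isolation. This is an illustrative toy (the word attributes, return values, and lookup counter are made up), not generated transfer code:

```java
import java.util.Map;

// Toy illustration of common-subexpression elimination: the attribute
// lookup is hoisted into a local instead of being re-evaluated in every
// branch condition. All names here are hypothetical.
public class HoistExample {
    static int lookups = 0;

    // Stands in for TransferWord.sl(...); counts how often it runs.
    static String sl(Map<String, String> word, String attr) {
        lookups++;
        return word.getOrDefault(attr, "");
    }

    // As generated: re-evaluates sl(word, "pers") in every condition.
    static String naive(Map<String, String> word) {
        if (sl(word, "pers").equals("<p3>") && sl(word, "nbr").equals("<pl>")) return "a";
        if (sl(word, "pers").equals("<p1>") && sl(word, "nbr").equals("<pl>")) return "b";
        if (sl(word, "pers").equals("<p2>")) return "c";
        return "";
    }

    // Hand-optimised: each attribute is looked up exactly once.
    static String hoisted(Map<String, String> word) {
        String pers = sl(word, "pers"), nbr = sl(word, "nbr");
        if (pers.equals("<p3>") && nbr.equals("<pl>")) return "a";
        if (pers.equals("<p1>") && nbr.equals("<pl>")) return "b";
        if (pers.equals("<p2>")) return "c";
        return "";
    }

    public static void main(String[] args) {
        Map<String, String> w = Map.of("pers", "<p2>", "nbr", "<sg>");
        lookups = 0; String r1 = naive(w);   int n1 = lookups;
        lookups = 0; String r2 = hoisted(w); int n2 = lookups;
        // Same result, fewer lookups.
        System.out.println(r1.equals(r2) + " naive=" + n1 + " hoisted=" + n2);
        // prints: true naive=3 hoisted=2
    }
}
```

On real rules with many branches the saving per word would be larger, which is exactly the kind of optimisation that could be done during bytecode generation rather than left to ProGuard.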
In a nutshell, although ProGuard certainly makes things slightly better, the difference is so small that it is arguably not worth using. Later on, we could consider trying another bytecode optimizer, or implementing some optimisations ourselves during bytecode generation.