Lttoolbox-java
Latest revision as of 09:49, 7 April 2020
What is lttoolbox-java
lttoolbox-java is a Java port of the whole Apertium runtime system, including both lttoolbox and apertium.
lttoolbox functions
lttoolbox can do the following:
- Compile: make binary files out of the .dix files (lt-comp),
- Process: analyse or generate text (lt-proc), and
- Expand: expand a dictionary .dix file (lt-expand).
The Java port of lttoolbox additionally supports:
- Compounds (experimental)
- Flag diacritics (highly experimental)
- Validation of .dix files
apertium runtime functions
The Java port implements the typical functions used by Apertium during runtime.
- Read .mode files and execute the steps included in them
- Execute the tagger
- Execute transfer stages (all 3 of them)
The Java port needs the C++ binaries for preparing and developing a language pair, among other things to compile transfer files and train the tagger.
The Java port can also:
- Generate bytecode for transfer and execute it. The bytecode typically runs 10 times faster than the C++ version.
Why
A "Java port" of Apertium enables use on
- Windows,
- Android phones,
- Cross-platform desktop applications,
- Java server applications.
The last two are relevant because, for example, a LibreOffice plugin should be platform-independent to be maintainable.
We haven't seen anyone embedding Apertium in a desktop application. Currently Apertium can be used from a local subdirectory, but installation isn't trivial for an end user.
Having a packaged, easy-to-use version of Apertium ready for embedding MT in a larger program would be very cool. Ideally there should be a self-contained Apertium JAR file, dependent only on the JRE, plus an additional JAR file per language pair.
Another "embedding" approach is to use a client stub to one of our Apertium services, but there can be reasons why people prefer to have things installed locally (we don't need to repeat them here).
Features
- Binary compatibility with lttoolbox. lttoolbox-java is able to read and write the binary files of lttoolbox and generates exactly the same output.
- There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.
Installation
Prerequisites
- java-runtime
- apache-ant (for compilation)
Under Arch Linux, you can install the prerequisites with
pacman -S openjdk6 apache-ant
Under Debian/Ubuntu:
sudo apt-get install ant ant-optional # what else??
Download, compile, install
Download the newest release (https://github.com/apertium/lttoolbox-java/releases) or check out from GitHub:
git clone https://github.com/apertium/lttoolbox-java/
Use NetBeans or the Unix command line, whichever suits you best:
sh autogen.sh
make
sudo make install
You can also build and install using Maven 2 (http://maven.apache.org) by typing:
mvn install -DskipTests
See also the README file.
Usage
$ java -jar dist/lttoolbox.jar
lttoolbox: is a toolbox for lexical processing, morphological analysis and generation of words
USAGE: java -jar dist/lttoolbox.jar [task]
Examples:
  java -jar dist/lttoolbox.jar lt-expand dictionary.dix      expands a dictionary
  java -jar dist/lttoolbox.jar lt-comp lr dic.dix dic.bin    compiles a dictionary
  java -jar dist/lttoolbox.jar lt-proc dic.bin               morphological analysis
or, using the shell scripts:

$ lt-comp-j
v3.2j: build a letter transducer from a dictionary
USAGE: LTComp lr | rl dictionary_file output_file [acx_file]
Modes:
  lr: left-to-right compilation
  rl: right-to-left compilation

$ lt-proc-j
LTProc: process a stream with a letter transducer
USAGE: LTProc [-c] [-a|-g|-n|-d|-b|-p|-s|-t] fst_file [input_file [output_file]]
Options:
  -a: morphological analysis (default behavior)
  -c: use the literal case of the incoming characters
  -e: morphological analysis, with compound analysis on unknown words
  -f: match flags (experimental)
  -g: morphological generation
  -n: morph. generation without unknown word marks
  -d: morph. generation with all the stuff
  -t: morph. generation, but retaining part-of-speech
  -p: post-generation
  -s: SAO annotation system input processing
  -t: apply transliteration dictionary
  -z: flush output on the null character
  -v: version
  -D: debug; print diagnostics to stderr
  -h: show this help

$ lt-expand-j
v3.2j: expand the contents of a dictionary file
USAGE: LTExpand dictionary_file [output_file]

$ lt-validate-j
v3.2j: validate an XML file according to a schema
USAGE : LTValidate -dix dictionary.xml
        LTValidate -acx dictionary.acx
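For embedding in a larger Java program, the command lines above can also be assembled programmatically. The following is only a minimal sketch and is not part of lttoolbox-java: the class LtProcLauncher and its command method are hypothetical names, and it assumes that a java executable is on the PATH and that the jar lives at dist/lttoolbox.jar, as in the usage text.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical launcher; not part of lttoolbox-java. */
class LtProcLauncher {

    /** Builds the argument list for: java -Dfile.encoding=UTF-8 -jar JAR lt-proc MODE FST. */
    static List<String> command(String jarPath, String mode, String fstFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");                  // assumes java is on the PATH
        cmd.add("-Dfile.encoding=UTF-8"); // avoids platform-dependent decoding
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("lt-proc");
        cmd.add(mode);                    // e.g. -a, -g or -e from the option list
        cmd.add(fstFile);
        return cmd;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: LtProcLauncher fst_file");
            return;
        }
        // inheritIO connects the child's stdin/stdout/stderr to this process,
        // so text piped into the launcher reaches lt-proc.
        Process p = new ProcessBuilder(command("dist/lttoolbox.jar", "-a", args[0]))
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}
```

Keeping the argument list in its own method makes it easy to inspect or log the exact command before anything is started.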
Examples
Use the new compounding feature:
echo "lambakjöti" | java -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
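The output line above can be read as one lexical unit: the surface form, then a slash, then an analysis whose compound parts are joined with "+". The sketch below shows one way to take such a line apart in Java. LexicalUnit is a hypothetical helper, not a class from lttoolbox-java, and it assumes the input contains no escaped special characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of lttoolbox-java. */
class LexicalUnit {
    final String surface;
    final List<List<String>> analyses; // each analysis is a list of compound parts

    LexicalUnit(String surface, List<List<String>> analyses) {
        this.surface = surface;
        this.analyses = analyses;
    }

    /** Parses text of the form ^surface/analysis1/analysis2$ (assumes no escaped characters). */
    static LexicalUnit parse(String unit) {
        if (!unit.startsWith("^") || !unit.endsWith("$")) {
            throw new IllegalArgumentException("not a lexical unit: " + unit);
        }
        String[] fields = unit.substring(1, unit.length() - 1).split("/");
        List<List<String>> analyses = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // '+' joins the parts of a compound analysis
            analyses.add(Arrays.asList(fields[i].split("\\+")));
        }
        return new LexicalUnit(fields[0], analyses);
    }

    public static void main(String[] args) {
        // The example word, with o-umlaut written as a Unicode escape so that the
        // source file stays ASCII-only.
        LexicalUnit lu = parse(
            "^lambakj\u00f6ti/lamb<n><nt><pl><gen><ind>+kj\u00f6t<n><nt><sg><dat><ind>$");
        System.out.println("surface: " + lu.surface);
        for (List<String> parts : lu.analyses) {
            System.out.println("analysis parts: " + parts);
        }
    }
}
```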
Encoding problems
If non-ASCII characters come out garbled, try passing -Dfile.encoding=UTF-8, for example:
echo "lambakjöti" | java -Dfile.encoding=UTF-8 -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
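The reason the flag matters can be illustrated in isolation: the same UTF-8 bytes decode to different strings depending on the charset used. This is a minimal sketch with a hypothetical class name, EncodingDemo. It does not call any lttoolbox-java code, and ISO-8859-1 is only an example of a non-UTF-8 default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo; not part of lttoolbox-java. */
class EncodingDemo {

    /** Decodes the given bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The example word, with o-umlaut written as a Unicode escape.
        byte[] utf8 = "lambakj\u00f6ti".getBytes(StandardCharsets.UTF_8);

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        // Correct: the two bytes of the o-umlaut become one character again.
        System.out.println("decoded as UTF-8:      " + decode(utf8, StandardCharsets.UTF_8));
        // Wrong charset: each of the two bytes becomes its own character.
        System.out.println("decoded as ISO-8859-1: " + decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

Printing Charset.defaultCharset() is a quick way to see which charset the JVM picked up on a given machine.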
Mac users
You need JDK 1.6. Try:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Commands/java -jar dist/lttoolbox.jar
Windows usage
By default, the Windows console does not use UTF-8, whereas Apertium's data is encoded in UTF-8. This command switches the console to UTF-8:
chcp 65001
Note: you also need to use a Unicode-capable font for the Windows console, such as Lucida Console (Properties -> Font).
Also, don't forget to set these runtime flags: -Xms64m -Xmx800m -Dfile.encoding=UTF-8
Reasons for a Java port
- There are several devices (mobile phones, for example) which can run quite complicated software, but only if it is written in Java. lttoolbox is the first step towards having Apertium run on these devices.
- Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
- Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
- Transfer in bytecode promises a speedup of a factor of 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.
Performance of Java port
Compatibility and performance can be checked by invoking test_java_and_c.sh in testdata/regression.
Single-core processor (Jimmy O'Regan)
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18~pre4-1ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
C analysis is... 0.59sec
OK
Java analysis is... 1.15sec
OK
C generator -g is ... 0.54sec
OK
Java generator -g is ... 1.13sec
OK
C generator -d is ... 0.56sec
OK
Java generator -d is ... 1.12sec
OK
C generator -n is ... 0.52sec
OK
Java generator -n is ... 1.12sec
OK
C postgenerator -p is ... 0.07sec
OK
Java postgenerator -p is ... 0.33sec
OK
All tests passed
Dual-core processor (Jacob)
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b86, mixed mode, sharing)
C analysis is... 0.39sec
OK
Java analysis is... 0.66sec
OK
C generator -g is ... 0.32sec
OK
Java generator -g is ... 0.62sec
OK
C generator -d is ... 0.33sec
OK
Java generator -d is ... 0.58sec
OK
C generator -n is ... 0.32sec
OK
Java generator -n is ... 0.64sec
OK
C postgenerator -p is ... 0.03sec
OK
Java postgenerator -p is ... 0.20sec
OK
All tests passed
As you can see, the Java version is currently (April 2010) about a factor of 2 slower than the C version. There are ways to remedy this (using collection classes specialised for primitive types), but this hasn't been implemented, as no one has requested it.
It still gives great performance, however, and Apertium running on Java is very fast compared to other MT systems. The overhead of using the Java version instead of the C version is negligible, as transfer is the big resource hog anyway.
The above test compares the basic lttoolbox functions. As Java transfer is much faster, the results of performance tests of a pure-Java and a pure-C++ chain are comparable (and mostly in Java's favor). A hybrid can be made which beats the performance of both systems.
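The slowdown figure quoted above can be checked directly from the single-core timings. The sketch below simply divides each Java time by the corresponding C time. The class name Slowdown is hypothetical, and the numbers are copied from the single-core run listed in this section.

```java
/** Hypothetical helper for reading the timings in this section. */
class Slowdown {

    /** Returns how many times slower the Java run was than the C run. */
    static double ratio(double cSeconds, double javaSeconds) {
        return javaSeconds / cSeconds;
    }

    public static void main(String[] args) {
        // Timings in seconds from the single-core run.
        String[] tests = {"analysis", "generator -g", "generator -d",
                          "generator -n", "postgenerator -p"};
        double[] cSecs    = {0.59, 0.54, 0.56, 0.52, 0.07};
        double[] javaSecs = {1.15, 1.13, 1.12, 1.12, 0.33};

        for (int i = 0; i < tests.length; i++) {
            System.out.printf("%-18s %.2f%n", tests[i], ratio(cSecs[i], javaSecs[i]));
        }
    }
}
```

Analysis and generation come out close to 2, while the post-generator ratio is noticeably higher on very small absolute times (0.07 s against 0.33 s).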
Known bugs
Oct 2019: lttoolbox-java can read files compiled with lttoolbox version 3.5.0.
Note: lttoolbox-java currently lacks support for functionality added in the last 5 years, and it doesn't work on Java JDK 9+, as it uses BCEL classes that were embedded in Java 8 but changed package name in Java 9.
Instead, we should include BCEL as a library.
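If BCEL were bundled as an ordinary library, as suggested above, a start-up check could confirm that it is actually on the classpath before any bytecode is generated. The following is a sketch only: BcelProbe is a hypothetical class, and org.apache.bcel.classfile.JavaClass is a class from the standalone Apache Commons BCEL distribution. Whether lttoolbox-java would use that particular class is an assumption.

```java
/** Hypothetical start-up check; not part of lttoolbox-java. */
class BcelProbe {

    /** Returns true if a class with the given name can be loaded, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, BcelProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name from standalone Apache Commons BCEL; found only if its jar is on the classpath.
        String bcelClass = "org.apache.bcel.classfile.JavaClass";
        if (isLoadable(bcelClass)) {
            System.out.println("BCEL library found on the classpath");
        } else {
            System.out.println("BCEL library not found; add the BCEL jar to the classpath");
        }
    }
}
```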
Thanks
- Nic Cottrell contributed an initial version of a Java port of lttoolbox.
- During GSoC 2009, Raphaël and Sergio worked on it, but processing still didn't work (compilation and expansion worked).
- In November 2009, Jacob Nordfalk finished it and optimized some parts of it.
- During GSoC 2010, Jacob mentored a full Java Runtime Port of Apertium (i.e. not just lttoolbox) made by Stephen Tigner.