Lttoolbox-java

[[Lttoolbox-java (français)]]


{{TOCD}}


== What is lttoolbox-java ==
lttoolbox-java is a Java port of the whole Apertium runtime system, ''including both lttoolbox and apertium''.


=== lttoolbox functions ===
[[lttoolbox]] can do the following:
* Compile: make binary files out of the .dix files (lt-comp),
* Process: analyse or generate text (lt-proc) and
* Expand: Expand a dictionary .dix file (lt-expand).


The Java port of lttoolbox can also handle:
* [[Compounds]] (experimental)
* [[lttoolbox-java/Flag diacritics]] (highly experimental)
* Validation of .dix files


=== apertium runtime functions ===

The Java port implements the typical functions used by Apertium at runtime:

* Read .mode files and execute the steps included in them
* Execute the tagger
* Execute transfer stages (all 3 of them)

The Java port still needs the C++ binaries for preparing or developing a language pair, among other things to compile transfer files and train the tagger.
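
A .mode file essentially describes a pipeline of commands, so "executing the steps" starts with splitting that pipeline into stages. The following is only a minimal sketch of that idea, not the actual lttoolbox-java code: the class name <code>ModeSketch</code>, the file names in the example pipeline, and the simplification that stages are separated by <code>|</code> and arguments by whitespace (quoted paths containing spaces are not handled) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of lttoolbox-java.
final class ModeSketch {

    /** Splits a pipeline string into stages; each stage is the program name followed by its arguments. */
    static List<List<String>> parsePipeline(String modeLine) {
        List<List<String>> stages = new ArrayList<>();
        for (String stage : modeLine.split("\\|")) {
            String trimmed = stage.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty stages, e.g. a trailing '|'
            }
            // Simplification: arguments are split on whitespace, quoting is ignored.
            stages.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return stages;
    }

    public static void main(String[] args) {
        // Illustrative pipeline; the .bin and .prob file names are made up.
        String example = "lt-proc en-eo.automorf.bin | apertium-tagger -g en-eo.prob | lt-proc -g en-eo.autogen.bin";
        for (List<String> stage : parsePipeline(example)) {
            System.out.println(stage);
        }
    }
}
```

Each resulting stage could then be mapped to the corresponding Java class (or an external process) and chained by passing the output of one stage as the input of the next.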


The Java port can also:
* Generate [[bytecode for transfer]] and execute it. The bytecode typically runs 10 times faster than the C++ version.

==Why==

A "Java port" of Apertium enables use on

* Windows,
* Android phones,
* cross-platform desktop applications,
* Java server applications.

The last two are relevant because, for example, a LibreOffice plugin should be platform-independent to be maintainable.

We haven't seen anyone embedding Apertium in a desktop application. Currently Apertium is usable from a local subdirectory, but installation isn't trivial for an end user.

Having a packaged, easy-to-use version of Apertium ready for embedding MT in a larger program would be very cool.
Ideally there should be a self-contained Apertium JAR file, dependent only on the JRE, plus one additional JAR file per language pair.

Another "embedding" approach is to use a client stub to one of our [[Apertium services]], but there can be reasons why people prefer to have things installed locally (we don't need to repeat them here).

==Features==
* Binary compatibility with lttoolbox. lttoolbox-java is able to '''read''' and '''write''' the binary files of lttoolbox, and generates exactly the same output.
* There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.

==Installation==
===Prerequisites===
* java-runtime
* apache-ant (for compilation)

Under Arch Linux, you can install the prerequisites with

 pacman -S openjdk6 apache-ant

Under Debian/Ubuntu:

 sudo apt-get install ant ant-optional      # what else??

===Download, compile, install===
Download the newest [https://github.com/apertium/lttoolbox-java/releases release] or check out from github:
<pre>
git clone https://github.com/apertium/lttoolbox-java/
</pre>

Build with NetBeans or from a Unix shell, whichever suits you best:
<pre>
sh autogen.sh
make
sudo make install
</pre>

You can also build and install using Maven 2 (http://maven.apache.org), by typing:
<pre>
mvn install -DskipTests
</pre>

See also the README file

== Usage ==
<pre>
$ java -jar dist/lttoolbox.jar
lttoolbox: is a toolbox for lexical processing, morphological analysis and generation of words
USAGE: java -jar dist/lttoolbox.jar [task]
Examples:
java -jar dist/lttoolbox.jar lt-expand dictionary.dix expands a dictionary
java -jar dist/lttoolbox.jar lt-comp lr dic.dix dic.bin compiles a dictionary
java -jar dist/lttoolbox.jar lt-proc dic.bin morphological analysis
</pre>

or, using the shell scripts:

<pre>
$ lt-comp-j
v3.2j: build a letter transducer from a dictionary
USAGE: LTComp lr | rl dictionary_file output_file [acx_file]
Modes:
lr: left-to-right compilation
rl: right-to-left compilation
</pre>


<pre>
$ lt-proc-j
LTProc: process a stream with a letter transducer
USAGE: LTProc [-c] [-a|-g|-n|-d|-b|-p|-s|-t] fst_file [input_file [output_file]]
Options:
-a: morphological analysis (default behavior)
-c: use the literal case of the incoming characters
-e: morphological analysis, with compound analysis on unknown words
-f: match flags (experimental)
-g: morphological generation
-n: morph. generation without unknown word marks
-d: morph. generation with all the stuff
-t: morph. generation, but retaining part-of-speech
-p: post-generation
-s: SAO annotation system input processing
-t: apply transliteration dictionary
-z: flush output on the null character
-v: version
-D: debug; print diagnostics to stderr
-h: show this help
</pre>


<pre>
$ lt-expand-j
v3.2j: expand the contents of a dictionary file
USAGE: LTExpand dictionary_file [output_file]
</pre>


<pre>
$ lt-validate-j
v3.2j: validate an XML file according to a schema
USAGE : LTValidate -dix dictionary.xml
LTValidate -acx dictionary.acx
</pre>
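
For embedding, the <code>java -jar dist/lttoolbox.jar [task]</code> invocation shown above can also be started from another Java program. The following is a rough sketch under stated assumptions, not an official API: the class <code>LttoolboxRunner</code> and its methods are hypothetical, the jar path and <code>dic.bin</code> are taken from the usage text, and the input is assumed to be small enough to be written to the child process before its output is read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper, not part of lttoolbox-java.
final class LttoolboxRunner {

    /** Builds the command line: java -Dfile.encoding=UTF-8 -jar <jar> <task> <task arguments...> */
    static List<String> command(String jarPath, String task, String... taskArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-Dfile.encoding=UTF-8", "-jar", jarPath, task));
        cmd.addAll(Arrays.asList(taskArgs));
        return cmd;
    }

    /** Runs the command, feeds it the input on stdin and returns stdout (merged with stderr) as text. */
    static String run(List<String> cmd, String input) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        // Write everything first, then read: fine for short inputs only,
        // large inputs would need a separate writer thread.
        try (OutputStream stdin = process.getOutputStream()) {
            stdin.write(input.getBytes(StandardCharsets.UTF_8));
        }
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stdout.read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
        }
        process.waitFor();
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Only prints the command; calling run(...) requires the jar and a compiled dictionary.
        List<String> cmd = command("dist/lttoolbox.jar", "lt-proc", "dic.bin");
        System.out.println(String.join(" ", cmd));
    }
}
```

Starting a separate JVM per call is slow; for anything beyond experiments, calling the lttoolbox-java classes directly from the same JVM is likely the better design.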

===Examples===

Use the new compounding feature:
echo "lambakjöti" | java -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
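
The output line above is in the Apertium stream format: a lexical unit starts with <code>^</code>, ends with <code>$</code>, and contains the surface form followed by its analyses, separated by <code>/</code>; the parts of a compound are joined with <code>+</code>. A small sketch of how a program might take such a unit apart is given below. The class <code>LexicalUnitSketch</code> is hypothetical, and the sketch assumes that no escaped <code>/</code>, <code>+</code>, <code>^</code> or <code>$</code> characters occur in the unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of lttoolbox-java; ignores escaped characters.
final class LexicalUnitSketch {
    private static final Pattern TAG = Pattern.compile("<([^>]+)>");

    /** Strips the leading '^' and trailing '$' of a lexical unit. */
    private static String body(String unit) {
        if (unit.length() < 2 || !unit.startsWith("^") || !unit.endsWith("$")) {
            throw new IllegalArgumentException("not a lexical unit: " + unit);
        }
        return unit.substring(1, unit.length() - 1);
    }

    /** The surface form is everything before the first '/'. */
    static String surface(String unit) {
        return body(unit).split("/", 2)[0];
    }

    /** All analyses, i.e. the '/'-separated fields after the surface form. */
    static List<String> analyses(String unit) {
        String[] fields = body(unit).split("/");
        return new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
    }

    /** Splits one analysis into its compound parts at '+'. */
    static List<String> compoundParts(String analysis) {
        return Arrays.asList(analysis.split("\\+"));
    }

    /** The lemma is the text before the first tag. */
    static String lemma(String part) {
        int firstTag = part.indexOf('<');
        return firstTag < 0 ? part : part.substring(0, firstTag);
    }

    /** The tag names of one part, without the angle brackets. */
    static List<String> tags(String part) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG.matcher(part);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // \u00f6 is 'ö'; escapes keep the source file independent of its encoding.
        String unit = "^lambakj\u00f6ti/lamb<n><nt><pl><gen><ind>+kj\u00f6t<n><nt><sg><dat><ind>$";
        System.out.println("surface: " + surface(unit));
        for (String analysis : analyses(unit)) {
            for (String part : compoundParts(analysis)) {
                System.out.println(lemma(part) + " " + tags(part));
            }
        }
    }
}
```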

===Encoding problems===
If non-ASCII characters come out garbled, try -Dfile.encoding=UTF-8, for example:

echo "lambakjöti" | java -Dfile.encoding=UTF-8 -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
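
Why the flag matters can be illustrated with a few lines of Java: the same bytes give different text depending on the charset used to decode them. This is only a sketch with a hypothetical class name (<code>EncodingSketch</code>); it decodes UTF-8 bytes as ISO-8859-1 to imitate a JVM whose default encoding does not match the data, which is one possible cause of garbled output, not necessarily the one you are seeing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class, not part of lttoolbox-java.
final class EncodingSketch {

    /** Encodes the text as UTF-8 and decodes the bytes with the wrong charset (ISO-8859-1). */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "lambakj\u00f6ti"; // \u00f6 is 'ö'
        System.out.println("default charset of this JVM: " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes: " + word.getBytes(StandardCharsets.UTF_8).length);
        // 'ö' is two bytes in UTF-8, so the wrongly decoded string has two characters in its place.
        System.out.println("decoded with the wrong charset: " + misdecode(word));
    }
}
```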

===Mac users===
You need JDK1.6. Try

/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Commands/java -jar dist/lttoolbox.jar


=== Windows usage ===

By default, the Windows console uses a legacy (OEM) code page, whereas Apertium's data
is encoded in UTF-8. This command switches the console to UTF-8:

<pre>
chcp 65001
</pre>

Note: you also need to use a Unicode-capable font for the Windows console, such as Lucida Console (Properties -> Font).

Also, don't forget to set these runtime flags: -Xms64m -Xmx800m -Dfile.encoding=UTF-8

== Reasons for a Java port ==
* There are several devices (mobile phones, for example) which can run quite complicated software, but only if it is written in Java. lttoolbox is the first step towards having Apertium run on these devices.
* Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
* Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
* Transfer in bytecode promises a speedup of a factor of 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.

== Performance of Java port ==

Compatibility and performance can be checked by invoking test_java_and_c.sh in testdata/regression.

=== Single-core processor (Jimmy O'Regan)===
<pre>
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18~pre4-1ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
C analysis is... 0.59sec
OK
Java analysis is... 1.15sec
OK
C generator -g is ... 0.54sec
OK
Java generator -g is ... 1.13sec
OK
C generator -d is ... 0.56sec
OK
Java generator -d is ... 1.12sec
OK
C generator -n is ... 0.52sec
OK
Java generator -n is ... 1.12sec
OK
C postgenerator -p is ... 0.07sec
OK
Java postgenerator -p is ... 0.33sec
OK
All tests passed
</pre>

=== Dual-core processor (Jacob)===
<pre>
Java HotSpot(TM) Client VM (build 1.6.0-beta2-b86, mixed mode, sharing)
C analysis is... 0.39sec
OK
Java analysis is... 0.66sec
OK
C generator -g is ... 0.32sec
OK
Java generator -g is ... 0.62sec
OK
C generator -d is ... 0.33sec
OK
Java generator -d is ... 0.58sec
OK
C generator -n is ... 0.32sec
OK
Java generator -n is ... 0.64sec
OK
C postgenerator -p is ... 0.03sec
OK
Java postgenerator -p is ... 0.20sec
OK
All tests passed
</pre>

As you can see, the Java version is currently (April 2010) about a factor of 2 slower than the C version. There are ways to remedy this (using collection classes for primitive types), but this hasn't been implemented, as no one has requested it.
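
The "factor 2" can be checked directly from the timings listed above by dividing each Java time by the corresponding C time. The short sketch below does just that; the class name <code>SlowdownSketch</code> is made up, and the numbers are copied from the two test runs.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only; timings are copied from the test output above.
final class SlowdownSketch {

    /** How many times slower the Java run is than the C run. */
    static double slowdown(double cSeconds, double javaSeconds) {
        return javaSeconds / cSeconds;
    }

    public static void main(String[] args) {
        String[] names = {"analysis", "generator -g", "generator -d", "generator -n", "postgenerator -p"};
        // Each row is {C time, Java time} in seconds.
        double[][] singleCore = {{0.59, 1.15}, {0.54, 1.13}, {0.56, 1.12}, {0.52, 1.12}, {0.07, 0.33}};
        double[][] dualCore = {{0.39, 0.66}, {0.32, 0.62}, {0.33, 0.58}, {0.32, 0.64}, {0.03, 0.20}};
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-18s single-core x%.2f   dual-core x%.2f%n",
                    names[i],
                    slowdown(singleCore[i][0], singleCore[i][1]),
                    slowdown(dualCore[i][0], dualCore[i][1]));
        }
    }
}
```

For analysis and generation the ratios lie between roughly 1.7 and 2.2. Post-generation is the outlier (about 4.7 and 6.7), but its absolute times are so small that the ratio is dominated by fixed costs.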

It still gives great performance, however, and Apertium running on Java is very fast compared to other MT systems. The overhead of using the Java version instead of the C version is negligible, as transfer is the big resource hog anyway.

The above test compares only the basic lttoolbox functions. As Java transfer is much faster, the results of performance tests of a pure-Java chain and a pure-C++ chain are comparable (and mostly in Java's favor). A hybrid can be made which beats the performance of both systems.

== Known bugs ==

Oct 2019:
lttoolbox-java can read files compiled with lttoolbox version 3.5.0.
*note* lttoolbox-java currently lacks support for functionality added in the last 5 years, and it doesn't work on Java JDK 9+, as it uses BCEL classes that were embedded in Java 8 but changed package name in Java 9.
Instead, we should include BCEL as a library.

==Thanks==
* Nic Cottrell contributed an initial version of a Java port of [[lttoolbox]].
* During [[Google Summer of Code|GSOC2009]] [[User:Rah|Raphaël]] and [[User:Sortiz|Sergio]] worked on it, but processing still didn't work (compilation and expansion worked).
* November 2009 [[User:Jacob Nordfalk|Jacob Nordfalk]] finished it up and optimized some parts of it
* During GSOC 2010 Jacob mentored a full [[User:Kanmuri/GSoC 2010 Application/Java Runtime Port|Java Runtime Port of Apertium]] (i.e. not just lttoolbox) made by [[User:Kanmuri|Stephen Tigner]].

[[Category:Lttoolbox]]
[[Category:Documentation in English]]

Latest revision as of 09:49, 7 April 2020
