



Note: After Apertium's migration to GitHub, this tool is read-only on the SourceForge repository and does not exist on GitHub. If you are interested in migrating this tool to GitHub, see Migrating tools to GitHub.


What is lttoolbox-java

lttoolbox-java is a Java port of the whole Apertium runtime system, including both lttoolbox and apertium.

lttoolbox functions

lttoolbox can do the following:

  • Compile: make binary files out of the .dix files (lt-comp),
  • Process: analyse or generate text (lt-proc), and
  • Expand: expand a dictionary .dix file (lt-expand).

apertium runtime functions

The Java port implements the typical functions used by Apertium during runtime.

  • Read .mode files and execute the steps included in them
  • Execute the tagger
  • Execute transfer stages (all 3 of them)

The Java port still needs the C++ binaries for preparing and developing a language pair, among other things to compile transfer files and train the tagger.
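
To make the first bullet concrete: a .mode file is essentially a one-line shell pipeline, so "executing the steps" starts with splitting that line at the pipe characters. The sketch below does only the splitting. The class name ModeSteps, the sample line, and its file names are illustrative assumptions and are not taken from lttoolbox-java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of lttoolbox-java: splits a mode pipeline into steps.
class ModeSteps {

    // Splits on '|' characters that are outside single quotes, trimming each step.
    static List<String> split(String modeLine) {
        List<String> steps = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : modeLine.toCharArray()) {
            if (c == '\'') {
                inQuotes = !inQuotes;
            }
            if (c == '|' && !inQuotes) {
                addStep(steps, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addStep(steps, current);
        return steps;
    }

    // Adds the trimmed text as a step unless it is empty.
    private static void addStep(List<String> steps, StringBuilder text) {
        String step = text.toString().trim();
        if (!step.isEmpty()) {
            steps.add(step);
        }
    }

    public static void main(String[] args) {
        // Made-up mode line; real ones are generated when a language pair is built.
        String mode = "lt-proc 'xx-yy.automorf.bin' | apertium-tagger -g 'xx-yy.prob'"
                + " | lt-proc -g 'xx-yy.autogen.bin'";
        for (String step : split(mode)) {
            System.out.println(step);
        }
    }
}
```

Each resulting step would then be mapped to the corresponding Java stage or external program; how lttoolbox-java does that mapping is not shown here.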

The Java port is also capable of the following:

  • Generate bytecode for transfer and execute it. The bytecode typically runs 10 times faster than the C++ version.

Why

A "Java port" of Apertium enables use on

  • Windows,
  • J2ME/Android phones,
  • web pages (applets),
  • desktop applications,
  • Java server applications.

The last two are relevant because, for example, an OpenOffice.org plugin should be platform-independent to be maintainable.

We haven't seen anyone embed Apertium in a desktop application. Currently Apertium can be used from a local subdirectory, but installation isn't trivial for an end user.

Having a packaged, easy-to-use version of Apertium ready for embedding MT in a larger program would be very cool. Ideally there should be a self-contained Apertium JAR file, dependent only on the JRE, plus one additional JAR file per language pair.

Another "embedding" approach is to use a client stub for one of our Apertium services, but there can be reasons why people prefer to have things installed locally (we don't need to repeat them here).

Features

  • Binary compatibility with lttoolbox: lttoolbox-java is able to read and write lttoolbox's binary files and generates exactly the same output.
  • There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.

Installation

Prerequisites:

  • java-runtime
  • apache-ant (for compilation)

Under Arch Linux, you can install the prerequisites with

pacman -S openjdk6 apache-ant

Under Debian/Ubuntu:

sudo apt-get install ant ant-optional      # what else??

Download, compile, install

Download the newest release or check out from SVN:

svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox-java

Use NetBeans or the Unix command line, whichever suits you best:

sh autogen.sh
sudo make install

You can also build and install using Maven 2 (http://maven.apache.org), by typing:

mvn install -DskipTests

See also the README file.

Usage

$ java -jar dist/lttoolbox.jar
lttoolbox: is a toolbox for lexical processing, morphological analysis and generation of words
USAGE: java -jar dist/lttoolbox.jar [task]
 java -jar dist/lttoolbox.jar lt-expand dictionary.dix     expands a dictionary
 java -jar dist/lttoolbox.jar lt-comp lr dic.dix dic.bin   compiles a dictionary
 java -jar dist/lttoolbox.jar lt-proc dic.bin              morphological analysis

or, using the shell scripts:

$ lt-comp-j 
 v3.2j: build a letter transducer from a dictionary
USAGE: LTComp lr | rl dictionary_file output_file [acx_file]
  lr:     left-to-right compilation
  rl:     right-to-left compilation

$ lt-proc-j
LTProc: process a stream with a letter transducer
USAGE: LTProc [-c] [-a|-g|-n|-d|-b|-p|-s|-t] fst_file [input_file [output_file]]
  -a:   morphological analysis (default behavior)
  -c:   use the literal case of the incoming characters
  -e:   morphological analysis, with compound analysis on unknown words
  -f:   match flags (experimental)
  -g:   morphological generation
  -n:   morph. generation without unknown word marks
  -d:   morph. generation with all the stuff
  -t:   morph. generation, but retaining part-of-speech
  -p:   post-generation
  -s:   SAO annotation system input processing
  -t:   apply transliteration dictionary
  -z:   flush output on the null character 
  -v:   version
  -D:   debug; print diagnostics to stderr
  -h:   show this help

$ lt-expand-j 
 v3.2j: expand the contents of a dictionary file
USAGE: LTExpand dictionary_file [output_file]

$ lt-validate-j 
 v3.2j: validate an XML file according to a schema
USAGE : LTValidate -dix dictionary.xml
        LTValidate -acx dictionary.acx
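
The -z option listed above is meant for long-running use: as null-flush mode is generally used in Apertium, a caller keeps one process alive, terminates each input with a NUL character, and reads the reply up to the next NUL. The following is a minimal client-side sketch of the reading half only. The class and method names are assumptions, and a StringReader stands in for the process's output stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical client-side helper for reading replies from an "lt-proc -z" process.
class NullFlushReader {

    // Reads characters up to (not including) the next NUL; returns null at end of stream.
    static String readSegment(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == 0) {
                    return sb.toString();
                }
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of stream: return any unterminated remainder, otherwise null.
        return sb.length() > 0 ? sb.toString() : null;
    }

    public static void main(String[] args) {
        // Stand-in for the process's output: two replies, each terminated by NUL.
        Reader replies = new StringReader("first reply\0second reply\0");
        System.out.println(readSegment(replies));
        System.out.println(readSegment(replies));
    }
}
```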

Examples

Use the new compounding feature:

echo "lambakjöti" | java -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
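
lt-proc writes its analyses in the Apertium stream format, in which each lexical unit looks like ^surface/analysis1/analysis2$ and the analysis of an unknown word starts with an asterisk. As a rough sketch of how a calling program might consume such output, the code below parses a single unit. The class name and the sample unit are illustrative assumptions (the sample is not real output of the is-en dictionary), and backslash escapes are deliberately not handled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of lttoolbox-java.
class LexicalUnit {
    final String surface;
    final List<String> analyses;

    LexicalUnit(String surface, List<String> analyses) {
        this.surface = surface;
        this.analyses = analyses;
    }

    // Parses "^surface/analysis1/analysis2$"; does not handle backslash escapes.
    static LexicalUnit parse(String unit) {
        if (unit.length() < 2 || !unit.startsWith("^") || !unit.endsWith("$")) {
            throw new IllegalArgumentException("not a lexical unit: " + unit);
        }
        String[] parts = unit.substring(1, unit.length() - 1).split("/");
        List<String> analyses = Arrays.asList(parts).subList(1, parts.length);
        return new LexicalUnit(parts[0], analyses);
    }

    // An unknown word has a single analysis that starts with '*'.
    boolean isUnknown() {
        return analyses.size() == 1 && analyses.get(0).startsWith("*");
    }

    public static void main(String[] args) {
        LexicalUnit lu = parse("^cats/cat<n><pl>$");   // invented sample unit
        System.out.println(lu.surface + " -> " + lu.analyses);
    }
}
```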


Encoding problems

Try -Dfile.encoding=UTF-8, for example:

echo "lambakjöti" | java -Dfile.encoding=UTF-8 -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
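
Why the flag can matter: a Java program that reads text without naming a charset uses the JVM default, which file.encoding influences on the Java versions this page targets. The self-contained sketch below involves no lttoolbox-java code, and its class name is made up. It shows what happens to the sample word when its UTF-8 bytes are decoded with a single-byte charset instead.

```java
import java.nio.charset.StandardCharsets;

// Illustration only; the class name EncodingDemo is made up.
class EncodingDemo {

    // Decodes the UTF-8 bytes of a string as if they were ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes the UTF-8 bytes of a string as UTF-8 again.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The sample word, with its non-ASCII letter escaped so the source is encoding-safe.
        String word = "lambakj\u00f6ti";
        System.out.println("original length:   " + word.length());
        System.out.println("misdecoded length: " + misdecode(word).length());
        System.out.println("round trip intact: " + roundTrip(word).equals(word));
    }
}
```

The misdecoded string is one character longer because the two-byte UTF-8 sequence for the non-ASCII letter is read as two separate characters, so a dictionary lookup on it can no longer match.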


Mac users

You need JDK 1.6. Try:

/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Commands/java -jar dist/lttoolbox.jar

Windows usage

By default, the Windows console uses a legacy code page rather than UTF-8, whereas Apertium's data is encoded in UTF-8. This command switches the console to UTF-8:

chcp 65001

Note: you also need to use a Unicode-capable font for the Windows console, such as Lucida Console (Properties -> Font).

Also, don't forget to set these JVM runtime flags: -Xms64m -Xmx800m -Dfile.encoding=UTF-8
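
If lttoolbox-java is launched from another Java program, the flags can be assembled in code rather than typed by hand. The sketch below builds the command line only. The class and method names are assumptions, the heap limit is read as 800 MB (-Xmx800m), and the JAR and dictionary paths have to be adapted to the local setup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher helper, not part of lttoolbox-java.
class LtProcCommand {

    // Builds: java <JVM flags> -jar <jarPath> lt-proc <ltProcArgs...>
    static List<String> build(String jarPath, String... ltProcArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xms64m");
        cmd.add("-Xmx800m");                 // assumption: 800 MB heap limit
        cmd.add("-Dfile.encoding=UTF-8");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("lt-proc");
        cmd.addAll(Arrays.asList(ltProcArgs));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("dist/lttoolbox.jar", "-e", "is.bin");
        System.out.println(String.join(" ", cmd));
        // To actually run it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```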

Reasons for a Java port

  • There are several devices (mobile phones, for example) which can run quite complicated software, but only if it is written in Java. lttoolbox is the first step towards having Apertium run on these devices.
  • Windows port. It won't be as powerful as on Unix-based systems, but it will be there.
  • Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
  • Transfer in bytecode promises a speedup factor of 4 compared to what we use now (interpreted XML), and transfer CPU usage dominates when processing large amounts of text.

Performance of Java port

Compatibility and performance can be checked by invoking test_java_and_c.sh in testdata/regression.

Single-core processor (Jimmy O'Regan)

java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18~pre4-1ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)
C analysis is... 0.59sec
Java analysis is... 1.15sec
C generator -g is ... 0.54sec
Java generator -g is ... 1.13sec
C generator -d is ... 0.56sec
Java generator -d is ... 1.12sec
C generator -n is ... 0.52sec
Java generator -n is ... 1.12sec
C postgenerator -p is ... 0.07sec
Java postgenerator -p is ... 0.33sec
All tests passed

Dual-core processor (Jacob)

Java HotSpot(TM) Client VM (build 1.6.0-beta2-b86, mixed mode, sharing)
C analysis is... 0.39sec
Java analysis is... 0.66sec
C generator -g is ... 0.32sec
Java generator -g is ... 0.62sec
C generator -d is ... 0.33sec
Java generator -d is ... 0.58sec
C generator -n is ... 0.32sec
Java generator -n is ... 0.64sec
C postgenerator -p is ... 0.03sec
Java postgenerator -p is ... 0.20sec
All tests passed

As you can see, the Java version is currently (April 2010) about a factor of 2 slower than the C version. There are ways to remedy this (using collection classes for primitive types), but this hasn't been implemented, as no one has requested it.
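
The factor of 2 can be checked against the figures above. This is a minimal sketch with the timings copied by hand from the two listings (the class name is arbitrary). It computes the Java/C ratio for each test.

```java
// Timings in seconds, copied from the two benchmark listings on this page.
class SlowdownRatios {
    static final String[] TESTS = {
        "analysis", "generator -g", "generator -d", "generator -n", "postgenerator -p"
    };
    static final double[] C_SINGLE = {0.59, 0.54, 0.56, 0.52, 0.07};
    static final double[] JAVA_SINGLE = {1.15, 1.13, 1.12, 1.12, 0.33};
    static final double[] C_DUAL = {0.39, 0.32, 0.33, 0.32, 0.03};
    static final double[] JAVA_DUAL = {0.66, 0.62, 0.58, 0.64, 0.20};

    // Element-wise Java time divided by C time.
    static double[] ratios(double[] cTimes, double[] javaTimes) {
        double[] r = new double[cTimes.length];
        for (int i = 0; i < cTimes.length; i++) {
            r[i] = javaTimes[i] / cTimes[i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] single = ratios(C_SINGLE, JAVA_SINGLE);
        double[] dual = ratios(C_DUAL, JAVA_DUAL);
        for (int i = 0; i < TESTS.length; i++) {
            System.out.printf("%-18s single-core %.1fx, dual-core %.1fx%n",
                    TESTS[i], single[i], dual[i]);
        }
    }
}
```

For analysis and generation the ratios fall between roughly 1.7 and 2.2. The post-generator is the outlier at well above 4, although its absolute run times are very short.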

It still gives great performance, however, and Apertium running on Java is very fast compared to other MT systems. The overhead of using the Java version instead of the C version is negligible, as transfer is the big resource hog anyway.

The above test compares only the basic lttoolbox functions. As Java transfer is much faster, the results of performance tests of a pure-Java and a pure-C++ chain are comparable (and mostly in Java's favor). A hybrid can be made which beats the performance of both systems.

Known bugs

There are currently (Jan 2010) problems compiling some strange, very rarely used constructs (testdata/strange.dix). You can use the C version to compile these, and the binary files will work fine when used from lttoolbox-java.

Thanks
