Lttoolbox-java
What is lttoolbox
lttoolbox is a toolbox for lexical processing. It is used for:
- making binary files out of the .dix files (lt-comp),
- analysing or generating text (lt-proc) and
- expanding a .dix file (lt-expand).
The Java port is also capable of:
- Generating bytecode for transfer
- Validating .dix files
- Compounding (experimental)
- Flag diacritics (highly experimental)
Features
- Binary compatibility with lttoolbox. lttoolbox-java is able to read and write the binary files of lttoolbox, and it generates exactly the same output.
- A comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.
Installation
Download the newest release or check out from SVN:
svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox-java
Build with NetBeans or from a Unix shell, whichever suits you best:
sh autogen.sh
make
sudo make install
You can also build and install using Maven 2 (http://maven.apache.org), by typing:
mvn install -DskipTests
See also the README file
Usage
$ java -jar dist/lttoolbox.jar
lttoolbox: is a toolbox for lexical processing, morphological analysis and generation of words
USAGE: java -jar dist/lttoolbox.jar [task]
Examples:
  java -jar dist/lttoolbox.jar lt-expand dictionary.dix      expands a dictionary
  java -jar dist/lttoolbox.jar lt-comp lr dic.dix dic.bin    compiles a dictionary
  java -jar dist/lttoolbox.jar lt-proc dic.bin               morphological analysis
or, using the shell scripts:
$ lt-comp-j
v3.2j: build a letter transducer from a dictionary
USAGE: LTComp lr | rl dictionary_file output_file [acx_file]
Modes:
  lr: left-to-right compilation
  rl: right-to-left compilation

$ lt-proc-j
LTProc: process a stream with a letter transducer
USAGE: LTProc [-c] [-a|-g|-n|-d|-b|-p|-s|-t] fst_file [input_file [output_file]]
Options:
  -a: morphological analysis (default behavior)
  -c: use the literal case of the incoming characters
  -e: morphological analysis, with compound analysis on unknown words
  -f: match flags (experimental)
  -g: morphological generation
  -n: morph. generation without unknown word marks
  -d: morph. generation with all the stuff
  -t: morph. generation, but retaining part-of-speech
  -p: post-generation
  -s: SAO annotation system input processing
  -t: apply transliteration dictionary
  -z: flush output on the null character
  -v: version
  -D: debug; print diagnostics to stderr
  -h: show this help

$ lt-expand-j
v3.2j: expand the contents of a dictionary file
USAGE: LTExpand dictionary_file [output_file]

$ lt-validate-j
v3.2j: validate an XML file according to a schema
USAGE: LTValidate -dix dictionary.xml
       LTValidate -acx dictionary.acx
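To illustrate what "validating an XML file according to a schema" can look like from Java, here is a minimal sketch using the standard javax.xml.validation API. It is not taken from the LTValidate source: the choice of W3C XML Schema, the class name ValidateDemo and the toy schema below are all assumptions made for illustration, and the toy schema is not the real .dix or .acx schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class ValidateDemo {

    // Hypothetical toy schema: a <dictionary> holding one or more <e> entries.
    // This is NOT the real .dix schema; it only keeps the example self-contained.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"dictionary\">"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name=\"e\" type=\"xs:string\" maxOccurs=\"unbounded\"/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML text is well-formed and valid against TOY_XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(TOY_XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<dictionary><e>lamb</e></dictionary>"));
        System.out.println(isValid("<dictionary><x/></dictionary>"));
    }
}
```

For real dictionaries, use lt-validate-j as shown above; the sketch only shows the general shape of schema validation.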
Examples
Use the new compounding feature:
echo "lambakjöti" | java -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
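The output line above is in the Apertium stream format: a lexical unit is delimited by ^ and $, the surface form comes first, and each analysis follows after a /. In a compound analysis the parts are joined with +. The following sketch takes such a unit apart. The class and method names are hypothetical, and it deliberately ignores backslash-escaped characters, so treat it as an illustration rather than a full parser.

```java
import java.util.Arrays;
import java.util.List;

final class LexicalUnitDemo {

    /** Splits "^surface/analysis1/analysis2$" into [surface, analysis1, analysis2]. */
    static List<String> splitUnit(String unit) {
        if (!unit.startsWith("^") || !unit.endsWith("$")) {
            throw new IllegalArgumentException("not a lexical unit: " + unit);
        }
        String body = unit.substring(1, unit.length() - 1);
        return Arrays.asList(body.split("/"));
    }

    /** Splits one analysis into its compound parts at '+'. */
    static List<String> compoundParts(String analysis) {
        return Arrays.asList(analysis.split("\\+"));
    }

    public static void main(String[] args) {
        // \u00f6 is the letter o with diaeresis, written as an escape so the
        // example does not depend on the encoding of the source file.
        String unit = "^lambakj\u00f6ti/lamb<n><nt><pl><gen><ind>+kj\u00f6t<n><nt><sg><dat><ind>$";
        List<String> fields = splitUnit(unit);
        System.out.println("surface form: " + fields.get(0));
        for (String analysis : fields.subList(1, fields.size())) {
            System.out.println("analysis: " + analysis);
            for (String part : compoundParts(analysis)) {
                System.out.println("  part: " + part);
            }
        }
    }
}
```

For the unit above this yields one analysis with two compound parts, one starting with lamb and one starting with kjöt.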
Encoding problems
If non-ASCII characters come out garbled, try passing -Dfile.encoding=UTF-8, for example:
echo "lambakjöti" | java -Dfile.encoding=UTF-8 -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin
^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
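Why the encoding matters can be seen in a small sketch: the same UTF-8 bytes decode to different text depending on the charset used. The example assumes that the problem arises from input being decoded with a default charset other than UTF-8; whether lttoolbox-java relies on the default charset in every code path is not stated here. EncodingDemo and misread are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    /** Encodes the text as UTF-8, then decodes the bytes with another charset. */
    static String misread(String text, Charset decoder) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decoder);
    }

    public static void main(String[] args) {
        // "lambakjöti", with the non-ASCII letter written as an escape.
        String word = "lambakj\u00f6ti";

        String asUtf8 = misread(word, StandardCharsets.UTF_8);
        String asLatin1 = misread(word, StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8 the word round-trips unchanged (10 characters).
        System.out.println("UTF-8 round trip intact: " + asUtf8.equals(word));
        // Decoded as ISO-8859-1 the two-byte letter becomes two characters,
        // so the word no longer matches any dictionary entry (11 characters).
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

A word mangled this way is simply treated as unknown by the analyser, which is why a wrong encoding tends to show up as missing analyses rather than as an error message.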
Mac users
You need JDK 1.6. If the default java is older, try calling the 1.6 binary directly:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Commands/java -jar dist/lttoolbox.jar
Windows usage
By default, the Windows console does not use UTF-8, whereas Apertium's data is encoded in UTF-8. This command switches the console window to the UTF-8 code page:
chcp 65001
Note: you also need to use a Unicode-capable font for the Windows console, such as Lucida Console (Properties -> Font).
Also, don't forget to set these runtime flags: -Xms64m -Xmx800m -Dfile.encoding=UTF-8
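To check that the runtime flags actually reached the JVM, a tiny helper like the following can be compiled and run with the same flags. It is only a sketch: RuntimeCheck is a hypothetical class, not part of lttoolbox-java, and it uses nothing beyond standard JDK calls.

```java
import java.nio.charset.Charset;

final class RuntimeCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Value of the system property as passed with -Dfile.encoding=...
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // Charset the JVM uses when none is given explicitly.
        System.out.println("default charset = " + Charset.defaultCharset());
        // Reflects -Xmx; the reported figure may be somewhat below the requested value.
        System.out.println("max heap (MB)   = " + maxHeapMegabytes());
    }
}
```

Run it as java -Xms64m -Xmx800m -Dfile.encoding=UTF-8 RuntimeCheck after compiling with javac, and compare the printed values with what you intended to set.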
Reasons for a Java port
- There are several devices (mobile phones, for example) which can run quite complicated software, but only if it is written in Java. Porting lttoolbox is the first step towards having Apertium run on these devices.
- Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
- Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
- Transfer in bytecode promises a speedup of about a factor of 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.
State of Java port
Compatibility and performance can be checked by invoking test_java_and_c.sh in testdata/regression.
C analysis is... 0.40sec OK
Java analysis is... 1.60sec OK
C generator -g is ... 0.34sec OK
Java generator -g is ... 1.19sec OK
C generator -d is ... 0.33sec OK
Java generator -d is ... 1.22sec OK
C generator -n is ... 0.34sec OK
Java generator -n is ... 1.22sec OK
C postgenerator -p is ... 0.04sec OK
Java postgenerator -p is ... 0.67sec OK
All tests passed
As the output shows, the Java version is currently (March 2010) roughly a factor of 2-6 slower than the C version (in the run above, post-generation shows a larger gap, 0.67 s against 0.04 s). There are ways to remedy this (among others, using collection classes specialised for primitive types), but they haven't been implemented, as no one has requested it.
The Java version still performs well, however, and Apertium running on Java is very fast compared to other MT systems. The overhead of using the Java version instead of the C version is negligible in a full pipeline, as transfer is the big resource hog anyway.
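The slowdown factor for each test is simply the Java time divided by the C time. The sketch below does that arithmetic on the timings quoted above; SlowdownDemo is a hypothetical helper, and the numbers are copied from the test output, not re-measured.

```java
final class SlowdownDemo {

    /** How many times slower the Java run was than the C run. */
    static double slowdown(double cSeconds, double javaSeconds) {
        return javaSeconds / cSeconds;
    }

    public static void main(String[] args) {
        // Timings in seconds, copied from the test_java_and_c.sh output above.
        String[] tests     = {"analysis", "generator -g", "generator -d", "generator -n", "postgenerator -p"};
        double[] cTimes    = {0.40, 0.34, 0.33, 0.34, 0.04};
        double[] javaTimes = {1.60, 1.19, 1.22, 1.22, 0.67};

        for (int i = 0; i < tests.length; i++) {
            System.out.printf("%-18s %.2f%n", tests[i], slowdown(cTimes[i], javaTimes[i]));
        }
    }
}
```

Analysis comes out at a factor of 4, while post-generation has the largest ratio because its absolute times are the smallest.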
Known bugs
There are currently (January 2010) problems compiling some rare, unusual constructs (see testdata/strange.dix). You can use the C version to compile these; the resulting binary files work fine when used from lttoolbox-java.
Other notes
<Drew_> jacobEo: I can't find a main class in the source code, am I looking in the wrong place? :S
<jacobEo> Drew_: LTComp.java, LTExpand.java, LTProc.java
Thanks
- Nic Cottrell contributed an initial version of a Java port of lttoolbox.
- During GSoC 2009, Raphaël and Sergio worked on it, but processing still didn't work (compilation and expansion worked).
- In November 2009, Jacob Nordfalk finished it and optimized some parts of it.