Lttoolbox-java

Revision as of 14:01, 26 January 2010

What is lttoolbox

lttoolbox is a set of tools for 1) compiling .dix dictionary files into binary files (lt-comp), 2) analysing or generating text (lt-proc), and 3) expanding a .dix file (lt-expand).

Usage

$ java -jar dist/lttoolbox.jar
lttoolbox: is a toolbox for lexical processing, morphological analysis and generation of words
USAGE: java -jar dist/lttoolbox.jar [task]
Examples:
 java -jar dist/lttoolbox.jar lt-expand dictionary.dix     expands a dictionary
 java -jar dist/lttoolbox.jar lt-comp lr dic.dix dic.bin   compiles a dictionary
 java -jar dist/lttoolbox.jar lt-proc dic.bin              morphological analysis

or, using the shell scripts:

$ lt-comp-j 
 v3.2j: build a letter transducer from a dictionary
USAGE: LTComp lr | rl dictionary_file output_file [acx_file]
Modes:
  lr:     left-to-right compilation
  rl:     right-to-left compilation


$ lt-proc-j
LTProc: process a stream with a letter transducer
USAGE: LTProc [-c] [-a|-g|-n|-d|-b|-p|-s|-t] fst_file [input_file [output_file]]
Options:
  -a:   morphological analysis (default behavior)
  -c:   use the literal case of the incoming characters
  -e:   morphological analysis, with compound analysis on unknown words
  -f:   match flags (experimental)
  -g:   morphological generation
  -n:   morph. generation without unknown word marks
  -d:   morph. generation with all the stuff
  -t:   morph. generation, but retaining part-of-speech
  -p:   post-generation
  -s:   SAO annotation system input processing
  -t:   apply transliteration dictionary
  -z:   flush output on the null character 
  -v:   version
  -D:   debug; print diagnostics to stderr
  -h:   show this help


$ lt-expand-j 
 v3.2j: expand the contents of a dictionary fileUSAGE: LTExpand dictionary_file [output_file]
$ lt-validate-j 
 v3.2j: validate an XML file according to a schema
USAGE : LTValidate -dix dictionary.xml
        LTValidate -acx dictionary.acx

Examples

Use the new compounding feature:

echo "lambakjöti" | java -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin

^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
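The output above is in the Apertium stream format: a lexical unit is wrapped in `^` and `$`, the surface form comes first, analyses follow after `/`, and `+` joins the parts of a compound. As a minimal sketch of how such a line could be taken apart, assuming no escaped characters and looking only at the first analysis (the class `CompoundAnalysisDemo` and its methods are hypothetical and not part of lttoolbox-java):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of lttoolbox-java.
final class CompoundAnalysisDemo {

    /** Strips the leading '^' and trailing '$' of one lexical unit. */
    static String body(String unit) {
        if (unit.length() < 2 || unit.charAt(0) != '^'
                || unit.charAt(unit.length() - 1) != '$') {
            throw new IllegalArgumentException("not a lexical unit: " + unit);
        }
        return unit.substring(1, unit.length() - 1);
    }

    /** The surface form is everything before the first '/'. */
    static String surfaceForm(String unit) {
        String b = body(unit);
        int slash = b.indexOf('/');
        return slash < 0 ? b : b.substring(0, slash);
    }

    /** Splits the first analysis into its '+'-joined compound parts. */
    static List<String> firstAnalysisParts(String unit) {
        String[] fields = body(unit).split("/");
        if (fields.length < 2) {
            return Collections.emptyList(); // no analysis present
        }
        return Arrays.asList(fields[1].split("\\+"));
    }

    public static void main(String[] args) {
        // \u00f6 is 'ö', escaped so the source compiles under any source encoding.
        String unit = "^lambakj\u00f6ti/lamb<n><nt><pl><gen><ind>"
                + "+kj\u00f6t<n><nt><sg><dat><ind>$";
        System.out.println(surfaceForm(unit));
        for (String part : firstAnalysisParts(unit)) {
            System.out.println("  " + part);
        }
    }
}
```

For the example word this yields two parts, one per compound element, each with its own tag sequence.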

Encoding problems

If non-ASCII characters are not handled correctly, try passing -Dfile.encoding=UTF-8 to the JVM, like this:

echo "lambakjöti" | java -Dfile.encoding=UTF-8 -jar dist/lttoolbox.jar lt-proc -e /home/j/esperanto/apertium/apertium-is-en/is.bin

^lambakjöti/lamb<n><nt><pl><gen><ind>+kjöt<n><nt><sg><dat><ind>$
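To see why the default charset matters, here is a minimal Java sketch (the class `EncodingDemo` is hypothetical and says nothing about how lttoolbox-java itself reads its input). It decodes the UTF-8 bytes of the example word once as UTF-8 and once as ISO-8859-1. Java code that relies on the platform default charset would produce the second, garbled form when that default is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo for illustration only; not part of lttoolbox-java.
final class EncodingDemo {

    /** Decodes raw bytes with an explicitly chosen charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The charset the JVM uses when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // \u00f6 is 'ö'; in UTF-8 it is encoded as the two bytes C3 B6.
        byte[] utf8 = "lambakj\u00f6ti".getBytes(StandardCharsets.UTF_8);

        // Correct decoding gives back the original 10 characters.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).length());

        // Decoding as ISO-8859-1 turns each byte into one character,
        // so 'ö' becomes two characters and the word grows to 11.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```

Running it with and without `-Dfile.encoding=UTF-8` shows whether the option changes the reported default on a given JVM.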

Mac users

You need JDK 1.6. Try:

/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Commands/java -jar dist/lttoolbox.jar

Reasons for a Java port

  • There are several devices (mobile phones, for example) that can run quite complicated software, but only if it is written in Java. Porting lttoolbox is the first step towards having Apertium run on these devices.
  • Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
  • Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
  • Transfer in bytecode promises a speedup by a factor of 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.

State of Java port

j@j-laptop-nova:~/esperanto/apertium/lttoolbox-java/testdata/regression$ ./compare_java_and_c.sh
C analysis is... 0.39sec
OK
Java analysis is... 1.91sec
OK
C generator -g is ... 0.33sec
OK
Java generator -g is ... 1.26sec
OK
C generator -d is ... 0.33sec
OK
Java generator -d is ... 1.27sec
OK
C generator -n is ... 0.33sec
OK
Java generator -n is ... 1.25sec
OK
C postgenerator -p is ... 0.04sec
OK
Java postgenerator -p is ... 0.72sec
OK
All tests passed

--Jacob Nordfalk 08:52, 30 November 2009 (UTC)
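To put the timings above side by side, the following sketch computes the Java-to-C ratio for each test. The figures are copied from the script output; the class `TimingRatios` is hypothetical, and a single run like this is only a rough indication.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the test suite.
final class TimingRatios {

    /** How many times longer the Java run took than the C run. */
    static double slowdown(double cSeconds, double javaSeconds) {
        return javaSeconds / cSeconds;
    }

    public static void main(String[] args) {
        String[] tests = {
            "analysis", "generator -g", "generator -d",
            "generator -n", "postgenerator -p"
        };
        // Seconds as reported by compare_java_and_c.sh above.
        double[] cSecs    = {0.39, 0.33, 0.33, 0.33, 0.04};
        double[] javaSecs = {1.91, 1.26, 1.27, 1.25, 0.72};

        for (int i = 0; i < tests.length; i++) {
            // Locale.ROOT keeps the decimal separator a '.' on every system.
            System.out.printf(Locale.ROOT, "%-18s %.1fx%n",
                    tests[i], slowdown(cSecs[i], javaSecs[i]));
        }
    }
}
```

By these figures the Java port takes roughly 4 to 5 times as long for analysis and generation, and about 18 times as long for post-generation.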


Features

  • Binary compatibility with lttoolbox. lttoolbox-java is able to _read_ and _write_ the binary files used by lttoolbox, and it generates exactly the same output.
  • There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.


Other notes

<Drew_> jacobEo: I can't find a main class in the source code, am I looking in the wrong place? :S
<jacobEo> Drew_: LTComp.java, LTExpand.java, LTProc.java

Thanks

  • Nic Cottrell contributed an initial version of a Java port of lttoolbox.
  • During GSOC 2009 Raphaël and Sergio worked on it, but processing still didn't work (compilation and expansion worked).
  • In November 2009 Jacob Nordfalk finished it and optimized it.