Difference between revisions of "Lttoolbox-java"
{{TOCD}}
Nic Cottrell contributed an initial version of a Java port of [[lttoolbox]]; this work needs to be completed.
== Prerequisites ==
* You don't need much knowledge of MT or NLP to work on lttoolbox-java, but you need to know C++ and Java and be able to debug both.
* You only have to understand what lt-expand, lt-comp and lt-proc do with a .dix file.
The [[lttoolbox]] tools 1) make binary files out of the .dix files (lt-comp), 2) analyse or generate text (lt-proc) and 3) expand a .dix file (lt-expand).
== Reasons for a Java port ==
Download preferably via [[SVN]]. If that fails, try [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/lttoolbox/] and [http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-tools/lttoolbox-java/] ("Download GNU tarball" will give a compressed archive).
* Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
* Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
* Transfer in bytecode promises a speedup of factor 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.
== State of the Java port ==
Please compile lttoolbox, apertium and a language pair of your choice. Then you have the setup needed to understand the role of lttoolbox.
j@j-laptop-nova:~/esperanto/apertium/lttoolbox-java/testdata/regression$ ./compare_java_and_c.sh
C analysis is... 0.41sec
OK
Java analysis is... 3.13sec
OK
C generator -g is ... 0.34sec
OK
Java generator -g is ... 2.31sec
OK
C generator -d is ... 0.32sec
OK
Java generator -d is ... 2.09sec
OK
C generator -n is ... 0.32sec
OK
Java generator -n is ... 2.56sec
OK
C postgenerator -p is ... 0.04sec
OK
Java postgenerator -p is ... 1.19sec
OK
All tests passed
--[[User:Jacob Nordfalk|Jacob Nordfalk]] 10:52, 24 November 2009 (UTC)
==Required==
* a test suite which runs on both lttoolbox (C++) and lttoolbox-java
* lttoolbox-java needs to at least be able to _read_ the binary files (see 2) above: analysing or generating text (lt-proc))
==Features==
* There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.
==Problems==
* It's almost line-for-line identical to the C++, aside from Java/C++ differences. But the languages do differ: C++, for example, has some methods that change simple-type variables passed in by reference. In Java, simple-type variables can only be passed by value, so the caller's value is not changed. That sort of thing needs to be sorted out.
* The biggest problem is the XML handling: the XML library used by the C++ code (libxml2) calls a callback method in the code both when it meets a START tag and when it meets an END tag.
** The Java XML library only calls the callback method at the START tag.
** Perhaps we could find another Java XML library that can also be made to call back for the end tags. Or some kind of wrapper in between could be made. Or you could use SAX and make your own callbacks.
* There might be other problems. The project just got stranded on the XML parsing part.
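The pass-by-value point in the list above can be illustrated with a small sketch. It is not taken from the lttoolbox-java sources: the class `RefParamSketch`, the holder `IntRef` and both `advance...` methods are hypothetical names, chosen only to show one common way of emulating a C++ `int &` parameter in Java.

```java
public class RefParamSketch {
    /** Hypothetical mutable holder standing in for a C++ "int &" parameter. */
    static final class IntRef {
        int value;
        IntRef(int value) { this.value = value; }
    }

    // A naive port of a C++ method such as "void advance(int &pos)":
    // the int is copied, so the increment is lost when the method returns.
    static void advanceByValue(int pos) {
        pos++;
    }

    // Port using the holder: the caller and the method share one object,
    // so the increment is visible to the caller.
    static void advanceByRef(IntRef pos) {
        pos.value++;
    }

    public static void main(String[] args) {
        int a = 5;
        advanceByValue(a);

        IntRef b = new IntRef(5);
        advanceByRef(b);

        System.out.println(a + " " + b.value); // prints "5 6"
    }
}
```

Other options are returning the new value from the method or wrapping the variable in a one-element array; which one fits best depends on how closely the Java code should mirror the C++.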
=== XML Handling ===
As lttoolbox parses the same files as [[Apertium-dixtools]], it might be an idea to use dixtools to do the parsing. However, the XML handling in dixtools is in need of improvement (see [[Apertium-dixtools#Wishlist_and_notes_for_Apertium-dixtools]]).
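For reference, the SAX API that ships with the JDK (`org.xml.sax`) reports both start and end tags through `startElement` and `endElement`. The following is a minimal sketch, assuming the standard JDK parser is acceptable; the class name `SaxStartEndSketch`, the method `events` and the tiny dictionary-like fragment are made up for illustration and are not part of lttoolbox-java.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxStartEndSketch {
    /** Parses the given XML and records one entry per START and END tag. */
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("START " + qName); // called when an opening tag is met
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("END " + qName);   // called when a closing tag is met
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        // Made-up fragment, only loosely shaped like a .dix entry.
        for (String event : events("<e><p><l>cat</l><r>cat</r></p></e>")) {
            System.out.println(event);
        }
    }
}
```

A handler of this shape keeps the callback structure of the C++ code, whereas a DOM-style library would mean restructuring the parsing code around a tree.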
[21:02:23] Apertium Java-lttoolboc Nic Cottrell: I would recommend dom4j
[21:02:43] Jacob Nordfalk: Why would you that?
[21:02:55] Apertium Java-lttoolboc Nic Cottrell: which lets you load the whole xml file into a dom tree and then you can do searches and manipulations very easily
[21:04:13] Jacob Nordfalk: yes, Nic, but what is needed is either to rewrite the code completely or somehow get a callback when encountering an END tag.
[21:04:30] … and as far as I understand that's possible with SAX
[21:04:32] Apertium Java-lttoolboc Nic Cottrell: Yes, exactly
[21:04:44] … Oh, ok. then that's probably the fastest way to make it work
[21:04:49] Jacob Nordfalk: dom4j != SAX :-)
[21:05:04] … OK, so we agree :-)
[21:05:07] Apertium Java-lttoolboc Nic Cottrell: but I personally believe that dom4j gives better code readability and flexibility for later on
[21:05:40] Jacob Nordfalk: yes, you might be right.
[21:06:20] … it's a question of how much the two sets (C++ and Java) should differ.
<jacobEo> Drew_: LTComp.java, LTExpand.java, LTProc.java
[21:08:21] Jacob Nordfalk: So, Nic, how much time do you probably have the next months? Would you like to be a co-mentor on this, or would you like to just occasionally be informed about progress?
[21:08:58] Apertium Java-lttoolboc Nic Cottrell: Well, I would love to be a co-mentor, but I fear that I might not be able to give enough time to perform that role
[21:09:12] … But I would definitely like to be in the loop and can jump in to help when I can
</pre> |
</pre> |
Revision as of 10:52, 24 November 2009
Nic Cottrell contributed an initial version of a Java port of lttoolbox; this work needs to be completed.
What is lttoolbox
The lttoolbox tools 1) make binary files out of the .dix files (lt-comp), 2) analyse or generate text (lt-proc) and 3) expand a .dix file (lt-expand).
Reasons for a Java port
- There are several devices (mobile phones, for example) which can run quite complicated software, but only if written in Java. lttoolbox is the first step to having Apertium run on these devices.
- Windows port. It won't be as powerful as on a Unix-based system, but it will be there.
- Apertium will be the first MT system *ever* that can be demonstrated within a Java applet.
- Transfer in bytecode promises a speedup of factor 4 compared to what we use now (interpreted XML), and transfer dominates CPU usage when processing large amounts of text.
State of the Java port
j@j-laptop-nova:~/esperanto/apertium/lttoolbox-java/testdata/regression$ ./compare_java_and_c.sh
C analysis is... 0.41sec
OK
Java analysis is... 3.13sec
OK
C generator -g is ... 0.34sec
OK
Java generator -g is ... 2.31sec
OK
C generator -d is ... 0.32sec
OK
Java generator -d is ... 2.09sec
OK
C generator -n is ... 0.32sec
OK
Java generator -n is ... 2.56sec
OK
C postgenerator -p is ... 0.04sec
OK
Java postgenerator -p is ... 1.19sec
OK
All tests passed
--Jacob Nordfalk 10:52, 24 November 2009 (UTC)
Features
- Binary compatibility with lttoolbox. lttoolbox-java is able to _read_ and _write_ the binary files of lttoolbox, and generates exactly the same output.
- There is a comprehensive test suite that tests both lttoolbox (C++) and lttoolbox-java.
Other notes
<Drew_> jacobEo: I can't find a main class in the source code, am I looking in the wrong place? :S
<jacobEo> Drew_: LTComp.java, LTExpand.java, LTProc.java