Apertium-dixtools
Latest revision as of 01:29, 26 October 2018


Apertium-dixtools is a shell command that runs a collection of tools written in Java. These tools perform automated processing on a dictionary file (or sometimes on several dictionary files).

Usage:

apertium-dixtools task options parameters (the parameters are mainly dictionary file names)

The available tasks include:

  • Format dictionaries: formats word definitions/translations, one word per line.
  • Sort a dictionary: puts entries in alphabetical order, which makes words easier for a human developer to find.
  • Dictionary reader: lists the elements of a dictionary (lemmas, paradigms, definitions, ...).
  • Reverse a dictionary: not needed for translating in both directions, but can be useful for the next task.
  • Crossdics: builds a bilingual dictionary for languages A and C from the dictionaries for the A-B and B-C language pairs.
  • Dictionary coverage: produces statistics on how often the different words of a dictionary are used.
  • Autoconcord: makes bidixes agree with the monodices when the two languages differ (gender, number, ...).
  • Dixtools: Equivalent paradigms: finds unused paradigms, and paradigms that behave the same as another paradigm in a dictionary.
  • Dixtools: Merge dictionaries: merges the word lists of several monodices.
  • Dixtools: Enhance: interactively adds new words to the dictionary.
  • Dixtools: Grep: filters a dix by a specific paradigm or lemma and generates a new dix.


Installing Apertium-dixtools

Software prerequisites

You will need to install Ant and the Open Java Development Kit 7 (OpenJDK 7):

$ sudo apt-get install ant openjdk-7-jdk

Download

$ git clone https://github.com/apertium/apertium-dixtools

Compiling

$ cd apertium-dixtools
$ ant jar

You can also build and install using Maven 2 (http://maven.apache.org), by typing:

mvn install

Installing

Installing is actually optional. The program can also be executed using only the .jar file:

$ java -jar dist/apertium-dixtools.jar

If you do want to install it, the program will be installed in /usr/local/apertium-dixtools:

$ sudo ant install

Note:

  • If you pull from GitHub, it's always a good idea to run 'ant clean' first.
  • 'ant jar' also runs the program's own tests. These might fail if someone made changes without making sure that the tests still run. Just continue with the installation and report the test failures to the mailing list.
  • 'ant install' without sudo will say "apertium-dixtools was successfully installed!" even though it was not.
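
Since installing is optional, a small shell function can stand in for the installed command when running straight from the build tree. This is only a sketch: the function name dixtools and the DIXTOOLS_JAR variable are invented for this example, and the default path assumes the current directory is the apertium-dixtools checkout after 'ant jar'.

```shell
# Hypothetical wrapper: run apertium-dixtools from the built .jar file.
# DIXTOOLS_JAR is an assumed variable name, not something the project defines;
# when it is unset, the jar produced by 'ant jar' in the checkout is used.
dixtools() {
    java -jar "${DIXTOOLS_JAR:-dist/apertium-dixtools.jar}" "$@"
}
```

With this defined, 'dixtools sort' runs the same command as 'java -jar dist/apertium-dixtools.jar sort'.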

Problems

If you get a "The J2SE Platform is not correctly set up." error saying that the property "platforms.default_platform.home" is not found, then try

$ ant -Dplatforms.default_platform.home=/usr jar

or, if it says for example that the property "platforms.JDK_1.6.home" is not found and you want to point to a specific Java version, then try

$ ant -Dplatforms.JDK_1.6.home=/usr/lib/jvm/java-6-sun jar

(On Mac: if you want to put the full "/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/" (or whatever) path in there, first make a symlink from the .../1.6.0/Commands folder to .../1.6.0/bin, since ant expects javac to be in the bin-subdirectory of platforms...home)
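
Because ant expects javac in the bin subdirectory of the platform home, the value to pass can be derived from wherever javac lives. The helper below is a sketch of that idea; the function name jdk_home_from_javac is made up for this example.

```shell
# Print the JDK home for a given javac path by stripping the trailing
# /bin/javac, e.g. /usr/lib/jvm/java-6-sun/bin/javac -> /usr/lib/jvm/java-6-sun
jdk_home_from_javac() {
    dirname "$(dirname "$1")"
}
```

It could be used as ant -Dplatforms.default_platform.home="$(jdk_home_from_javac "$(command -v javac)")" jar. For a javac found at /usr/bin/javac this yields /usr, the value used in the first command above. If javac is a symlink, resolve it first when you want the home of a specific Java version.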

Testing

The testing is quite verbose. The output looks like:

-do-test-run:
    [junit] Testsuite: dictools.CrossDictTest
    [junit] [1] Loading bilingual AB (regression_test_data/crossdict/input/apertium-es-ca.es-ca.dix)
    [junit] Reading file regression_test_data/crossdict/input/apertium-es-ca.es-ca.dix

... 200 lines of text

    [junit] ------------- ---------------- ---------------
    [junit] Testsuite: misc.eoen.SubstractBidixTest
    [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.073 sec
    [junit] 
    [junit] ------------- Standard Error -----------------
    [junit] checkEarlierAndRestrict
    [junit] ------------- ---------------- ---------------
    [junit] checkEarlierAndRestrict

If the test fails

If you get something like

    [junit] ------------- ---------------- ---------------

test-report:

-post-test-run:

BUILD FAILED
/home/j/esperanto/apertium/apertium-dixtools/nbproject/build-impl.xml:595: Some tests failed; see details above.

... then the program was built correctly, but it didn't pass the tests. It can still be installed and will probably run fine.

The most probable reason (apart from someone having changed the program in a way that breaks the tests) is that you're not using a Unicode locale. Try writing

export LANG=en_US.UTF-8

(or some other installed Unicode locale - I use eo.UTF-8) and run the tests again. If that doesn't help, please report it on the mailing list.
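
Before re-running the tests, you can check whether the current locale is a UTF-8 one. The two helpers below are a sketch with made-up names; they only inspect the standard LC_ALL, LC_CTYPE and LANG variables and do not check that the locale is actually installed.

```shell
# Succeed if the given locale name uses a UTF-8 charset (e.g. en_US.UTF-8).
is_utf8_locale_name() {
    case $1 in
        *.UTF-8|*.utf8|*.UTF8|*.utf-8|*.UTF-8@*|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the locale name in effect for character handling:
# LC_ALL wins over LC_CTYPE, which wins over LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}"
}
```

For example, is_utf8_locale_name "$(effective_ctype)" || export LANG=en_US.UTF-8 sets LANG only when needed. Note that setting LANG has no effect while LC_ALL or LC_CTYPE is set to a non-UTF-8 value.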

Installing

$ sudo ant install

Using Apertium-dixtools

Usage: apertium-dixtools [task] [generic options] [task parameters] ...

 Tasks:
   cross:              cross 2 language pairs (using linguistic res. XML file - see Cross Model)
   cross-param:        cross 2 language pairs (using command line parameters) Crossdics
   merge-morph:        merges two morphological dictionaries (monodix) Merge dictionaries
   equiv-paradigms:    finds equivalent paradigms and updates references
   list:               lists entries in a dictionary - see Dictionary reader
   dix2trie:           create a Trie from an existing bilingual dictionary
   dix2tiny:           create data for mobile platforms (j2me, palm) from bidix
   reverse-bil:        reverses a bilingual dictionary
   sort:               sorts (and groups by category) a dictionary - see Sort a dictionary
   format:             Format dictionaries (according to Generic Options)
   fix:                fix a dictionary (remove duplicates, convert spaces)

For help on a task, invoke it without parameters


Generic options: (mostly for tasks that outputs dix files)

    -debug              print extra debugging information
    -noProcComments     don't add processing comments (telling what was done)
    -stripEmptyLines    removes empty lines (originating from original file)
    -alignBidix         align a bidix (<p> or <i> at col 10, <r> at col 55)
    -alignMonodix       align a monodix (pardef 10, 30, other entries 25, 45)
    -align [[E] P R]    custom align (default <p>/<i> at col 10, <r> at col 55)
    -alignpardef [[E] P R] paradigm alignment (if differ from general align)

  Any -align option implies 'compact output style' (one dict entry per line)
  otherwise output is noncompact XML style (one tag per line, lots of indents)

  Use - as file name for piping (read/write .dix files on standard input/output)
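
Since each task prints its own help when invoked without parameters, the per-task help can be collected in one go. The function below is a sketch: its name is made up, and the task list is simply copied from the usage text above, so other builds may offer more or fewer tasks.

```shell
# Task names copied from the usage text above; other builds may differ.
DIXTOOLS_TASKS="cross cross-param merge-morph equiv-paradigms list dix2trie dix2tiny reverse-bil sort format fix"

# Print the help of every task by invoking each one without parameters.
# $1 is the command to run (defaults to apertium-dixtools).
dixtools_help_all() {
    cmd=${1:-apertium-dixtools}
    for task in $DIXTOOLS_TASKS; do
        printf '== %s ==\n' "$task"
        # Keep going even if a task exits with an error after printing help.
        "$cmd" "$task" 2>&1 || true
    done
}
```

Running dixtools_help_all > dixtools-help.txt would save all the help texts to one file for later reference.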

See also Crossdics

Notes for developers

Wishlist and notes for Apertium-dixtools

There should be many more options, and ALL sub-commands should take a -fmt parameter where all could be specified:

  • multiwords -- one line or many lines
  • multiwords -- should they be separated

(because sometimes with complex multiwords you want to have them laid out differently and apart e.g. you have a section for verbs and it has first "simple" verbs, then it has the multiword verbs)

  • multiwords -- the simple verbs are one per line
  • multiwords -- and the multiword verbs are over several lines