Getting started with induction tools

⚠ WARNING ⚠

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.


Installing the necessary programs

Prerequisite Ubuntu packages

This list is, unfortunately, not complete; I'll attempt to add a full one once I've set this up on a clean Ubuntu install. The following packages are all required, though:

$ sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics

The Crossdics page explains the installation correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03

If in doubt about the exact version on your system, run ls /usr/lib/jvm/; it may differ from the one above.
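
The lookup can also be scripted. This is a minimal sketch, assuming JVMs live under /usr/lib/jvm as in the example above; the function name list_jvms is made up for illustration.

```shell
# List candidate JAVA_HOME directories: every entry under the JVM root
# that contains an executable bin/java.
# The default root /usr/lib/jvm is an assumption taken from the example above.
list_jvms() {
    root=${1:-/usr/lib/jvm}
    for dir in "$root"/*; do
        if [ -x "$dir/bin/java" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# Usage: pick one of the printed directories and export it, e.g.
#   list_jvms
#   export JAVA_HOME=<one of the listed directories>
```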

Installing lttoolbox

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Note where lttoolbox.pc gets placed, in case autogen.sh doesn't find lttoolbox.
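
When a later autogen.sh or configure run cannot find lttoolbox, a common cause is that pkg-config does not search the directory holding the .pc file. The sketch below assumes an install under /usr/local (the usual default prefix for ./configure) and a .pc filename that starts with lttoolbox; the helper name pc_dir_for is hypothetical.

```shell
# Print the directory of the first lttoolbox*.pc file found under a prefix.
# Returns non-zero if no such file exists.
pc_dir_for() {
    prefix=${1:-/usr/local}
    pc_file=$(find "$prefix" -name 'lttoolbox*.pc' 2>/dev/null | head -n 1)
    [ -n "$pc_file" ] && dirname "$pc_file"
}

# Usage: prepend that directory to pkg-config's search path.
#   dir=$(pc_dir_for /usr/local)
#   export PKG_CONFIG_PATH="$dir${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
```

PKG_CONFIG_PATH is the standard pkg-config search-path variable; the export only affects the current shell session.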

Installing Apertium

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos

$ svn co https://retratos.svn.sourceforge.net/svnroot/retratos/trunk retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

(In case it doesn't find lttoolbox, see this workaround.)

Installing mkcls

Ubuntu:

Get the deb package appropriate for your version of Ubuntu from ubuntu-nlp.

$ sudo dpkg --install mkcls*.deb

Mac:

Get giza-pp from http://code.google.com/p/giza-pp/, then

$ tar xvzf giza-pp-VERSION.tar.gz
$ cd giza-pp/mkcls-v2
$ make
$ sudo cp mkcls /usr/bin/

Installing GIZA++

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. A binary built this way needs its input files prepared differently; otherwise it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are instructions for compiling a version that works with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cp Makefile Makefile.orig
$ sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g' Makefile
$ make
$ sudo make install
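
To see what the sed command in the build steps does, it can be run on single lines instead of the whole Makefile. The two input lines below are made up for illustration and are not copied from the real Makefile; patch_makefile_line is a hypothetical name.

```shell
# Same sed expression as in the build steps, applied to stdin instead of Makefile:
# drop the first " -DBINARY_SEARCH_FOR_TTABLE" on each line, and turn every
# "mkdir" into "mkdir -p".
patch_makefile_line() {
    sed 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g'
}

# Hypothetical input lines, for illustration only.
printf '%s\n' \
    'CFLAGS = -O3 -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE' \
    'mkdir optimized' | patch_makefile_line
```

The first line comes out without the flag and the second gains -p, so make no longer stops when a build directory already exists.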

Creating bilingual dictionaries

Obtaining corpora (and getAlignmentWithText.pl)

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz
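
For another language pair, the three URLs presumably follow the same pattern. This sketch assumes the naming scheme seen in the example (jrc-LANG1-LANG2.xml.gz and jrc-LANG.tgz) holds for every pair, which should be checked by browsing the directories first; jrc_urls is a hypothetical helper.

```shell
# Print the alignment and corpus URLs for a language pair.
# The filename pattern is inferred from the en/it example and is an assumption.
jrc_urls() {
    base=http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0
    printf '%s\n' \
        "$base/alignments/jrc-$1-$2.xml.gz" \
        "$base/corpus/jrc-$1.tgz" \
        "$base/corpus/jrc-$2.tgz"
}

# Usage: download everything for English-Italian.
#   jrc_urls en it | xargs -n 1 wget
```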

The corpora need to be unpacked and moved into a new, common directory.

$ tar xvzf jrc-en.tgz 
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis

Running getAlignmentWithText.pl

The alignment file was downloaded gzip-compressed, so decompress it first.

$ gunzip jrc-en-it.xml.gz
$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file

At this point, there will be an alignment file containing a few million lines of XML. It refers frequently to s1 and s2: s1 is the first language in the filename jrc-lang1-lang2.xml (jrc-en-it.xml in this example, hence English) and s2 is the second (Italian). The content of each should be extracted to its own file.

$ grep s1 alignment > en
$ grep s2 alignment > it
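
The later steps treat line N of one file as the translation of line N of the other, so the two files should contain the same number of lines. A quick check could look like the sketch below; same_length is a hypothetical helper.

```shell
# Succeed only if both files have the same number of lines.
same_length() {
    a=$(wc -l < "$1")
    b=$(wc -l < "$2")
    [ "$a" -eq "$b" ]
}

# Usage:
#   same_length en it || echo "en and it differ in length" >&2
```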

An optional step may be useful. If you're just testing whether it works, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other numbers are also fine, but use the same number for both files.

 
$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000
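
To try other sizes, the line count can be kept in one place so that both files always get the same value. A sketch; make_subsets is a hypothetical helper, and the output names follow the FILE.N pattern used above.

```shell
# Write the first N lines of each given file to FILE.N.
make_subsets() {
    n=$1
    shift
    for f in "$@"; do
        head -n "$n" "$f" > "$f.$n"
    done
}

# Usage:
#   make_subsets 10000 en it    # creates en.10000 and it.10000
```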

Running plain2snt.out

The filenames will vary depending on whether or not the optional step was done.

$ plain2snt.out en.10000 it.10000
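
The GIZA++ step needs the names of the files that plain2snt.out writes. Going by the GIZA++ command in the next section, they appear to be derived from the input names as sketched here. This is inferred only from the example names on this page, so check which files were actually created; snt_names is a hypothetical helper, and plain2snt.out may write further files that are not listed.

```shell
# Print the two vocabulary files and the sentence file that the GIZA++
# command on this page expects, given the two plain2snt.out inputs.
# The naming rule is an assumption based on the en.10000/it.10000 example.
snt_names() {
    printf '%s\n' "${1}.vcb" "${2}.vcb" "${1}_${2}.snt"
}

# Usage:
#   snt_names en it    # names to look for when the optional subset step was skipped
```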

Running GIZA++

Main article: Using GIZA++

$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute and the input files were not tiny, check the log; it almost certainly failed.

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.
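
A rough success check can be scripted. The sketch below relies only on what is stated above: a successful run leaves a PREFIX.A3.final file, and the log is named PREFIX.log as in the command shown. giza_ok is a hypothetical helper, and a non-empty output file does not prove the alignments are good.

```shell
# Succeed if PREFIX.A3.final exists and is non-empty; otherwise print the
# end of PREFIX.log (if there is one) to stderr and fail.
giza_ok() {
    if [ -s "$1.A3.final" ]; then
        return 0
    fi
    if [ -f "$1.log" ]; then
        tail -n 20 "$1.log" >&2
    fi
    return 1
}

# Usage:
#   giza_ok dictionary || echo "GIZA++ probably failed" >&2
```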

The first 3 lines of dictionary.A3.final for the example corpus are shown below. The alignments are not particularly good, probably because the corpus is small.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })
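
Each sentence pair in the A3 file takes three lines: a comment line, one sentence as plain text, and the other sentence with each word followed by ({ ... }) listing the positions of its aligned words in the plain sentence. The sketch below turns this into tab-separated word pairs. It assumes whitespace-separated tokens and exactly three lines per pair; a3_pairs is a hypothetical helper, not part of GIZA++.

```shell
# Print "annotated-word<TAB>plain-word" for every alignment link in an A3 file.
# Links from the NULL pseudo-word are skipped.
a3_pairs() {
    awk '
    NR % 3 == 2 { split($0, plain, " ") }     # line 2: plain sentence
    NR % 3 == 0 {                             # line 3: annotated sentence
        n = split($0, tok, " ")
        i = 1
        while (i <= n) {
            word = tok[i]
            i += 2                            # skip the word and the "({" token
            while (i <= n && tok[i] != "})") {
                if (word != "NULL") print word "\t" plain[tok[i]]
                i++
            }
            i++                               # skip the "})" token
        }
    }' "$@"
}

# Usage: show the most frequent word pairs.
#   a3_pairs dictionary.A3.final | sort | uniq -c | sort -rn | head
```

On the example above, the entry Swiss ({ 16 }) would be printed as Swiss paired with the 16th token of the Italian line. Tags such as <s1> stay attached to the first and last words, because the input files were not stripped of them.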

Acknowledgements

This page wouldn't be possible without the kind assistance of 'spectie' on OFTC's #apertium. That said, all errors/typos are mine, not his.
