Getting started with induction tools

From Apertium
Revision as of 12:02, 28 March 2008 by Francis Tyers (talk | contribs)

These are partial directions for getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. They may also be useful as a prerequisite to following the Apertium_New_Language_Pair_HOWTO.

A few steps are Ubuntu-specific. Everything except mkcls is built from source (from SVN, apart from GIZA++, which comes from a release tarball).

Installing the necessary programs

Prerequisite Ubuntu packages

This list is, unfortunately, incomplete; I'll attempt to complete it once I've set this up on a clean Ubuntu install. The following packages are all required, though:

$ sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics

The Crossdics page explains the installation correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03

If in doubt about the exact version on your system, run ls /usr/lib/jvm/; the directory name may differ from the one above.


Installing Apertium

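Apertium depends on lttoolbox, so if ./configure complains that lttoolbox is missing, install lttoolbox (next section) first.

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install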

Installing lttoolbox

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing mkcls

Get the deb package appropriate for your version of Ubuntu from ubuntu-nlp.

$ sudo dpkg --install mkcls*.deb

Installing GIZA++

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. With that option the input files must be prepared differently; otherwise GIZA++ dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are instructions for compiling a version that works with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ sed -e 's/-DBINARY_SEARCH_FOR_TTABLE//' -e 's/mkdir/mkdir -p/g' Makefile > tmp
$ mv Makefile Makefile.orig
$ mv tmp Makefile
$ make
$ sudo make install
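
To see what the two substitutions do before touching the real Makefile, they can be tried on a throwaway file. This is only a sketch: the Makefile fragment below is made up for illustration and is not copied from the GIZA++ sources.

```shell
# Build a made-up Makefile fragment (hypothetical content) in a scratch directory.
workdir=$(mktemp -d)
printf 'CFLAGS_OPT = -O3 -DBINARY_SEARCH_FOR_TTABLE\ninstall:\n\tmkdir $(INSTALLDIR)\n' > "$workdir/Makefile"

# Same two expressions as in the build steps: drop the define,
# and make mkdir tolerate directories that already exist.
sed -e 's/-DBINARY_SEARCH_FOR_TTABLE//' -e 's/mkdir/mkdir -p/g' "$workdir/Makefile" > "$workdir/Makefile.new"

# Show what changed; diff exits non-zero when the files differ, which is expected here.
diff "$workdir/Makefile" "$workdir/Makefile.new" || true
```

Running the same diff against Makefile.orig and the patched Makefile is a quick way to confirm that only the intended lines changed.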


Creating bilingual dictionaries

Obtaining corpora (and getAlignmentWithText.pl)

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz

The alignment file needs to be decompressed, and the corpora need to be untarred and moved into a new, common directory.

$ gunzip jrc-en-it.xml.gz
$ tar xvzf jrc-en.tgz
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis

Running getAlignmentWithText.pl

$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file

At this point, there will be an alignment file containing a few million lines of XML. It refers frequently to s1 (the first language of the two in the filename jrc-lang1-lang2.xml; the file is jrc-en-it.xml in this example, so s1 is English) and s2 (Italian). The sentences of each language should be extracted to their own file.

$ grep s1 alignment > en
$ grep s2 alignment > it
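
Note that grep s1 matches any line containing the string s1, not only lines that carry an s1 element. The sketch below shows the difference on a miniature alignment file; the sample content is invented for illustration, and the real file's layout may differ in detail.

```shell
# Made-up miniature of the alignment file (hypothetical content).
workdir=$(mktemp -d)
cat > "$workdir/alignment" <<'EOF'
<link type="1-1">
<s1>ADDITIONAL AGREEMENT</s1>
<s2>ACCORDO COMPLEMENTARE</s2>
</link>
<link type="1-1">
<s1>Form 1 attached</s1>
<s2>Modulo s1 allegato</s2>
</link>
EOF

# The bare pattern also catches the Italian line that merely mentions "s1".
loose=$(grep -c 's1' "$workdir/alignment")

# Matching the opening tag keeps only the English sentences.
strict=$(grep -c '<s1>' "$workdir/alignment")
grep '<s1>' "$workdir/alignment" > "$workdir/en"
grep '<s2>' "$workdir/alignment" > "$workdir/it"

echo "bare pattern: $loose lines, tag pattern: $strict lines"
```

If the two counts differ on the real alignment file, the tag pattern is probably the safer choice, since both output files need exactly one sentence per line.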

An optional step may be useful. If you're just testing whether everything works, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other sizes are also fine, but use the same number of lines for both files.

$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000
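
Since the two subsets have to stay line-aligned, it may be worth checking their lengths after cutting them. A minimal sketch, using generated stand-in files instead of the real corpus (the file contents and the variable names are assumptions for illustration):

```shell
# Stand-in files (hypothetical content) so the sketch runs anywhere.
workdir=$(mktemp -d)
seq 1 25 | sed 's/^/english sentence /' > "$workdir/en"
seq 1 25 | sed 's/^/frase italiana /' > "$workdir/it"

n=10   # subset size; 10000 in the text above

head -n "$n" "$workdir/en" > "$workdir/en.$n"
head -n "$n" "$workdir/it" > "$workdir/it.$n"

# Compare the number of lines in the two subsets.
en_count=$(wc -l < "$workdir/en.$n" | tr -d ' ')
it_count=$(wc -l < "$workdir/it.$n" | tr -d ' ')
if [ "$en_count" -eq "$it_count" ]; then
    subset_status="ok"
else
    subset_status="mismatch"
fi
echo "$subset_status: $en_count / $it_count lines"
```

A mismatch would usually mean that one of the input files is shorter than the requested subset size.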

Running plain2snt.out

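The filenames depend on whether the optional step above was done. plain2snt.out writes the vocabulary files (.vcb) and the sentence file (.snt) that GIZA++ reads in the next step.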

$ plain2snt.out en.10000 it.10000

Running GIZA++

$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute and the input files were not tiny, check the log; it almost certainly failed.

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.
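
One way to tell the two outcomes apart is to test for the alignment file and fall back to the end of the log when it is missing. The sketch below assumes the output prefix from the command above; check_run is a hypothetical helper, not part of GIZA++, and the demonstration files are stand-ins.

```shell
# check_run is a hypothetical helper: succeed only if the alignment file was written.
check_run() {
    prefix=$1
    if [ -s "$prefix.A3.final" ]; then
        echo "ok: $prefix.A3.final was written"
        return 0
    fi
    echo "no $prefix.A3.final; last lines of $prefix.log:" >&2
    tail -n 5 "$prefix.log" >&2
    return 1
}

# Demonstration on stand-in files: first a failed run, then a successful one.
workdir=$(mktemp -d)
printf 'ERROR: NO COOCURRENCE FILE GIVEN!\n' > "$workdir/dictionary.log"
first=0
check_run "$workdir/dictionary" || first=$?

printf '# Sentence pair (1) ...\n' > "$workdir/dictionary.A3.final"
second=0
check_run "$workdir/dictionary" || second=$?

echo "before: $first, after: $second"
```

After a real run, calling check_run dictionary in the working directory would perform the same check.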

The first 3 lines of dictionary.A3.final for the example corpus are shown below. The alignments are not particularly good, probably because the corpus is small.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })
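
The numbers inside ({ }) are positions in the sentence on the second line, so the file can be turned into a rough list of word pairs. The following is a sketch only, not an official GIZA++ tool: it assumes every sentence pair occupies exactly three lines as in the excerpt above, and it runs on a made-up two-sentence sample.

```shell
# Made-up sample in the same three-line layout as dictionary.A3.final.
workdir=$(mktemp -d)
cat > "$workdir/sample.A3.final" <<'EOF'
# Sentence pair (1) source length 3 target length 3 alignment score : 0.001
la casa rossa
NULL ({ }) the ({ 1 }) red ({ 3 }) house ({ 2 })
# Sentence pair (2) source length 2 target length 2 alignment score : 0.002
la casa
NULL ({ }) the ({ 1 }) house ({ 2 })
EOF

pairs=$(awk '
# Second line of each triple: the sentence whose positions are referenced.
NR % 3 == 2 { split($0, tgt, " ") }
# Third line: word ({ positions }) word ({ positions }) ...
NR % 3 == 0 {
    i = 1
    while (i <= NF) {
        word = $i
        i += 2                      # skip the word and the opening "({"
        while (i <= NF && $i != "})") {
            # Words aligned to NULL have no counterpart, so they are left out.
            if (word != "NULL") print word "\t" tgt[$i + 0]
            i++
        }
        i++                         # skip the closing "})"
    }
}
' "$workdir/sample.A3.final")

# Count how often each pair occurs, most frequent first.
printf '%s\n' "$pairs" | sort | uniq -c | sort -rn
```

On real output the s1 and s2 tags stay attached to the first and last word of each sentence, as in the excerpt above, so stripping them before the extraction step may give cleaner pairs.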

Acknowledgements

This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.