Difference between revisions of "Getting started with induction tools"

Revision as of 22:54, 9 March 2019

⚠ WARNING ⚠

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.

En français

Installing the necessary programs

Prerequisite Ubuntu packages

This is, unfortunately, not a complete list; I'll attempt to add one once I've set this up on a clean ubuntu install. However, the following should still be of some help; all are required:

sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics

Crossdics explains this correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03

ls /usr/lib/jvm/ if in doubt as to what the exact version on your system is; it may be different than the above.

Installing lttoolbox

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Note where lttoolbox.pc gets placed, in case autogen.sh doesn't find lttoolbox.

Installing Apertium

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos

$ svn co https://retratos.svn.sourceforge.net/svnroot/retratos/trunk retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

(In case it doesn't find lttoolbox, see this workaround.)

Installing mkcls

Ubuntu:

Get the deb package appropriate for your version of ubuntu at ubuntu-nlp.

$ dpkg --install mkcls*.deb

Mac:

Get giza-pp from http://code.google.com/p/giza-pp/, then

$ tar xvzf giza-pp-VERSION.tar.gz
$ cd giza-pp/mkcls-v2
$ make
$ sudo cp mkcls /usr/bin/

Installing GIZA++

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. You need to prepare the input files differently for this, or it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are the instructions on compiling a version that will work with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cp Makefile Makefile.orig
$ sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g' Makefile
$ make
$ sudo make install

Creating bilingual dictionaries.

Obtaining corpora (and getAlignmentWithText.pl)

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz

The corpora need to be untarred, and inserted into a new, common directory.

$ tar xvzf jrc-en.tgz 
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis

Running getAlignmentWithText.pl

$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file

At this point, there will be an alignment file containing a few million lines of xml. It will refer frequently to s1 (the first language of the two in the filename jrc-lang1-lang2.xml, which is jrc-en-it.xml in this example, hence, English) and s2 (Italian). The content of each should be extracted to its own file.

$ grep s1 alignment > en
$ grep s2 alignment > it

An optional step may be useful. If you're just testing to see if it's working, or have an old machine, it's better to only use a subset of the data, so that it can be analyzed in a couple of hours, rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 Ghz Athlon with 512 megs of RAM. Other numbers are also fine, but use the same number for both files.

 
$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000

Running plain2snt.out

The filenames will vary depending on whether or not the optional step was done.

$ plain2snt.out en.10000 it.10000

Running GIZA++

Main article: Using GIZA++

$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute, unless the input files were tiny, check the log - it almost certainly failed.

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.

The first 3 lines of dictionary.A3.final, on the example corpus, are the following. They are apparently not particularly good, probably due to the small corpus.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i 
prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) 
clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) 
and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })

Acknowledgements

This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.

@@ Line 1: / Line 1: @@
+{{Github-migration-check}}
-{{github}}
 [[Comment démarrer avec les outils d'induction|En français]]

Difference between revisions of "Getting started with induction tools"

Revision as of 22:54, 9 March 2019

Contents

Installing the necessary programs

Prerequisite Ubuntu packages

Installing Crossdics

Installing lttoolbox

Installing Apertium

Installing ReTraTos

Installing mkcls

Ubuntu:

Mac:

Installing GIZA++

Creating bilingual dictionaries.

Obtaining corpora (and getAlignmentWithText.pl)

Running getAlignmentWithText.pl

Creating subsets of the alignment file

Running plain2snt.out

Running GIZA++

See also

Acknowledgements

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools