Difference between revisions of "Getting started with induction tools"

From Apertium
Jump to navigation Jump to search
(One intermediate revision by the same user not shown)
Line 1: Line 1:
[[Comment démarrer avec les outils d'induction|En français]]
[[Comment démarrer avec les outils d'induction|En français]]
Line 170: Line 170:
== Acknowledgements ==
== Acknowledgements ==
This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.
This page wouldn't be possible without the kind assistance of 'spectie' on OFTC's #apertium. That said, all errors/typos are mine, not his.

Latest revision as of 02:52, 20 May 2021


This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.

En français

These are partial directions to getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. This may also be useful as a prerequisite to following the Apertium_New_Language_Pair_HOWTO.

See Installation troubleshooting for what to do on errors.

A few steps are ubuntu-specific. Everything except mkcls is built from the SVN sources.

Installing the necessary programs[edit]

Prerequisite Ubuntu packages[edit]

This is, unfortunately, not a complete list; I'll attempt to add one once I've set this up on a clean ubuntu install. However, the following should still be of some help; all are required:

sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics[edit]

Crossdics explains this correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-

ls /usr/lib/jvm/ if in doubt as to what the exact version on your system is; it may be different than the above.

Installing lttoolbox[edit]

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Note where lttoolbox.pc gets placed, in case autogen.sh doesn't find lttoolbox.

Installing Apertium[edit]

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos[edit]

$ svn co https://retratos.svn.sourceforge.net/svnroot/retratos/trunk retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

(In case it doesn't find lttoolbox, see this workaround.)

Installing mkcls[edit]


Get the deb package appropriate for your version of ubuntu at ubuntu-nlp.

$ dpkg --install mkcls*.deb


Get giza-pp from http://code.google.com/p/giza-pp/, then

$ tar xvzf giza-pp-VERSION.tar.gz
$ cd giza-pp/mkcls-v2
$ make
$ sudo cp mkcls /usr/bin/

Installing GIZA++[edit]

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. You need to prepare the input files differently for this, or it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are the instructions on compiling a version that will work with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cp Makefile Makefile.orig
$ sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g' Makefile
$ make
$ sudo make install

Creating bilingual dictionaries.[edit]

Obtaining corpora (and getAlignmentWithText.pl)[edit]

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz

The corpora need to be untarred, and inserted into a new, common directory.

$ tar xvzf jrc-en.tgz 
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis

Running getAlignmentWithText.pl[edit]

$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file[edit]

At this point, there will be an alignment file containing a few million lines of xml. It will refer frequently to s1 (the first language of the two in the filename jrc-lang1-lang2.xml, which is jrc-en-it.xml in this example, hence, English) and s2 (Italian). The content of each should be extracted to its own file.

$ grep s1 alignment > en
$ grep s2 alignment > it

An optional step may be useful. If you're just testing to see if it's working, or have an old machine, it's better to only use a subset of the data, so that it can be analyzed in a couple of hours, rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 Ghz Athlon with 512 megs of RAM. Other numbers are also fine, but use the same number for both files.

$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000

Running plain2snt.out[edit]

The filenames will vary depending on whether or not the optional step was done.

$ plain2snt.out en.10000 it.10000

Running GIZA++[edit]

Main article: Using GIZA++
$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute, unless the input files were tiny, check the log - it almost certainly failed.

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.

The first 3 lines of dictionary.A3.final, on the example corpus, are the following. They are apparently not particularly good, probably due to the small corpus.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i 
prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) 
clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) 
and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })

See also[edit]


This page wouldn't be possible without the kind assistance of 'spectie' on OFTC's #apertium. That said, all errors/typos are mine, not his.