Apertium has moved from SourceForge to GitHub.
If you have any questions, please come and talk to us on #apertium on irc.freenode.net or contact the GitHub migration team.

Getting started with induction tools

Latest revision as of 23:54, 9 March 2019

WARNING

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.

These are partial directions for getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. They may also be useful as a prerequisite to following the Apertium New Language Pair HOWTO.

See Installation troubleshooting for what to do on errors.

A few steps are Ubuntu-specific. Everything except mkcls is built from the SVN sources.

Installing the necessary programs

Prerequisite Ubuntu packages

This is, unfortunately, not a complete list; I'll attempt to add one once I've set this up on a clean Ubuntu install. The following should still be of some help, and all of these packages are required:

sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics

The Crossdics page explains the installation correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03

Run ls /usr/lib/jvm/ if in doubt about the exact version on your system; it may differ from the one above.
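
Instead of typing the JDK directory by hand, the value can usually be derived from where the javac binary lives. This is only a sketch: it assumes javac is on the PATH and that readlink -f is available (as on Ubuntu), and jdk_home_of is a made-up helper name, not part of any Apertium tool.

```shell
# jdk_home_of PATH_TO_JAVAC: print the JDK directory two levels above bin/javac,
# after resolving symlinks. (Hypothetical helper, for illustration only.)
jdk_home_of() {
  dirname "$(dirname "$(readlink -f "$1")")"
}

# Only set JAVA_HOME when a javac binary is actually found on the PATH.
if command -v javac >/dev/null 2>&1; then
  export JAVA_HOME="$(jdk_home_of "$(command -v javac)")"
  echo "JAVA_HOME=$JAVA_HOME"
fi
```

If nothing is printed, javac was not found, and the manual export shown above is still the way to go.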

Installing lttoolbox

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Note where lttoolbox.pc gets placed, in case autogen.sh doesn't find lttoolbox.
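
If a later autogen.sh or configure run cannot find lttoolbox, pointing pkg-config at the directory holding the .pc file usually helps. The following is a sketch under assumptions: it searches the default /usr/local prefix, it allows for a version suffix in the file name (which some installs may have), and pc_dir_of is a hypothetical helper.

```shell
# pc_dir_of ROOT: print the directory of the first lttoolbox*.pc file under ROOT.
# Returns non-zero if no such file is found. (Hypothetical helper.)
pc_dir_of() {
  pc_file="$(find "$1" -name 'lttoolbox*.pc' 2>/dev/null | head -n 1)"
  [ -n "$pc_file" ] && dirname "$pc_file"
}

# Assumption: lttoolbox was installed under /usr/local (the default prefix).
pc_dir="$(pc_dir_of /usr/local)" || pc_dir=""
if [ -n "$pc_dir" ]; then
  # Prepend the directory, keeping any existing PKG_CONFIG_PATH entries.
  export PKG_CONFIG_PATH="$pc_dir${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
  echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
fi
```

The export only lasts for the current shell session, so it has to be repeated (or added to a shell startup file) before building Apertium or ReTraTos in a new terminal.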

Installing Apertium

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos

$ svn co https://retratos.svn.sourceforge.net/svnroot/retratos/trunk retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

(In case it doesn't find lttoolbox, see this workaround.)

Installing mkcls

Ubuntu:

Get the deb package appropriate for your version of Ubuntu at ubuntu-nlp.

$ sudo dpkg --install mkcls*.deb

Mac:

Get giza-pp from http://code.google.com/p/giza-pp/, then

$ tar xvzf giza-pp-VERSION.tar.gz
$ cd giza-pp/mkcls-v2
$ make
$ sudo cp mkcls /usr/bin/

Installing GIZA++

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. With that option the input files have to be prepared differently, or GIZA++ dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to prepare them that way, so here are instructions for compiling a version that works with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cp Makefile Makefile.orig
$ sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g' Makefile
$ make
$ sudo make install
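
Before moving on, it can be worth confirming that the binaries used in the rest of this page are on the PATH. A small sketch: missing_tools is a hypothetical helper, and the tool names are simply the ones invoked elsewhere on this page.

```shell
# missing_tools NAME...: print each command that is not on the PATH.
# The exit status is non-zero if anything is missing. (Hypothetical helper.)
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

missing_tools mkcls plain2snt.out GIZA++ || echo "install the tools listed above first"
```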


Creating bilingual dictionaries

Obtaining corpora (and getAlignmentWithText.pl)

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz

The corpora need to be untarred and moved into a new, common directory, and the alignment file needs to be decompressed.

$ tar xvzf jrc-en.tgz
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis
$ gunzip jrc-en-it.xml.gz

Running getAlignmentWithText.pl

$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file

At this point, there will be an alignment file containing a few million lines of XML. It refers frequently to s1 and s2. Here s1 is the first of the two languages in the filename jrc-lang1-lang2.xml (jrc-en-it.xml in this example, hence English), and s2 is the second (Italian). The content of each should be extracted to its own file.

$ grep s1 alignment > en
$ grep s2 alignment > it

The following step is optional. If you're just testing whether everything works, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other sizes are also fine, but use the same number of lines for both files.

 
$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000
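
Since line N of one file is treated as the translation of line N of the other, the two files must have the same number of lines. A quick sanity check, as a sketch; same_line_count is a hypothetical helper, and the filenames are the ones from the example above.

```shell
# same_line_count FILE1 FILE2: succeed only if both files have equally many lines.
# (Hypothetical helper, for illustration only.)
same_line_count() {
  a=$(($(wc -l < "$1")))   # arithmetic expansion strips any padding from wc
  b=$(($(wc -l < "$2")))
  [ "$a" -eq "$b" ]
}

if [ -f en.10000 ] && [ -f it.10000 ]; then
  if same_line_count en.10000 it.10000; then
    echo "line counts match"
  else
    echo "line counts differ; recreate the subsets before continuing"
  fi
fi
```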

Running plain2snt.out

The filenames will vary depending on whether or not the optional step was done.

$ plain2snt.out en.10000 it.10000


Running GIZA++

Main article: Using GIZA++

$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute, and the input files were not tiny, check the log; it almost certainly failed.
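
One way to check the log is to search it for the word "error", which is how the failure quoted in the GIZA++ installation section is reported. This is a sketch only: log_errors is a hypothetical helper, and other failures may be worded differently, so an empty result is not proof of success.

```shell
# log_errors FILE: print lines mentioning "error" (any case), with line numbers.
# Exits non-zero when no such line exists. (Hypothetical helper.)
log_errors() {
  grep -n -i 'error' "$1"
}

if [ -f dictionary.log ]; then
  log_errors dictionary.log || echo "no lines mentioning errors in dictionary.log"
  # The last few lines usually show how far the run got.
  tail -n 5 dictionary.log
fi
```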

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.

The first 3 lines of dictionary.A3.final, on the example corpus, are the following. The alignments are apparently not particularly good, probably because the corpus is small.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })
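
In this output, each word of the source sentence is followed by ({ ... }), listing the positions of the target-sentence words it is aligned to. To get from there towards dictionary candidates, the aligned word pairs can be pulled out and counted. The following is a rough sketch, not part of GIZA++ or Apertium: a3_pairs is a hypothetical helper, and it assumes every sentence pair occupies exactly three physical lines in the file (comment, target sentence, annotated source sentence).

```shell
# a3_pairs FILE: print "source<TAB>target" for every aligned word pair.
# (Hypothetical helper; assumes three lines per sentence pair.)
a3_pairs() {
  awk '
    NR % 3 == 2 { n = split($0, tgt, " "); next }   # line 2: target sentence words
    NR % 3 == 0 {                                   # line 3: WORD ({ idx ... }) ...
      i = 1
      while (i <= NF) {
        word = $i
        i += 2                                      # skip the word and the "({"
        while (i <= NF && $i != "})") {
          # NULL collects unaligned target words, so it is left out.
          if (word != "NULL" && ($i + 0) >= 1 && ($i + 0) <= n)
            print word "\t" tgt[$i]
          i++
        }
        i++                                         # skip the closing "})"
      }
    }
  ' "$1"
}

# Most frequent pairs first; these are only candidates and need manual review.
if [ -f dictionary.A3.final ]; then
  a3_pairs dictionary.A3.final | sort | uniq -c | sort -rn | head -n 20
fi
```

Because the <s1> and <s2> tags were left in the input files, the first and last word of each sentence will still carry them in the output.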

Acknowledgements

This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.
