Getting started with induction tools

From Apertium
Latest revision as of 02:52, 20 May 2021

WARNING

This page is out of date as a result of the migration to GitHub. Please update this page with new documentation and remove this warning. If you are unsure how to proceed, please contact the GitHub migration team.


These are partial directions for getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. They may also be useful as a prerequisite to following the Apertium_New_Language_Pair_HOWTO.

See Installation troubleshooting for what to do on errors.

A few steps are Ubuntu-specific. Everything except mkcls is built from the SVN sources.

Installing the necessary programs

Prerequisite Ubuntu packages

Unfortunately, this is not a complete list; I'll attempt to add one once I've set this up on a clean Ubuntu install. The following packages are all required, though:

$ sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev

Installing Crossdics

The Crossdics page explains this correctly.

You may need to export a new value for JAVA_HOME before running ant jar.

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03

If in doubt about the exact version on your system, run ls /usr/lib/jvm/; it may differ from the one above.
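As a rough sketch of how to avoid hard-coding the version, JAVA_HOME can usually be derived from wherever javac actually lives. The helper name java_home_from is made up for this example, and the snippet assumes a Linux system with GNU readlink and javac on the PATH; it is not part of Crossdics.

```shell
# Derive a JAVA_HOME candidate from the full path of a javac binary
# by stripping the trailing /bin/javac.
# java_home_from is a hypothetical helper, not an existing tool.
java_home_from() {
    dirname "$(dirname "$1")"
}

# Resolve symlinks (e.g. /usr/bin/javac often points into /usr/lib/jvm/...).
# readlink -f is the GNU coreutils form; this is skipped if javac is absent.
if command -v javac >/dev/null 2>&1; then
    JAVA_HOME=$(java_home_from "$(readlink -f "$(command -v javac)")")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

If the printed directory looks wrong, fall back to picking one by hand from /usr/lib/jvm/ as described above.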

Installing lttoolbox

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Note where lttoolbox.pc gets placed, in case autogen.sh doesn't find lttoolbox.
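One possible way to track down the .pc file and make it visible to pkg-config is sketched below. The function name find_lttoolbox_pc_dir is hypothetical, the search root /usr/local is an assumption based on the default install prefix, and the pattern lttoolbox*.pc is used in case the file carries a version suffix.

```shell
# Print the directory holding the first lttoolbox*.pc found under a prefix.
# Returns non-zero if nothing is found.
# find_lttoolbox_pc_dir is a hypothetical helper for this page.
find_lttoolbox_pc_dir() {
    pc=$(find "$1" -name 'lttoolbox*.pc' 2>/dev/null | head -n 1)
    [ -n "$pc" ] && dirname "$pc"
}

# Assumed default prefix; adjust if ./configure was given --prefix.
if dir=$(find_lttoolbox_pc_dir /usr/local); then
    # Prepend the directory, keeping any existing PKG_CONFIG_PATH entries.
    export PKG_CONFIG_PATH="$dir${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
fi
```

The export only lasts for the current shell session, so it has to be run in the same terminal as the later autogen.sh call.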

Installing Apertium

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Installing ReTraTos

$ svn co https://retratos.svn.sourceforge.net/svnroot/retratos/trunk retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

(In case it doesn't find lttoolbox, see this workaround.)

Installing mkcls

Ubuntu:

Get the deb package appropriate for your version of Ubuntu at ubuntu-nlp.

$ sudo dpkg --install mkcls*.deb

Mac:

Get giza-pp from http://code.google.com/p/giza-pp/, then

$ tar xvzf giza-pp-VERSION.tar.gz
$ cd giza-pp/mkcls-v2
$ make
$ sudo cp mkcls /usr/bin/

Installing GIZA++

Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. A GIZA++ built this way needs its input files prepared differently; otherwise it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are instructions for compiling a version that works with the rest of this page.

$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cp Makefile Makefile.orig
$ sed -i 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g' Makefile
$ make
$ sudo make install
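To see what the sed expression does before running it on the real Makefile, it can be tried on a couple of sample lines. The two input lines below are invented for illustration and are not copied from the GIZA++ Makefile; strip_ttable_flag is likewise a hypothetical wrapper around the same expression.

```shell
# Same sed program as above, reading stdin instead of editing in place.
# strip_ttable_flag is a hypothetical name used only in this sketch.
strip_ttable_flag() {
    sed 's/ -DBINARY_SEARCH_FOR_TTABLE//;s/mkdir/mkdir -p/g'
}

# Invented sample lines standing in for the real Makefile.
printf '%s\n' \
    'CFLAGS_OPT = -O3 -DNDEBUG -DBINARY_SEARCH_FOR_TTABLE' \
    'mkdir optimized' |
    strip_ttable_flag
# prints:
# CFLAGS_OPT = -O3 -DNDEBUG
# mkdir -p optimized
```

The first substitution drops the compile flag together with the space before it; the second turns every mkdir into mkdir -p, presumably so that make does not fail when a build directory already exists. Running grep -c BINARY_SEARCH_FOR_TTABLE Makefile afterwards should report 0 if the edit took effect.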


Creating bilingual dictionaries

Obtaining corpora (and getAlignmentWithText.pl)

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl

Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [1] and [2].

Get the associated corpus and alignment files.

$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz

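The corpora need to be untarred and moved into a new, common directory. The alignment file is downloaded gzipped, so it also needs to be decompressed before the next step, which expects jrc-en-it.xml.

$ tar xvzf jrc-en.tgz
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis
$ gunzip jrc-en-it.xml.gz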

Running getAlignmentWithText.pl

$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment

Creating subsets of the alignment file

At this point, there will be an alignment file containing a few million lines of XML. It refers frequently to s1 (the first of the two languages in the filename jrc-lang1-lang2.xml; with jrc-en-it.xml in this example, that is English) and s2 (Italian). The content of each should be extracted to its own file.

$ grep s1 alignment > en
$ grep s2 alignment > it

The following step is optional. If you're just testing whether things work, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other numbers are also fine, but use the same number for both files.

 
$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000
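Since the two files have to stay parallel line by line, a quick sanity check on their lengths can catch a mistake before the long GIZA++ run. This is only a sketch; same_line_count is a hypothetical helper, not something shipped with GIZA++ or Apertium.

```shell
# Succeed only if both files have the same number of lines.
# same_line_count is a hypothetical helper for this page.
same_line_count() {
    # $(( ... )) normalises the padded output some wc versions produce.
    a=$(($(wc -l < "$1")))
    b=$(($(wc -l < "$2")))
    [ "$a" -eq "$b" ]
}

# Only run the check if the subset files from the step above exist.
if [ -f en.10000 ] && [ -f it.10000 ]; then
    if same_line_count en.10000 it.10000; then
        echo "line counts match"
    else
        echo "line counts differ; the files are not parallel" >&2
    fi
fi
```

The same check can be run on the full en and it files produced by grep.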

Running plain2snt.out

The filenames will vary depending on whether or not the optional step was done.

$ plain2snt.out en.10000 it.10000


Running GIZA++

Main article: Using GIZA++
$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log

If it stops in under a minute and the input files were not tiny, check dictionary.log; it almost certainly failed.

If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.
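A minimal way to script that check is sketched here, assuming the output prefix and log name used in the command above. The function name giza_run_ok is made up, and searching the log for the string ERROR is an assumption based on the error message quoted earlier on this page; other failures may be worded differently.

```shell
# Succeed if PREFIX.A3.final exists and is non-empty, and PREFIX.log
# contains no line mentioning ERROR.
# giza_run_ok is a hypothetical helper; the ERROR pattern is an assumption.
giza_run_ok() {
    prefix=$1
    [ -s "$prefix.A3.final" ] && ! grep -q 'ERROR' "$prefix.log" 2>/dev/null
}
```

For the run above this would be called as giza_run_ok dictionary, for instance in the form giza_run_ok dictionary || tail dictionary.log to show the end of the log on failure.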

The first 3 lines of dictionary.A3.final, on the example corpus, are the following (wrapped here for display). The alignments are apparently not particularly good, probably because the corpus is small.

# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i 
prodotti dell'orologeria</s2>
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) 
clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) 
and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })
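Reading the example, each sentence pair takes three lines in the file: a comment, the target sentence, and the source words, each followed by the positions of the target words it is aligned to (so products ({ 20 }) points at the 20th Italian token, prodotti). Under that reading, the aligned word pairs can be listed with a short awk script. This is a sketch, not a tool from GIZA++; a3_pairs is a hypothetical name, and it assumes the file really has exactly three unwrapped lines per sentence pair.

```shell
# Print "source<TAB>target" for every alignment link in an A3.final file.
# a3_pairs is a hypothetical helper; it assumes three lines per sentence pair.
a3_pairs() {
    awk '
    NR % 3 == 2 { split($0, tgt, " "); next }   # target sentence tokens
    NR % 3 == 0 {                               # source words with links
        m = split($0, tok, " ")
        i = 1
        while (i <= m) {
            word = tok[i]; i++
            if (tok[i] != "({") continue        # not a word/link group
            i++
            while (i <= m && tok[i] != "})") {
                # Skip NULL, which collects unaligned target words.
                if (word != "NULL") print word "\t" tgt[tok[i]]
                i++
            }
            i++                                 # step past the closing "})"
        }
    }' "$1"
}
```

A call such as a3_pairs dictionary.A3.final | sort | uniq -c | sort -rn | head would then show the most frequent pairs. Markers such as <s1> and </s2> stay attached to the first and last words, because they are part of the tokens in the file.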

See also

ReTraTos

Acknowledgements

This page wouldn't be possible without the kind assistance of 'spectie' on OFTC's #apertium. That said, all errors/typos are mine, not his.