Difference between revisions of "Getting started with induction tools"
(Initial half-done checkin, so as not to tempt Murphy.)
Revision as of 09:52, 28 March 2008
These are partial directions for getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. A few steps are Ubuntu-specific. Everything except mkcls is built from the SVN sources.
Installing the necessary programs
Prerequisite Ubuntu packages
This is, unfortunately, not a complete list; I'll attempt to add one once I've set this up on a clean Ubuntu install. However, the following should still be of some help; all are required:
sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev
Installing Crossdics
Crossdics explains this correctly.
You may need to export a new value for JAVA_HOME before running ant jar.
$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03
Run ls /usr/lib/jvm/ if in doubt about the exact version on your system; it may differ from the above.
Installing Apertium
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium
$ cd apertium
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Installing lttoolbox
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox
$ cd lttoolbox
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Installing ReTraTos
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/retratos
$ cd retratos
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Installing mkcls
Get the deb package appropriate for your version of Ubuntu at ubuntu-nlp.
$ sudo dpkg --install mkcls*.deb
Installing GIZA++
Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE. You need to prepare the input files differently for this, or it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to do that, so here are instructions for compiling a version that will work with the rest of this page.
$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz
$ tar xvzf giza-pp-v1.0.1.tar.gz
$ cd giza-pp/GIZA++-v2
$ cat Makefile | sed -e 's/-DBINARY_SEARCH_FOR_TTABLE//' | sed -e 's/mkdir/mkdir -p/g' > tmp
$ mv Makefile Makefile.orig
$ mv tmp Makefile
$ make
$ sudo make install
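To preview what the two sed expressions do before touching the real Makefile, here is a minimal sketch that applies them to a made-up sample. The two sample lines are hypothetical and only mimic the kind of content being edited; they are not copied from the GIZA++ Makefile.

```shell
# Hypothetical sample lines standing in for the real Makefile.
sample=$(mktemp)
printf '%s\n' \
  'CFLAGS_OPT = -O3 -DNDEBUG -DBINARY_SEARCH_FOR_TTABLE' \
  'mkdir $(INSTALL_DIR)' > "$sample"

# The same two substitutions as in the commands above:
# drop the define, and make mkdir tolerate an existing directory.
patched=$(sed -e 's/-DBINARY_SEARCH_FOR_TTABLE//' -e 's/mkdir/mkdir -p/g' "$sample")
printf '%s\n' "$patched"
rm -f "$sample"
```

Running the real pipeline and then comparing Makefile.orig with Makefile using diff should show changes of the same two kinds.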
Creating bilingual dictionaries.
Obtaining corpora (and getAlignmentWithText.pl)
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl
Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/ and http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/.
Get the associated corpus and alignment files.
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz
The alignment file needs to be decompressed, and the corpora need to be untarred and moved into a new, common directory.
$ gunzip jrc-en-it.xml.gz
$ tar xvzf jrc-en.tgz
$ tar xvzf jrc-it.tgz
$ mkdir acquis
$ mv en it acquis
Running getAlignmentWithText.pl
$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment
Creating subsets of the alignment file
At this point, there will be an alignment file containing a few million lines of XML. It will refer frequently to s1 (the first language of the two in the filename jrc-lang1-lang2.xml, which is jrc-en-it.xml in this example, hence, English) and s2 (Italian). The content of each should be extracted to its own file.
$ grep s1 alignment > en
$ grep s2 alignment > it
An optional step may be useful. If you're just testing to see whether it's working, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other numbers are also fine, but use the same number for both files.
$ head -n 10000 en > en.10000
$ head -n 10000 it > it.10000
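Since the two files are meant to be parallel, line for line, it may be worth confirming that they have the same number of lines before continuing. The following is a minimal sketch of such a check; the function name same_length is made up for illustration and is not part of any of the tools used here.

```shell
# same_length FILE1 FILE2: succeed only if both files have the same number of lines.
# (Hypothetical helper, not part of GIZA++ or Apertium.)
same_length() {
    # Arithmetic expansion strips any padding that wc adds on some systems.
    a=$(($(wc -l < "$1")))
    b=$(($(wc -l < "$2")))
    [ "$a" -eq "$b" ]
}

# Usage with the subset files from the optional step (skipped if they are absent).
if [ -f en.10000 ] && [ -f it.10000 ]; then
    if same_length en.10000 it.10000; then
        echo "line counts match"
    else
        echo "line counts differ"
    fi
fi
```

If the counts differ, the grep step probably matched lines in one language that have no counterpart in the other, and the files should be inspected before running the remaining steps.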
Running plain2snt.out
The filenames will vary depending on whether or not the optional step was done.
$ plain2snt.out en.10000 it.10000
Running mkcls
mkcls needs to be run once per language.
$ mkcls -m2 -pen.10000 -c50 -Ven.10000.vcb.classes opt >& mkcls1.log
$ mkcls -m2 -pit.10000 -c50 -Vit.10000.vcb.classes opt >& mkcls2.log
Running GIZA++
$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log
If it stops in under a minute and the input files were not tiny, check the log; it almost certainly failed.
If it worked, the following files will have been created:
dictionary.perp dictionary.trn.src.vcb dictionary.trn.trg.vcb dictionary.tst.src.vcb dictionary.tst.trg.vcb
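A quick way to confirm that the files listed above exist is sketched below. The function name check_outputs is made up for illustration; the list of suffixes is taken from the file names above, and the prefix is whatever was passed to -o.

```shell
# check_outputs PREFIX: report which of the expected output files are missing.
# (Hypothetical helper; returns 0 if all files exist, 1 otherwise.)
check_outputs() {
    prefix=$1
    missing=0
    for suffix in perp trn.src.vcb trn.trg.vcb tst.src.vcb tst.trg.vcb; do
        if [ ! -f "$prefix.$suffix" ]; then
            echo "missing: $prefix.$suffix"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: "dictionary" is the prefix given to -o in the GIZA++ command.
check_outputs dictionary || echo "GIZA++ may have failed; see dictionary.log"
```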
Acknowledgements
This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.