https://wiki.apertium.org/w/api.php?action=feedcontributions&user=Kb&feedformat=atomApertium - User contributions [en]2024-03-29T15:41:40ZUser contributionsMediaWiki 1.34.1https://wiki.apertium.org/w/index.php?title=Main_Page&diff=4593Main Page2008-03-28T11:49:03Z<p>Kb: </p>
<hr />
<div>{{Main page header}}<br />
<br />
{|style="width:100%; background:#fbfbfb; margin-top:1.2em; border:1px solid #ccc;"<br />
|style="width:50%; color:#000; vertical-align: top;padding-left:20px; padding-right:20px;"|<br />
<br />
==Documentation==<br />
{{main|Documentation}}<br />
* '''[[Using SVN]]''' &mdash; cheatsheet on how to use SVN for users and developers.<br />
* '''[[Getting Started]]''' &mdash; how to install Apertium, GIZA++, etc., and create a bilingual dictionary.<br />
* '''[[Apertium New Language Pair HOWTO]]''' &mdash; step-by-step description of how to start a new language pair in Apertium.<br />
** [[Créer une nouvelle paire de langues]] <br />
** [[Uputstvo za novi jezički par za Apertium]]<br />
** [[Апертиум, как се създава нова езикова двойка]]<br />
** [[Kiel aldoni novan lingvan duon]]<br />
** [[Упатство за креирање на нови јазични парови]]<br />
<br />
* '''[[Contributing to an existing pair]]''' &mdash; some pointers for contributing to an existing pair.<br />
** [[Comment contribuer à une paire de langues existante]]<br />
<br />
* '''[[Frequently Asked Questions]]''' &mdash; what it says in the link.<br />
|style="width:50%; color:#000; vertical-align: top; padding-left:20px; padding-right:20px;"|<br />
==Translators==<br />
{{main|List of language pairs}}<br />
The following pairs have released versions and are considered to be stable:<br />
<br />
:*Spanish ⇆ Catalan (es-ca)<br />
:*Spanish ← Romanian (es-ro)<br />
:*French ⇆ Catalan (fr-ca)<br />
:*Occitan ⇆ Catalan (oc-ca)<br />
:*Spanish ⇆ Portuguese (es-pt)<br />
:*English ⇆ Catalan (en-ca)<br />
:*English ⇆ Spanish (en-es)<br />
:*Spanish ⇆ Galician (es-gl)<br />
<br />
:*French ⇆ Spanish (fr-es)<br />
:*Esperanto ← Spanish (eo-es)<br />
:*Esperanto ← Catalan (eo-ca)<br />
<br />
Other pairs currently in development may be found in our [[Using SVN|SVN repository]]. You can [[test drive]] release versions [http://xixona.dlsi.ua.es/apertium/ here], and bi-daily builds [http://xixona.dlsi.ua.es/testing/ here].<br />
<br />
|}<br />
<br />
==Discussions==<br />
<br />
*[[Afrikaans to Dutch]]<br />
*[[Afrikaans to English]]<br />
*[[Basque to Spanish]]<br />
*[[French to Catalan]]<br />
*[[Welsh to English]]<br />
*[[English to Polish]]<br />
*[[English to Catalan]]<br />
<br />
__NOTOC__<br />
__NOEDITSECTION__<br />
<br />
[[Category:Top-level categories]]</div>
Getting started with induction tools (revision of 2008-03-28T11:47:00Z by Kb)
<hr />
<div>{{TOCD}}<br />
<br />
These are partial directions for getting started with Apertium and related tools (such as GIZA++), with the end goal of creating bilingual dictionaries. This may also be useful as a prerequisite to following the [[Apertium_New_Language_Pair_HOWTO]].<br />
<br />
A few steps are Ubuntu-specific. Everything except mkcls is built from the SVN sources.<br />
<br />
== Installing the necessary programs ==<br />
<br />
=== Prerequisite Ubuntu packages ===<br />
This is, unfortunately, not a complete list; I'll attempt to add one once I've set this up on a clean Ubuntu install. However, the following should still be of some help; all are required:<br />
<br />
<pre><br />
sudo apt-get install automake libtool libxml2-dev flex libpcre3-dev<br />
</pre><br />
<br />
=== Installing Crossdics ===<br />
[[Crossdics]] explains this correctly.<br />
<br />
You may need to export a new value for JAVA_HOME before running ant jar. <br />
<pre><br />
$ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.03<br />
</pre><br />
<br />
Run ls /usr/lib/jvm/ if in doubt about the exact version on your system; it may differ from the above.<br />
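If several JVMs are installed, a small sketch like the following can pick the first matching directory automatically. This helper is hypothetical and not part of the original instructions, and the default glob pattern java-6-sun* is an assumption; check the output of ls /usr/lib/jvm/ on your system first.<br />

```shell
# Hypothetical helper (not part of the original instructions): pick the first
# JVM directory matching a glob pattern under a base directory.
# The default pattern 'java-6-sun*' is an assumption; adjust it to whatever
# ls /usr/lib/jvm/ shows on your machine.
find_java_home() {
  pattern="${1:-java-6-sun*}"
  base="${2:-/usr/lib/jvm}"
  for dir in "$base"/$pattern; do
    if [ -d "$dir" ]; then
      echo "$dir"
      return 0
    fi
  done
  echo "no JVM matching $pattern under $base" >&2
  return 1
}
```

Then, for example: export JAVA_HOME=$(find_java_home)<br />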
<br />
<br />
=== Installing Apertium ===<br />
<pre>$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium<br />
$ cd apertium<br />
$ ./autogen.sh<br />
$ ./configure<br />
$ make<br />
$ sudo make install<br />
</pre><br />
<br />
=== Installing lttoolbox ===<br />
<pre><br />
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/lttoolbox<br />
$ cd lttoolbox<br />
$ ./autogen.sh<br />
$ ./configure<br />
$ make<br />
$ sudo make install<br />
</pre><br />
<br />
=== Installing ReTraTos ===<br />
<pre><br />
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/retratos<br />
$ cd retratos<br />
$ ./autogen.sh<br />
$ ./configure<br />
$ make<br />
$ sudo make install<br />
</pre><br />
<br />
=== Installing mkcls ===<br />
Get the deb package appropriate for your version of Ubuntu at [http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/ ubuntu-nlp].<br />
<pre><br />
$ sudo dpkg --install mkcls*.deb<br />
</pre><br />
<br />
=== Installing GIZA++ ===<br />
Unfortunately, the deb package at ubuntu-nlp was built with -DBINARY_SEARCH_FOR_TTABLE, which requires the input files to be prepared differently; without that preparation it dies with "ERROR: NO COOCURRENCE FILE GIVEN!". I don't know how to prepare the files that way, so here are instructions for compiling a version that will work with the rest of this page.<br />
<br />
<pre><br />
$ wget http://giza-pp.googlecode.com/files/giza-pp-v1.0.1.tar.gz<br />
$ tar xvzf giza-pp-v1.0.1.tar.gz<br />
$ cd giza-pp/GIZA++-v2<br />
$ sed -e 's/-DBINARY_SEARCH_FOR_TTABLE//' -e 's/mkdir/mkdir -p/g' Makefile > tmp<br />
$ mv Makefile Makefile.orig<br />
$ mv tmp Makefile<br />
$ make<br />
$ sudo make install<br />
</pre><br />
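After the installs above, it can save time to confirm everything actually landed on the PATH before moving on. The helper below is a hypothetical sketch, not part of the original instructions; the list of binary names in the usage line is an assumption based on the packages installed above.<br />

```shell
# Hypothetical sanity check (not part of the original instructions):
# report which of the named binaries are on the PATH, and return nonzero
# if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return $missing
}
```

For example: check_tools apertium mkcls GIZA++ plain2snt.out (the exact binary names are assumptions; adjust to your installation).<br />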
<br />
<br />
== Creating bilingual dictionaries ==<br />
<br />
=== Obtaining corpora (and getAlignmentWithText.pl) ===<br />
<pre><br />
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/getAlignmentWithText.pl<br />
</pre><br />
<br />
Choose a language pair. For this example, it will be Italian (it) and English (en). To use another pair, find out the filenames by browsing [http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/] and [http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/].<br />
<br />
Get the associated corpus and alignment files.<br />
<pre><br />
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/alignments/jrc-en-it.xml.gz<br />
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-en.tgz<br />
$ wget http://wt.jrc.it/lt/Acquis/JRC-Acquis.3.0/corpus/jrc-it.tgz<br />
$ gunzip jrc-en-it.xml.gz<br />
</pre><br />
<br />
The corpora need to be untarred, and inserted into a new, common directory.<br />
<pre><br />
$ tar xvzf jrc-en.tgz <br />
$ tar xvzf jrc-it.tgz<br />
$ mkdir acquis<br />
$ mv en it acquis<br />
</pre><br />
<br />
=== Running getAlignmentWithText.pl ===<br />
<pre><br />
$ perl getAlignmentWithText.pl -acquisDir acquis/ jrc-en-it.xml > alignment<br />
</pre><br />
<br />
=== Creating subsets of the alignment file ===<br />
At this point, there will be an alignment file containing a few million lines of XML. It refers frequently to s1 and s2: s1 is the first of the two languages in the filename jrc-lang1-lang2.xml (English, since the file here is jrc-en-it.xml) and s2 is the second (Italian). The content of each should be extracted to its own file.<br />
<br />
<pre><br />
$ grep s1 alignment > en<br />
$ grep s2 alignment > it<br />
</pre><br />
<br />
An optional step may be useful here. If you're just testing to see whether everything works, or have an old machine, it's better to use only a subset of the data, so that it can be analyzed in a couple of hours rather than days. Using 10,000 lines from each language took around 2 hours on my 1.8 GHz Athlon with 512 MB of RAM. Other numbers are also fine, but use the same number for both files.<br />
<br />
<pre> <br />
$ head -n 10000 en > en.10000<br />
$ head -n 10000 it > it.10000<br />
</pre><br />
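Since GIZA++ treats the two files as sentence-aligned line by line, it is worth confirming the line counts match before continuing. The following check is a hypothetical helper, not part of the original recipe:<br />

```shell
# Hypothetical helper (not part of the original recipe): verify that two
# corpus files have the same number of lines, as required for alignment.
check_parallel() {
  # arithmetic expansion normalizes any leading whitespace from wc
  n1=$(( $(wc -l < "$1") ))
  n2=$(( $(wc -l < "$2") ))
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 lines in each file"
  else
    echo "MISMATCH: $1 has $n1 lines, $2 has $n2" >&2
    return 1
  fi
}
```

For example: check_parallel en.10000 it.10000<br />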
<br />
=== Running plain2snt.out ===<br />
The filenames will vary depending on whether or not the optional step was done.<br />
<br />
<pre><br />
$ plain2snt.out en.10000 it.10000<br />
</pre><br />
<br />
=== Running mkcls ===<br />
mkcls needs to be run once per language.<br />
<br />
<pre><br />
$ mkcls -m2 -pen.10000 -c50 -Ven.10000.vcb.classes opt >& mkcls1.log<br />
$ mkcls -m2 -pit.10000 -c50 -Vit.10000.vcb.classes opt >& mkcls2.log<br />
</pre><br />
<br />
=== Running GIZA++ ===<br />
<pre><br />
$ GIZA++ -S en.10000.vcb -T it.10000.vcb -C en.10000_it.10000.snt -p0 0.98 -o dictionary >& dictionary.log<br />
</pre><br />
<br />
If it stops in under a minute and the input files were not tiny, check the log; it almost certainly failed.<br />
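One way to scan the log for trouble is sketched below. This is a hypothetical helper, and the error patterns are guesses (the only message confirmed on this page is "ERROR: NO COOCURRENCE FILE GIVEN!"); treat anything it flags as a pointer, not a diagnosis.<br />

```shell
# Hypothetical log scan (the patterns are assumptions): print any lines that
# look like errors, and return nonzero if any were found.
scan_log() {
  if grep -i -E 'error|fail|abort' "$1"; then
    echo "possible problems found in $1" >&2
    return 1
  fi
  echo "no obvious errors in $1"
}
```

For example: scan_log dictionary.log<br />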
<br />
If it worked, a number of files will be created. The one most likely to be of interest is dictionary.A3.final, which contains alignment information.<br />
<br />
The first three lines of dictionary.A3.final, from the example corpus, are the following. The alignments are apparently not particularly good, probably due to the small corpus.<br />
<pre><br />
# Sentence pair (1) source length 26 target length 21 alignment score : 7.66784e-37<br />
<s2>ACCORDO COMPLEMENTARE all'accordo tra la Comunità economica europea nonché i suoi Stati membri e la Confederazione svizzera, concernente i prodotti dell'orologeria</s2><br />
NULL ({ 10 }) <s1>ADDITIONAL ({ 1 }) AGREEMENT ({ }) to ({ }) the ({ }) Agreement ({ }) concerning ({ }) products ({ 20 }) of ({ }) the ({ }) clock ({ 2 3 17 18 19 }) and ({ }) watch ({ }) industry ({ }) between ({ 4 }) the ({ 5 }) European ({ 6 }) Economic ({ 7 8 }) Community ({ }) and ({ 9 }) its ({ 11 }) Member ({ 13 }) States ({ 12 }) and ({ 14 }) the ({ 15 }) Swiss ({ 16 }) Confederation</s1> ({ 21 })<br />
</pre><br />
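The "# Sentence pair" header lines make quick checks on the output easy. For instance, a hypothetical helper (not part of the original instructions) to count how many sentence pairs were aligned:<br />

```shell
# Hypothetical helper: count sentence pairs in a GIZA++ .A3.final file.
# Each pair starts with a '# Sentence pair (N) ...' header line.
count_pairs() {
  grep -c '^# Sentence pair' "$1"
}
```

For example: count_pairs dictionary.A3.final<br />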
<br />
== Acknowledgements ==<br />
This page wouldn't be possible without the kind assistance of 'spectie' on Freenode's #apertium. That said, all errors/typos are mine, not his.</div>