Using GIZA++

From Apertium
Latest revision as of 11:51, 29 April 2015


GIZA++ is a program for aligning words and sequences of words in sentence-aligned corpora. If you have a parallel corpus, you can use GIZA++ to make bilingual dictionaries for Apertium (e.g. using ReTraTos) or to learn lexical selection rules.

Compiling

git clone https://github.com/moses-smt/giza-pp
cd giza-pp
make
cp GIZA++-v2/GIZA++ /path/prefix/bin/
cp GIZA++-v2/plain2snt.out /path/prefix/bin/
cp GIZA++-v2/snt2cooc.out /path/prefix/bin/
cp GIZA++-v2/snt2plain.out /path/prefix/bin/
cp GIZA++-v2/trainGIZA++.sh /path/prefix/bin/
cp mkcls-v2/mkcls /path/prefix/bin/

The directory /path/prefix/bin is what you pass as -external-bin-dir to Moses, e.g. perl train-model.perl -external-bin-dir /path/prefix/bin
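Before starting a long training run it can be worth confirming that all six files were actually copied. The following is a minimal sketch; the function name check_giza_tools is hypothetical, and the tool list is simply the set of files copied above.

```shell
#!/bin/sh
# Report which of the files copied in the build step are missing
# from a bin directory. Returns 0 if all are present, 1 otherwise.
# check_giza_tools is an illustrative name, not part of GIZA++.
check_giza_tools() {
    bin_dir=$1
    missing=0
    for tool in GIZA++ plain2snt.out snt2cooc.out snt2plain.out trainGIZA++.sh mkcls; do
        if [ ! -f "$bin_dir/$tool" ]; then
            echo "missing: $bin_dir/$tool" >&2
            missing=1
        fi
    done
    return $missing
}

# Example call (commented out so the sketch has no side effects):
# check_giza_tools /path/prefix/bin && echo "all tools present"
```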

Troubleshooting

If you're running a case-insensitive file system (typical for Mac OS), you'll have to edit the file model3.cpp before compiling, since it outputs two files named foo.a3.final and foo.A3.final, and on such a file system the two names refer to the same file. So before compiling, change e.g. lines 321-322:

      alignfile = Prefix + ".A3." + number ;
      test_alignfile = Prefix + ".tst.A3." + number ;

to something like:

      alignfile = Prefix + ".AA3." + number ;
      test_alignfile = Prefix + ".tst.AA3." + number ;
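Whether this edit is needed can be checked before compiling. The following is a rough sketch that probes a directory by creating a file with a lower-case name and looking it up under an upper-case name; the function name fs_is_case_insensitive and the probe file names are illustrative only.

```shell
#!/bin/sh
# Return 0 if the file system holding the given directory treats
# names case-insensitively, 1 if it is case-sensitive, and 2 if the
# probe directory could not be created.
# fs_is_case_insensitive is an illustrative name.
fs_is_case_insensitive() {
    dir=${1:-.}
    probe=$(mktemp -d "$dir/casecheck.XXXXXX") || return 2
    touch "$probe/a3.final"
    # On a case-insensitive file system this lookup finds the file above.
    if [ -e "$probe/A3.final" ]; then
        result=0
    else
        result=1
    fi
    rm -rf "$probe"
    return $result
}

# Example call (commented out so the sketch has no side effects):
# fs_is_case_insensitive . && echo "edit model3.cpp before compiling"
```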

Usage

Download your corpus and convert it to one sentence per line, so that each line of one file corresponds to the same line of the other.


Use plain2snt.out to convert your corpus into GIZA++ format:

$ plain2snt.out sv-text.txt da-text.txt 
w1:sv-text w2:da-text
sv-text -> sv-text
da-text -> da-text

You may get some warnings about empty sentences like these:

WARNING: filtered out empty sentence (source: sv-text.txt 23 target: da-text.txt 0).
WARNING: filtered out empty sentence (source: sv-text.txt 34 target: da-text.txt 0).

With a large corpus you may get many of these warnings. If you get a lot of them, consider using a different corpus.
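If only a handful of pairs are affected, one option is to drop every pair with an empty side before running plain2snt.out, which keeps the two files line-aligned. The following is a minimal sketch, assuming both files have one sentence per line; the function name filter_empty_pairs and the output file names are illustrative.

```shell
#!/bin/sh
# Copy two line-aligned files, skipping every pair in which either
# line is empty or contains only spaces and tabs. The outputs stay
# aligned with each other because pairs are dropped together.
# filter_empty_pairs is an illustrative name.
filter_empty_pairs() {
    src=$1; tgt=$2; out_src=$3; out_tgt=$4
    : > "$out_src"
    : > "$out_tgt"
    awk -v tgt="$tgt" -v out_s="$out_src" -v out_t="$out_tgt" '
        {
            # Read the matching line from the second file; stop at its end.
            if ((getline other < tgt) <= 0) exit
            if ($0 ~ /[^ \t]/ && other ~ /[^ \t]/) {
                print $0 > out_s
                print other > out_t
            }
        }
    ' "$src"
}

# Example call (commented out so the sketch has no side effects):
# filter_empty_pairs sv-text.txt da-text.txt sv-clean.txt da-clean.txt
```

The cleaned files would then be passed to plain2snt.out in place of the originals.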

After you've done this, you should have a couple of .snt files and a couple of .vcb files.

Next you need to generate word classes, using mkcls:

$ mkcls -m2 -psv-text.txt -c50 -Vsv-text.vcb.classes opt >& mkcls1.log
$ mkcls -m2 -pda-text.txt -c50 -Vda-text.vcb.classes opt >& mkcls2.log

Now use GIZA++ to build your dictionary (-S is the source language, -T is the target language, -C is the generated aligned text file, and -o is the output file prefix):

$ GIZA++ -S sv-text.vcb -T da-text.vcb -C sv-text_da-text.snt -p0 0.98 -o dictionary >& dictionary.log

Then wait. You can follow progress in dictionary.log, but training is likely to take upwards of 10 hours (and up to several days), so have something else planned.

The final alignment can be found in the file dictionary.A3.final (or dictionary.AA3.final if you applied the rename described under Troubleshooting).
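To get from the alignment file to candidate dictionary entries, the aligned word pairs have to be pulled out of it. The sketch below assumes the usual three-line layout of the A3 file: a comment line starting with "# Sentence pair", then one sentence, then the words of the other sentence, each followed by "({ ... })" holding the positions it aligns to in the preceding line. Check your own output before relying on this; the function name a3_word_pairs is illustrative.

```shell
#!/bin/sh
# Print one tab-separated word pair per alignment link found in an
# A3-style file (assumed layout: comment line, sentence line, then a
# line of the form: NULL ({ }) word ({ 1 }) word ({ 2 3 }) ...).
# Column 1 is the word from the third line, column 2 the word it
# points to in the second line. Links from NULL are skipped.
# a3_word_pairs is an illustrative name.
a3_word_pairs() {
    awk '
        NR % 3 == 2 { split($0, words, " "); next }
        NR % 3 == 0 {
            i = 1
            while (i <= NF) {
                word = $i
                i += 2                      # skip the word and the opening "({"
                while (i <= NF && $i != "})") {
                    if (word != "NULL") print word "\t" words[$i]
                    i++
                }
                i++                         # skip the closing "})"
            }
        }
    ' "$1"
}

# Example call (commented out so the sketch has no side effects):
# a3_word_pairs dictionary.A3.final | sort | uniq -c | sort -rn | head
```

Counting the pairs with sort and uniq -c, as in the commented example, gives a rough frequency-ranked list that can be reviewed by hand.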

trainGIZA++.sh

Note: These changes apply only if you are not planning to use Moses.

To use the trainGIZA++.sh script, you need to make a few changes before compiling:

In Makefile change:

CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE

to:

CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE

and in trainGIZA++.sh itself, change:

if( $# != 3 )

to:

if( $#argv != 3 )

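See also

* Corpora
* ReTraTos
* Moses
* Mgiza, a more recent alternative to GIZA++
* Learning rules from parallel and non-parallel corpora, using GIZA++ to make lexical selection rules

External links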