Using GIZA++
Revision as of 16:56, 12 July 2008
If you have parallel corpora you can use GIZA++ to make bilingual dictionaries (e.g. using ReTraTos).
Download your corpora and convert them to one sentence per line.
Download and compile GIZA++.
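Since the two files are meant to line up sentence by sentence, it can help to confirm that they have the same number of lines before going further. The following is only a minimal sketch: the function name check_parallel is hypothetical and not part of GIZA++, and it checks line counts only, not whether the sentences actually correspond.

```shell
# check_parallel SRC TGT
# Hypothetical helper: succeeds only if both files have the same line count.
check_parallel() {
    # tr strips the padding that some wc implementations add
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1 has $src_lines, $2 has $tgt_lines" >&2
        return 1
    fi
    echo "both files have $src_lines lines"
}
```

With the file names used on this page, it would be called as check_parallel sv-text.txt da-text.txt.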
Use plain2snt.out to convert your corpus into GIZA++ format:
$ plain2snt.out sv-text.txt da-text.txt
w1:sv-text
w2:da-text
sv-text -> sv-text
da-text -> da-text
You may get some warnings about empty sentences like these:
WARNING: filtered out empty sentence (source: sv-text.txt 23 target: da-text.txt 0).
WARNING: filtered out empty sentence (source: sv-text.txt 34 target: da-text.txt 0).
With a large corpus you may get many of these. If you get a lot of warnings, consider changing the corpus.
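One possible alternative to switching corpus is to drop the sentence pairs where either side is empty before running plain2snt.out. This is only a sketch under assumptions: the function name filter_empty_pairs and the output file names are made up for illustration, the text is assumed to contain no tab characters, and both files are assumed to have the same number of lines.

```shell
# filter_empty_pairs SRC_IN TGT_IN SRC_OUT TGT_OUT
# Hypothetical helper: keeps a line pair only if both sides contain
# at least one non-space character. Assumes the text has no tabs.
filter_empty_pairs() {
    paste "$1" "$2" | awk -F '\t' -v s="$3" -v t="$4" '
        $1 ~ /[^ ]/ && $2 ~ /[^ ]/ {
            print $1 > s   # source side of the surviving pair
            print $2 > t   # target side of the surviving pair
        }'
}
```

For example, filter_empty_pairs sv-text.txt da-text.txt sv-clean.txt da-clean.txt would write the filtered pair to two new files and leave the originals untouched.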
After you've done this, you should have a couple of .snt files and a couple of .vcb files.
Next you need to generate word classes, using mkcls:
$ mkcls -m2 -psv-text.txt -c50 -Vsv-text.vcb.classes opt >& mkcls1.log
$ mkcls -m2 -pda-text.txt -c50 -Vda-text.vcb.classes opt >& mkcls2.log
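The two mkcls calls differ only in the language prefix, so they could be wrapped in a small loop that gives each run its own log file. This is a sketch rather than an official recipe: the function name run_mkcls and the log naming are invented here, the mkcls options are copied unchanged from the commands above, and the portable > file 2>&1 form stands in for >&.

```shell
# run_mkcls PREFIX...
# Hypothetical wrapper: runs mkcls once per corpus prefix (e.g. sv-text),
# with the same options as above and one log file per prefix.
run_mkcls() {
    for prefix in "$@"; do
        mkcls -m2 -p"$prefix.txt" -c50 -V"$prefix.vcb.classes" opt \
            > "mkcls-$prefix.log" 2>&1 || return 1
    done
}

# Usage (requires mkcls on your PATH):
#   run_mkcls sv-text da-text
```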
Now use GIZA++ to build your dictionary (-S is the source language, -T is the target language, -C is the generated aligned text file, and -o is the output file prefix):
$ GIZA++ -S sv-text.vcb -T da-text.vcb -C sv-text_da-text.snt -p0 0.98 -o dictionary >& dictionary.log
Then wait. You can watch the log in dictionary.log, but the training is likely to take upwards of 10 hours, and possibly as long as several days, so have something else planned.
The final alignment can be found in the file dictionary.A3.final.
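To get from the alignment file to candidate dictionary entries, you could pull out the aligned word pairs and count them. The sketch below assumes the usual three-line A3 layout per sentence pair: a "# Sentence pair" comment line, a plain sentence line, and a line of the form NULL ({ }) word ({ 1 }) word ({ 2 3 }) whose numbers index into the plain sentence. The function name extract_pairs is hypothetical, and if your file is laid out differently the script will need adjusting.

```shell
# extract_pairs FILE
# Hypothetical helper: prints one "annotated-word<TAB>plain-word" line
# per alignment link, assuming three lines per sentence pair.
extract_pairs() {
    awk '
    NR % 3 == 2 { split($0, tgt, " "); next }   # plain sentence line
    NR % 3 == 0 {                               # annotated line
        i = 1
        while (i <= NF) {
            src = $i
            i += 2                              # skip the word and "({"
            while (i <= NF && $i != "})") {
                if (src != "NULL") print src "\t" tgt[$i]
                i++
            }
            i++                                 # skip "})"
        }
    }' "$1"
}
```

Piping the result through sort | uniq -c | sort -rn, for example extract_pairs dictionary.A3.final | sort | uniq -c | sort -rn, lists the most frequently aligned pairs first.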
trainGIZA++.sh
To use the trainGIZA++.sh script, you need to make a few changes before compiling:
In Makefile change:
CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE
to:
CFLAGS_OPT = $(CFLAGS) -O3 -DNDEBUG -DWORDINDEX_WITH_4_BYTE
and in trainGIZA++.sh itself, change:
if( $# != 3 )
to:
if( $#argv != 3 )
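If you would rather not edit the two files by hand, the same changes could be scripted with sed. This is a sketch that assumes the lines appear in your copy of the sources exactly as shown above; the function names patch_makefile and patch_train_script are made up for illustration. Each function keeps a .orig backup and writes back into the existing file so that its permissions are preserved.

```shell
# patch_makefile FILE
# Hypothetical helper: removes the -DBINARY_SEARCH_FOR_TTABLE flag.
patch_makefile() {
    cp "$1" "$1.orig" &&
        sed 's/ -DBINARY_SEARCH_FOR_TTABLE//' "$1.orig" > "$1"
}

# patch_train_script FILE
# Hypothetical helper: rewrites the argument-count test to use $#argv.
patch_train_script() {
    cp "$1" "$1.orig" &&
        sed 's/if( \$# != 3 )/if( $#argv != 3 )/' "$1.orig" > "$1"
}

# Usage, from the GIZA++ source directory:
#   patch_makefile Makefile
#   patch_train_script trainGIZA++.sh
```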