Difference between revisions of "Using GIZA++"

From Apertium

Revision as of 23:30, 27 March 2008

If you have parallel corpora you can use GIZA++ to make bilingual dictionaries.

Download your corpora and convert them to one sentence per line, so that line N of one file is the translation of line N of the other.
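Because the two files are paired line by line, it can help to confirm that they have the same number of lines before going further. The following is only a minimal sketch: the function name check_parallel is made up for this illustration and is not part of GIZA++.

```shell
# Hypothetical helper (not part of GIZA++): compare the line counts of two files.
# Usage: check_parallel sv-text.txt da-text.txt
check_parallel() {
    # Arithmetic expansion strips any padding that wc may print.
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 lines in each file"
    else
        echo "MISMATCH: $1 has $n1 lines, $2 has $n2 lines" >&2
        return 1
    fi
}
```

A mismatch usually means that sentence splitting went differently in the two languages, and is worth fixing before running plain2snt.out.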

Download and compile GIZA++.

Use plain2snt.out to convert your corpus into GIZA++ format:

$ plain2snt.out sv-text.txt da-text.txt 
w1:sv-text w2:da-text
sv-text -> sv-text
da-text -> da-text

You may get some warnings about empty sentences like these:

WARNING: filtered out empty sentence (source: sv-text.txt 23 target: da-text.txt 0).
WARNING: filtered out empty sentence (source: sv-text.txt 34 target: da-text.txt 0).

If the corpus is large, you may get a lot of these warnings. If you get a great many of them, consider changing the corpus.
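If you would rather clean the corpus than replace it, one option is to drop every line pair in which either side is empty before running plain2snt.out. This is a sketch under assumptions: drop_empty_pairs is a hypothetical helper written for this page, and it treats a line as empty when it contains only spaces, tabs or carriage returns.

```shell
# Hypothetical helper (not part of GIZA++): keep only the line pairs in which
# both sides contain at least one non-blank character.
# Usage: drop_empty_pairs SRC TGT OUT_SRC OUT_TGT
drop_empty_pairs() {
    src=$1; tgt=$2; out_src=$3; out_tgt=$4
    # Create (or truncate) the outputs so they exist even if nothing is kept.
    : > "$out_src"
    : > "$out_tgt"
    # awk reads SRC as its main input and pulls the matching TGT line with getline.
    awk -v tgt="$tgt" -v os="$out_src" -v ot="$out_tgt" '
    {
        if ((getline t < tgt) <= 0) exit      # TGT ran out of lines: stop
        if ($0 ~ /[^ \t\r]/ && t ~ /[^ \t\r]/) {
            print $0 > os
            print t  > ot
        }
    }' "$src"
}
```

For example, `drop_empty_pairs sv-text.txt da-text.txt sv-clean.txt da-clean.txt` would write the filtered pair to two new files (the output names here are only examples), which can then be passed to plain2snt.out instead of the originals.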

After you've done this, you should have a couple of .snt files and a couple of .vcb files.
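Before moving on, you may want to check that the files used by the later GIZA++ command are really there. The sketch below assumes the naming seen in the commands on this page (sv-text.vcb, da-text.vcb and sv-text_da-text.snt); check_giza_inputs is a hypothetical name, not a GIZA++ tool.

```shell
# Hypothetical helper (not part of GIZA++): verify that the vocabulary and
# sentence files for a language pair exist and are not empty.
# Usage: check_giza_inputs sv-text da-text
check_giza_inputs() {
    src=$1; tgt=$2; missing=0
    for f in "$src.vcb" "$tgt.vcb" "${src}_${tgt}.snt"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run it in the directory where plain2snt.out wrote its output; it returns a non-zero status if any of the three files is missing or empty.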

Next you need to generate word classes, using mkcls:

$ mkcls -m2 -psv-text.txt -c50 -Vsv-text.vcb.classes opt >& mkcls1.log
$ mkcls -m2 -pda-text.txt -c50 -Vda-text.vcb.classes opt >& mkcls2.log

Now use GIZA++ to build your dictionary (-S is the source language, -T is the target language, -C is the generated aligned text file, and -o is the output file prefix):

$ GIZA++ -S sv-text.vcb -T da-text.vcb -C sv-text_da-text.snt -p0 0.98 -o dictionary >& dictionary.log

Then wait. You can follow the progress in dictionary.log, but the training is likely to take upwards of 10 hours (and up to several days), so have something else planned.

See also

External links