Difference between revisions of "Using GIZA++"

From Apertium

Revision as of 12:51, 7 October 2007

If you have parallel corpora you can use GIZA++ to make bilingual dictionaries.

Download your corpora and convert them to one sentence per line, so that line N of one file is the translation of line N of the other.
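Before going further it can be worth checking that the two files really do line up. The following is only a minimal sketch: `same_line_count` is a hypothetical helper (not part of GIZA++), and it only compares line counts, so it will not catch sentences that are misaligned within files of equal length.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    # $(( ... )) strips the padding some wc implementations add
    src_lines=$(( $(wc -l < "$1") ))
    tgt_lines=$(( $(wc -l < "$2") ))
    [ "$src_lines" -eq "$tgt_lines" ]
}

# Usage with the file names used on this page:
# same_line_count sv-text.txt da-text.txt || echo "line counts differ" >&2
```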

Download and compile GIZA++.

Use plain2snt.out to convert your corpus into GIZA++ format:

$ plain2snt.out sv-text.txt da-text.txt 
w1:sv-text w2:da-text
sv-text -> sv-text
da-text -> da-text

You may get some warnings about empty sentences like these:

WARNING: filtered out empty sentence (source: sv-text.txt 23 target: da-text.txt 0).
WARNING: filtered out empty sentence (source: sv-text.txt 34 target: da-text.txt 0).

If the corpus is large, you may get a lot of these warnings.
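If the warnings get in the way, one option is to drop the empty pairs yourself before running plain2snt.out. This is a sketch under assumptions: `drop_empty_pairs` is a hypothetical helper, the output file names are up to you, and it assumes the corpus text contains no tab characters (tab is used as the temporary separator).

```shell
# Hypothetical helper: drop every line pair in which either side is empty
# or whitespace-only, writing the surviving lines to two new files.
# Assumes neither input file contains tab characters.
drop_empty_pairs() {
    src=$1; tgt=$2; src_out=$3; tgt_out=$4
    # Make sure both outputs exist (and are empty) even if nothing survives
    : > "$src_out"
    : > "$tgt_out"
    paste "$src" "$tgt" | awk -F '\t' -v s="$src_out" -v t="$tgt_out" '
        # keep the pair only if both sides contain a non-space character
        $1 ~ /[^[:space:]]/ && $2 ~ /[^[:space:]]/ { print $1 > s; print $2 > t }
    '
}

# Usage (output names are illustrative):
# drop_empty_pairs sv-text.txt da-text.txt sv-clean.txt da-clean.txt
```

Filtering both sides together keeps the two files line-aligned, which is the point of doing it on pairs rather than on each file separately.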

Once plain2snt.out has finished, you should have a couple of .snt files and a couple of .vcb files.
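A quick way to confirm the files are there before starting the long steps is sketched below. `check_giza_inputs` is a hypothetical helper; the file names in the usage line are simply the ones passed to GIZA++ later on this page.

```shell
# Hypothetical helper: report any listed file that is missing or empty.
# Returns non-zero if at least one file fails the check.
check_giza_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the file names used on this page:
# check_giza_inputs sv-text.vcb da-text.vcb sv-text_da-text.snt
```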

Next you need to generate word classes, using mkcls:

$ mkcls -m2 -psv-text.txt -c50 -Vsv-text.vcb.classes opt >& mkcls1.log
$ mkcls -m2 -pda-text.txt -c50 -Vda-text.vcb.classes opt >& mkcls2.log

Now use GIZA++ to build your dictionary (-S is the source-language vocabulary file, -T is the target-language vocabulary file, -C is the sentence-aligned corpus file generated by plain2snt.out, and -o is the output file prefix):

$ GIZA++ -S sv-text.vcb -T da-text.vcb -C sv-text_da-text.snt -p0 0.98 -o dictionary >& dictionary.log

Then wait. You can follow progress in dictionary.log, but the training is likely to take upwards of 10 hours, so have something else planned.
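Since the run is this long, you may want it to survive a closed terminal. The sketch below is one way to do that with standard tools; `run_logged` is a hypothetical wrapper, not something GIZA++ provides. It also uses the portable `> file 2>&1` form, because the `>&` redirection shown above is only understood by some shells (csh, bash).

```shell
# Hypothetical wrapper: run a command in the background, immune to hangups,
# with stdout and stderr both written to the given log file.
# Prints the PID of the background job.
run_logged() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}

# Usage with the command from this page:
# run_logged dictionary.log GIZA++ -S sv-text.vcb -T da-text.vcb \
#     -C sv-text_da-text.snt -p0 0.98 -o dictionary
# tail -f dictionary.log    # follow the log; Ctrl-C stops tail, not the job
```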

External links