If you have parallel corpora, you can use GIZA++ to make bilingual dictionaries.

Download your corpora and convert them into one sentence per line.
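How you do the sentence splitting depends entirely on your corpus, but whatever you use, both files must end up with one sentence per line and the same number of lines, because the corpus is paired line by line. A very rough sketch (the <code>sv-raw.txt</code>/<code>da-raw.txt</code> names are just placeholders for whatever you downloaded, and this assumes GNU sed):
<pre>
# naively split on sentence-final punctuation -- a real sentence splitter
# (and proper sentence alignment) is usually needed for anything serious
$ sed 's/\([.!?]\) /\1\n/g' sv-raw.txt > sv-text.txt
$ sed 's/\([.!?]\) /\1\n/g' da-raw.txt > da-text.txt

# sanity check: both files should report the same number of lines
$ wc -l sv-text.txt da-text.txt
</pre>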
Download and compile GIZA++.
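The GIZA++ sources have moved around over the years; one place a copy can be found these days is the moses-smt mirror on GitHub (treat the URL as a suggestion and adjust to wherever you actually get the code from):
<pre>
# fetch the sources and build everything (GIZA++ itself plus mkcls)
$ git clone https://github.com/moses-smt/giza-pp.git
$ cd giza-pp
$ make

# the binaries (GIZA++, plain2snt.out, snt2cooc.out, mkcls) should end up
# in the GIZA++-v2/ and mkcls-v2/ subdirectories -- put them on your PATH
</pre>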
Use <code>plain2snt.out</code> to convert your corpus into GIZA++ format:
<pre>
$ plain2snt.out sv-text.txt da-text.txt
w1:sv-text
w2:da-text
sv-text -> sv-text
da-text -> da-text
</pre>
You may get some warnings about empty sentences like these:
<pre>
WARNING: filtered out empty sentence (source: sv-text.txt 23 target: da-text.txt 0).
WARNING: filtered out empty sentence (source: sv-text.txt 34 target: da-text.txt 0).
</pre>
If it is a large corpus, you may get a lot of warnings.
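The warnings just mean those sentence pairs are filtered out. If you want to see up front how many empty lines you have, a quick check:
<pre>
# count completely empty lines in each half of the corpus
$ grep -c '^$' sv-text.txt
$ grep -c '^$' da-text.txt
</pre>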
After you've done this, you should have a couple of <code>.snt</code> files and a couple of <code>.vcb</code> files. Now use GIZA++ to build your dictionary (<code>-S</code> is the source-language vocabulary file, <code>-T</code> is the target-language vocabulary file, <code>-C</code> is the aligned sentence file that <code>plain2snt.out</code> generated, and <code>-o</code> is the output file prefix):
<pre>
$ GIZA++ -S sv-text.vcb -T da-text.vcb -C sv-text_da-text.snt -p0 0.98 -o dictionary >& dictionary.log
</pre>
and wait... You can watch the log in <code>dictionary.log</code>, but the training is likely to take upwards of 10 hours, so have something else planned.
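Once it finishes, GIZA++ leaves a pile of files starting with the prefix you gave it with <code>-o</code>. The exact set and names depend on the GIZA++ version, so check what you actually got, but from memory the two most interesting ones are the final Viterbi alignments and the word-to-word translation probabilities:
<pre>
# everything GIZA++ produced with the -o prefix
$ ls dictionary.*

# Viterbi word alignments from the final training iteration
$ less dictionary.A3.final

# word-to-word translation probabilities using the actual word forms
$ less dictionary.actual.ti.final
</pre>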