Preparing data for Moses factored training using Apertium
Revision as of 13:05, 21 January 2010 by Francis Tyers
This page describes how to preprocess a parallel corpus with Apertium so that it can be used to train a factored Moses model.
Steps
Download parallel corpus
For example, the Danish-English part of Europarl:
$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz
Clean and tag both sides of corpus
Remove any lines that start with a tag (that is, with '<'):
$ cat da-en/da/* | grep -v '^<' > europarl.da
$ cat da-en/en/* | grep -v '^<' > europarl.en
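The two grep commands above filter each side independently, so the result stays aligned only if the tag lines sit at the same positions in both languages. If they do not, a lock-step filter is one alternative. The following is a minimal sketch, not part of the original recipe: filter_pairs is a hypothetical helper name, and it assumes the two inputs are already line-aligned and that the text contains no tab characters.

```shell
# Hypothetical helper: filter_pairs SRC TGT SRC_OUT TGT_OUT
# Reads SRC and TGT side by side and keeps a sentence pair only when
# neither side starts with '<', so both outputs stay line-aligned.
# Assumption: the corpus text itself contains no tab characters.
filter_pairs() {
    paste "$1" "$2" |
        awk -F '\t' -v a="$3" -v b="$4" \
            '$1 !~ /^</ && $2 !~ /^</ { print $1 > a; print $2 > b }'
}

# Usage (after concatenating each side into one file):
#   filter_pairs raw.da raw.en europarl.da europarl.en
```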
Check that both files have the same number of lines:
$ wc -l europarl.*
1687533 europarl.da
1687533 europarl.en
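The same check can be scripted so that a mismatch stops the pipeline instead of relying on reading the numbers by eye. This is a minimal sketch; same_length is a hypothetical helper name, not something shipped with Apertium or Moses.

```shell
# Hypothetical helper: same_length FILE_A FILE_B
# Returns success only when both files have the same number of lines.
same_length() {
    # $(( ... )) normalises any padding that some wc implementations add.
    left=$(( $(wc -l < "$1") ))
    right=$(( $(wc -l < "$2") ))
    [ "$left" -eq "$right" ]
}

# Usage:
#   same_length europarl.da europarl.en || echo "line counts differ" >&2
```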
Analyse and tag the corpus:
$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
    apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da
$ cat europarl.en | apertium-destxt | lt-proc -w en-ca.automorf.bin | \
    apertium-tagger -g -p en-ca.prob | apertium-retxt > tagged.en
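Since both sides go through the same pipeline and differ only in the language pair that supplies the analyser and tagger data, the two commands can be wrapped in one function. The sketch below reuses exactly the commands and flags shown above; tag_side is a hypothetical helper name, and it assumes the .automorf.bin and .prob files for the given pair are in the current directory.

```shell
# Hypothetical helper: tag_side INPUT PAIR OUTPUT
# Runs the analyse-and-tag pipeline from this page on one side of the corpus.
# Assumption: PAIR.automorf.bin and PAIR.prob exist in the current directory.
tag_side() {
    apertium-destxt < "$1" |
        lt-proc -w "$2.automorf.bin" |
        apertium-tagger -g -p "$2.prob" |
        apertium-retxt > "$3"
}

# Usage:
#   tag_side europarl.da da-sv tagged.da
#   tag_side europarl.en en-ca tagged.en
```

The tagged files should have the same number of lines as their inputs, so the line-count check is worth repeating on tagged.da and tagged.en.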
Convert to Moses factored format
Download the tagger-to-factored script:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py