Preparing data for Moses factored training using Apertium
Revision as of 13:15, 21 January 2010 by Francis Tyers
This page describes how to preprocess a corpus with Apertium so that it can be used to train a factored Moses model.
Steps
Download and compile data
For the parallel corpus we're going to use Europarl:
$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz
For the morphological analysers and taggers, we're going to use apertium-sv-da (for Danish) and apertium-is-en (for English):
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/apertium-sv-da.da.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/da-sv.prob
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/apertium-is-en.en.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/en-is.prob
$ lt-comp lr apertium-sv-da.da.dix da-sv.automorf.bin
final@inconditional 20 105
main@standard 9121 18055
unchecked@standard 4411 8130
$ lt-comp lr apertium-is-en.en.dix en-is.automorf.bin
final@inconditional 97 2809
main@standard 22284 47423
Clean and tag both sides of corpus
Remove any lines that start with markup tags (that is, lines beginning with '<'):
$ cat da-en/da/* | grep -v '^<' > europarl.da
$ cat da-en/en/* | grep -v '^<' > europarl.en
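To see what the filter does, here is a minimal sketch run on a few made-up lines in the style of the raw Europarl files. The sample text and the function name strip_markup are illustrative only; they are not part of the corpus or of Apertium. Because each side is filtered independently, a markup line present on only one side would shift the alignment, which is why the line counts are compared in the next step.

```shell
#!/bin/sh
# strip_markup: drop every line that begins with '<' (hypothetical helper name).
strip_markup() {
    grep -v '^<'
}

# Made-up sample in the style of a raw Europarl file.
printf '%s\n' \
    '<CHAPTER ID=1>' \
    'Resumption of the session' \
    '<SPEAKER ID=1 NAME="President">' \
    'I declare the session resumed.' \
    '<P>' | strip_markup
```

Only the two text lines are printed. A line that merely contains '<' somewhere after its first character is kept.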
Check that the files are the same length.
$ wc -l europarl.*
1687533 europarl.da
1687533 europarl.en
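When the preprocessing is scripted, the same check can be automated instead of read off by eye. A minimal sketch, assuming a POSIX shell; the function name same_length and the two throwaway sample files are hypothetical stand-ins for europarl.da and europarl.en:

```shell
#!/bin/sh
# same_length FILE1 FILE2: succeed only if both files have the same number of lines.
same_length() {
    # tr strips the padding that some wc implementations print before the count.
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Demonstration on two tiny temporary files.
tmp=$(mktemp -d)
printf 'en\nto\n' > "$tmp/sample.da"
printf 'one\ntwo\n' > "$tmp/sample.en"

if same_length "$tmp/sample.da" "$tmp/sample.en"; then
    echo "line counts match"
else
    echo "line counts differ" >&2
fi
rm -rf "$tmp"
```

On the real data this would be called as same_length europarl.da europarl.en, stopping the script if the two sides have drifted apart.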
Analyse and tag the corpus:
$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
  apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da
$ cat europarl.en | apertium-destxt | lt-proc -w en-is.automorf.bin | \
  apertium-tagger -g -p en-is.prob | apertium-retxt > tagged.en
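The tagging pipeline depends on four command-line tools being installed. Since a run over the whole corpus takes a while, it can be worth confirming they are all on the PATH first. A minimal sketch, assuming a POSIX shell; the helper name missing_tools is hypothetical, while the tool names are the ones used in the commands above:

```shell
#!/bin/sh
# missing_tools NAME...: print each command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools apertium-destxt lt-proc apertium-tagger apertium-retxt)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "not found on PATH:" $missing >&2
else
    echo "all pipeline tools found"
fi
```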
Convert to Moses factored format
Download the tagger-to-factored script:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
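Moses reads factored text as space-separated tokens whose factors are joined with '|', while the tagger run with -p writes lexical units of the form ^surface/lemma<tag1><tag2>$. The following is only a rough sketch of that mapping using sed, not a description of what tagger-to-factored.py does: the function name to_factored, the sample line, and the choice to join the tags with dots are all assumptions. Units without tags (such as unknown words) get an empty tag factor, and units containing spaces are left unconverted.

```shell
#!/bin/sh
# to_factored: rewrite Apertium lexical units ^surface/lemma<tag>...$ as surface|lemma|tags.
# The dot-joined tag factor is an assumed layout, not necessarily what
# tagger-to-factored.py emits.
to_factored() {
    sed -E \
        -e 's#\^([^/$ ]+)/([^<$ ]+)((<[^>]+>)*)\$#\1|\2|\3#g' \
        -e 's/></\./g' \
        -e 's/[<>]//g'
}

# Made-up tagged line for illustration.
echo '^the/the<det><def><sp>$ ^session/session<n><sg>$' | to_factored
```

For the sample line this prints the|the|det.def.sp session|session|n.sg, that is, one token per word with surface form, lemma and tags as the three factors.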