Preparing data for Moses factored training using Apertium
This page describes how to preprocess a corpus with Apertium so that it can be used for factored training with Moses.
Steps
Download and compile data
For the parallel corpus we're going to use Europarl:
$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz
And for the morphological analysers and taggers, we're going to use the Danish side of apertium-sv-da and the English side of apertium-is-en:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/apertium-sv-da.da.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/da-sv.prob
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/apertium-is-en.en.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/en-is.prob

$ lt-comp lr apertium-sv-da.da.dix da-sv.automorf.bin
final@inconditional 20 105
main@standard 9121 18055
unchecked@standard 4411 8130

$ lt-comp lr apertium-is-en.en.dix en-is.automorf.bin
final@inconditional 97 2809
main@standard 22284 47423
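As a quick sanity check, you can run a word through each compiled analyser. Each command should print an analysis in the ^surface/lemma&lt;tags&gt;$ format; the exact analyses depend on the dictionaries, so none are shown here:

$ echo "huset" | lt-proc da-sv.automorf.bin
$ echo "house" | lt-proc en-is.automorf.bin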
Clean and tag both sides of corpus
Remove any lines that start with tags:
$ cat da-en/da/* | grep -v '^<' > europarl.da
$ cat da-en/en/* | grep -v '^<' > europarl.en
Check that the files are the same length.
$ wc -l europarl.*
  1687533 europarl.da
  1687533 europarl.en
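If the two counts ever differ, the corpus is no longer sentence-aligned and training will not work. A small guard like the following (just a sketch) makes the check explicit:

$ [ "$(wc -l < europarl.da)" = "$(wc -l < europarl.en)" ] && echo "lengths match" || echo "length mismatch"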
Analyse and tag the corpus:
$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
    apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da
$ cat europarl.en | apertium-destxt | lt-proc -w en-is.automorf.bin | \
    apertium-tagger -g -p en-is.prob | apertium-retxt > tagged.en
Note that the -w option gives dictionary case; that is, the lemma is printed with the case as it appears in the dictionary, not with the case of the superficial form.
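For example, a capitalised surface form such as "Huset" keeps the lower-case lemma from the dictionary when -w is given. The analyses below are hypothetical and truncated; the actual lemma and tags depend on the dictionary:

$ echo "Huset" | lt-proc da-sv.automorf.bin       # lemma takes the surface case
^Huset/Hus<n>...$
$ echo "Huset" | lt-proc -w da-sv.automorf.bin    # lemma keeps the dictionary case
^Huset/hus<n>...$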
Convert to Moses factored format
Download the tagger-to-factored script:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
Then convert to Moses factored format:
$ cat tagged.da | python tagger-to-factored.py 2 > factored.da
$ cat tagged.en | python tagger-to-factored.py 2 > factored.en
Note: the number passed to tagger-to-factored.py specifies how many tags you want to output. Giving 0 outputs only the superficial form and lemma. Giving 1 outputs the superficial form, lemma and first tag (this is almost always the POS tag). Giving a number above 1 outputs the same as before, plus an extra factor containing that many morphological tags joined by full stops; calling the script without a number puts all of the tags in the extra factor, as the examples below show.
For example:
$ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 0 der|der blev|blive ramt|ramme .|. $ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 1 der|der|adv blev|blive|vblex ramt|ramme|vblex .|.|sent $ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 2 der|der|adv|adv blev|blive|vblex|vblex.past ramt|ramme|vblex|vblex.pp .|.|sent|sent $ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py der|der|adv|adv blev|blive|vblex|vblex.past.actv ramt|ramme|vblex|vblex.pp .|.|sent|sent
After this, check again that the files are the same length:
$ wc -l factored.*
Train a factored phrase-based model
Now that you have the data, you can train a factored phrase-based model with Moses!
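The exact invocation depends on your Moses installation and on which factors you want to use. The following is only a sketch: it assumes Moses is installed in ~/mosesdecoder, that europarl.en.lm is a language model you have already built over the English surface forms (a placeholder name), and that you want to translate surface form and POS (factors 0 and 2) while aligning on the lemma (factor 1):

$ ~/mosesdecoder/scripts/training/train-model.perl \
    --root-dir train --corpus factored --f da --e en \
    --lm 0:5:$PWD/europarl.en.lm:0 \
    --alignment-factors 1-1 \
    --translation-factors 0,2-0,2 \
    --external-bin-dir ~/mosesdecoder/tools

See the Moses documentation on factored training for the full set of options, including generation factors and decoding steps.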