Preparing data for Moses factored training using Apertium
Revision as of 13:15, 21 January 2010
This page describes how to preprocess a corpus with Apertium so that it can be used to train a factored Moses model.
Steps
Download and compile data
For the parallel corpus we're going to use Europarl:
$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz
And for the morphological analysers and taggers, we're going to use apertium-sv-da (for Danish) and apertium-is-en (for English):
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/apertium-sv-da.da.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/da-sv.prob

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/apertium-is-en.en.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/en-is.prob

$ lt-comp lr apertium-sv-da.da.dix da-sv.automorf.bin
final@inconditional 20 105
main@standard 9121 18055
unchecked@standard 4411 8130

$ lt-comp lr apertium-is-en.en.dix en-is.automorf.bin
final@inconditional 97 2809
main@standard 22284 47423
Clean and tag both sides of corpus
Remove any lines that start with tags:
$ cat da-en/da/* | grep -v '^<' > europarl.da
$ cat da-en/en/* | grep -v '^<' > europarl.en
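To see what this filter keeps and drops, here is a small sketch run on a few made-up lines. The sample text, the temporary directory and the file names sample.txt and sample.clean are invented for illustration and are not part of the Europarl download.

```shell
# Work in a throwaway directory so nothing in the corpus is touched.
tmp=$(mktemp -d)

# A tiny made-up sample: two markup lines and two lines of text.
printf '%s\n' \
    '<CHAPTER ID=1>' \
    'Resumption of the session' \
    '<SPEAKER ID=1 NAME="President">' \
    'I declare the session resumed.' > "$tmp/sample.txt"

# Same filter as above: drop every line whose first character is '<'.
grep -v '^<' "$tmp/sample.txt" > "$tmp/sample.clean"

# Only the two lines of text remain.
cat "$tmp/sample.clean"
```

Note that the pattern is anchored with ^, so a line that merely contains a < somewhere in the middle is kept.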
Check that both files have the same number of lines:
$ wc -l europarl.*
1687533 europarl.da
1687533 europarl.en
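If you would rather have the check fail loudly than compare the numbers by eye, something like the following sketch could be used. The helper name same_length is hypothetical; it is not part of Apertium or Moses.

```shell
# same_length FILE1 FILE2
# Succeeds (exit status 0) only if both files have the same number of lines.
# Returns 2 if either file cannot be read.
# Note: same_length is a made-up helper name for this sketch.
same_length() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # tr strips the padding some wc implementations put before the number.
    a=$(wc -l < "$1" | tr -d ' ')
    b=$(wc -l < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Example use on the files produced above:
#   same_length europarl.da europarl.en || echo "line counts differ" >&2
```

A parallel corpus is only usable if line N on one side corresponds to line N on the other, so it is worth stopping here if the counts differ.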
Analyse and tag the corpus:
$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
    apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da
$ cat europarl.en | apertium-destxt | lt-proc -w en-is.automorf.bin | \
    apertium-tagger -g -p en-is.prob | apertium-retxt > tagged.en
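Both sides go through the same pipeline, so it can be wrapped in a small function to avoid typing it twice. This is only a sketch: the function name tag_corpus and its argument order are made up here, and it assumes the analyser and tagger files share a prefix (PREFIX.automorf.bin and PREFIX.prob), as they do in the commands above.

```shell
# tag_corpus INPUT PREFIX OUTPUT
#   INPUT  : plain-text corpus, one sentence per line (e.g. europarl.da)
#   PREFIX : name shared by the analyser and tagger files (e.g. da-sv)
#   OUTPUT : file to write the tagged text to (e.g. tagged.da)
# Note: tag_corpus is a hypothetical helper, not an Apertium tool.
tag_corpus() {
    if [ "$#" -ne 3 ]; then
        echo "usage: tag_corpus INPUT PREFIX OUTPUT" >&2
        return 2
    fi
    # Same stages and flags as the commands above:
    # deformat, analyse, tag, reformat.
    apertium-destxt < "$1" \
        | lt-proc -w "$2.automorf.bin" \
        | apertium-tagger -g -p "$2.prob" \
        | apertium-retxt > "$3"
}

# Example use:
#   tag_corpus europarl.da da-sv tagged.da
```

After tagging, it is worth running wc -l on the tagged files too, since the two sides still need to line up.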
Convert to Moses factored format
Download the tagger-to-factored script:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
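To give an idea of what the conversion involves, here is a rough sketch written with sed. It is not the tagger-to-factored.py script and may not match its output. It assumes each tagged token looks like ^surface/lemma<tag><tag>$, and it relies on Moses separating the factors of a token with |. The function name to_factored and the sample line are made up for illustration; unknown words (marked with *) and lemmas containing spaces are not handled.

```shell
# to_factored: read tagger output on stdin and write one
# surface|lemma|tags token for each ^surface/lemma<tags>$ unit.
# Note: to_factored is a hypothetical name; this is not tagger-to-factored.py.
to_factored() {
    # \1 = surface form, \2 = lemma, \3 = the run of <tag> groups.
    sed -e 's#\^\([^/$]*\)/\([^<$]*\)\(\(<[^>]*>\)*\)\$#\1|\2|\3#g'
}

# Made-up sample line in the assumed tagger output format.
printf '%s\n' '^The/the<det><def><sp>$ ^houses/house<n><pl>$' | to_factored
# prints: The|the|<det><def><sp> houses|house|<n><pl>
```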