Difference between revisions of "Preparing data for Moses factored training using Apertium"

Revision as of 13:15, 21 January 2010

This page describes how to preprocess a parallel corpus with Apertium so that it can be used to train factored translation models in Moses.

Steps

Download and compile data

For the parallel corpus we're going to use Europarl:

$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz

For the morphological analysers and taggers we're going to use the Danish data from apertium-sv-da and the English data from apertium-is-en:

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/apertium-sv-da.da.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/da-sv.prob

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/apertium-is-en.en.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/en-is.prob

$ lt-comp lr apertium-sv-da.da.dix da-sv.automorf.bin
final@inconditional 20 105
main@standard 9121 18055
unchecked@standard 4411 8130

$ lt-comp lr apertium-is-en.en.dix en-is.automorf.bin
final@inconditional 97 2809
main@standard 22284 47423

Clean and tag both sides of corpus

Remove any lines that begin with a markup tag (that is, lines starting with <):

$ cat da-en/da/*  | grep -v '^<' > europarl.da
$ cat da-en/en/*  | grep -v '^<' > europarl.en

Check that both files still have the same number of lines, since the corpus is aligned line by line:

$ wc -l europarl.*
  1687533 europarl.da
  1687533 europarl.en
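
Because the two sides are filtered independently, a tag line that appears on only one side would shift the alignment. The check above can be wrapped in a small shell function that fails loudly instead of relying on reading the numbers by eye. This is only a minimal sketch: check_parallel is a hypothetical helper name and is not part of Apertium or Moses.

```shell
# check_parallel FILE_A FILE_B
# Hypothetical helper (not part of Apertium or Moses): succeeds only if
# both files are readable and contain the same number of lines.
check_parallel() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "cannot read $1 or $2" >&2
        return 2
    fi
    # tr strips the padding that some wc implementations add
    lines_a=$(wc -l < "$1" | tr -d ' ')
    lines_b=$(wc -l < "$2" | tr -d ' ')
    if [ "$lines_a" -ne "$lines_b" ]; then
        echo "line count mismatch: $1=$lines_a $2=$lines_b" >&2
        return 1
    fi
    echo "OK: $lines_a lines in each file"
}

# Usage: check_parallel europarl.da europarl.en
```

The same function could be run again on tagged.da and tagged.en after the tagging step, since the tagged files also need to stay line-aligned.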

Analyse and tag the corpus:

$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
  apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da

$ cat europarl.en | apertium-destxt | lt-proc -w en-is.automorf.bin | \
  apertium-tagger -g -p en-is.prob | apertium-retxt > tagged.en

Convert to Moses factored format

Download the tagger-to-factored script:

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
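
To make the target format concrete: Moses factored corpora write each token as its factors joined by |, for example surface|lemma|tags. The following is a rough sketch of that kind of conversion and is not the tagger-to-factored.py script itself. It assumes the tagged files contain units of the form ^surface/lemma<tag1><tag2>$ (with unknown words as ^surface/*surface$). The function name to_factored, the "." used to join tags, and the "unk" label for unknown words are all illustrative choices.

```shell
# to_factored: read tagged text on stdin, write surface|lemma|tags on stdout.
# Hypothetical sketch; assumes units look like ^surface/lemma<tag1><tag2>$.
to_factored() {
    sed -E \
        -e 's#\^([^/$]*)/[*][^$]*\$#\1|\1|unk#g' \
        -e 's#\^([^/$]*)/([^<$]*)((<[^>]*>)*)\$#\1|\2|\3#g' \
        -e 's#><#.#g' \
        -e 's#[|]<([^>]*)>#|\1#g'
}
# Expression 1: unknown words (lemma starts with *) get the assumed tag "unk".
# Expression 2: split each analysed unit into surface, lemma and raw tags.
# Expression 3: join adjacent tags with "." (an assumed separator).
# Expression 4: drop the outer angle brackets around the tag factor.

# Usage: to_factored < tagged.en > factored.en
```

For the input line ^houses/house<n><pl>$ ^foo/*foo$ this prints houses|house|n.pl foo|foo|unk. The sketch leaves joined analyses (units containing +) unconverted, does not split multiword units that contain spaces, and does not escape literal | characters in the text, all of which would need handling before training.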