Preparing data for Moses factored training using Apertium

Revision as of 13:05, 21 January 2010

This page describes how to preprocess a parallel corpus with Apertium so that it can be used to train a factored Moses model.

Steps

Download parallel corpus

For example, to fetch and unpack the Danish-English part of Europarl:

$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz

Clean and tag both sides of corpus

Remove any lines that start with a markup tag (lines beginning with "<"):


$ cat da-en/da/*  | grep -v '^<' > europarl.da
$ cat da-en/en/*  | grep -v '^<' > europarl.en
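
As a small illustration of what this filter does, the sketch below runs the same grep expression on four made-up lines. The sample text and the helper name strip_tag_lines are assumptions for this example only; they are not taken from the corpus or from Apertium.

```shell
# Drop every line that begins with "<"; all other lines pass through unchanged.
strip_tag_lines() {
    grep -v '^<'
}

# Made-up sample: two markup lines and two text lines.
printf '%s\n' \
    '<CHAPTER ID=1>' \
    'Resumption of the session' \
    '<SPEAKER ID=1>' \
    'I declare the session resumed.' | strip_tag_lines
```

Only the two text lines are printed. Because each language side is filtered on its own, a tag line present on only one side would shift the alignment, so the line counts are worth comparing afterwards.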

Check that both files have the same number of lines:


$ wc -l europarl.*
  1687533 europarl.da
  1687533 europarl.en
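
If the check is to be scripted rather than read off by eye, something along these lines may help. This is a minimal sketch; the function name same_line_count is hypothetical and not part of Apertium or Moses.

```shell
# Succeed (exit status 0) only if both files are readable and
# contain the same number of lines.
same_line_count() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Strip whitespace because some wc implementations pad the count.
    a=$(wc -l < "$1" | tr -d '[:space:]')
    b=$(wc -l < "$2" | tr -d '[:space:]')
    [ "$a" -eq "$b" ]
}
```

A possible use would be: same_line_count europarl.da europarl.en || echo "line counts differ" >&2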

Analyse and tag the corpus:

$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
  apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da

$ cat europarl.en | apertium-destxt | lt-proc -w en-ca.automorf.bin | \
  apertium-tagger -g -p en-ca.prob | apertium-retxt > tagged.en

Convert to Moses factored format

Download the tagger-to-factored script:

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
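
The script itself is not reproduced here. As a rough idea of the kind of transformation involved, the sketch below assumes that the tagger output is in the Apertium stream format, ^surface/lemma<tags>$, and rewrites each unit as surface|lemma|tags, with "|" as the factor separator used by Moses. The function name to_factored and the sample line are hypothetical, and tagger-to-factored.py may well behave differently, for instance with unknown words or multiword units.

```shell
# Rewrite each ^surface/lemma<tags>$ unit as surface|lemma|<tags>.
# Assumption: one analysis per unit, as produced by a tagger run that
# keeps surface forms; anything not matching the pattern is left as-is.
to_factored() {
    sed -E 's#\^([^/$]*)/([^<$]*)((<[^>]*>)*)\$#\1|\2|\3#g'
}

# Hypothetical sample line in stream format.
printf '%s\n' '^houses/house<n><pl>$ ^are/be<vbser><pres>$' | to_factored
```

For the sample line this prints houses|house|<n><pl> are|be|<vbser><pres>, that is, one space-separated token per word with three factors each.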