Difference between revisions of "Preparing data for Moses factored training using Apertium"
Revision as of 13:05, 21 January 2010
This page describes how to preprocess a parallel corpus with Apertium so that it can be used to train a factored Moses model.
Steps
Download parallel corpus
For example, Europarl:
$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz
Clean and tag both sides of corpus
Remove any lines that start with tags:
$ cat da-en/da/* | grep -v '^<' > europarl.da
$ cat da-en/en/* | grep -v '^<' > europarl.en
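To see what the filter does before running it on the whole corpus, it can be tried on a few lines. The following is a minimal sketch; the sample file and its contents are made up for illustration and only imitate the kind of markup lines found in the Europarl files.

```shell
# Create a tiny stand-in for one corpus file (illustrative content only).
tmpdir=$(mktemp -d)
printf '%s\n' \
    '<CHAPTER ID=1>' \
    'Resumption of the session' \
    '<SPEAKER ID=1 NAME="President">' \
    'I declare the session resumed.' > "$tmpdir/sample.txt"

# Drop every line that begins with "<", as in the commands above.
grep -v '^<' "$tmpdir/sample.txt" > "$tmpdir/sample.clean"

# Only the two text lines remain.
cat "$tmpdir/sample.clean"
```

Note that the pattern is anchored with `^`, so a `<` that appears in the middle of a sentence does not cause the line to be dropped.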
Check that both files have the same number of lines:
$ wc -l europarl.*
1687533 europarl.da
1687533 europarl.en
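If this check is to be run from a script rather than by eye, it could be wrapped in a small function. This is only a sketch: `check_same_length` is a hypothetical helper name, not part of Apertium or Moses.

```shell
# check_same_length FILE_A FILE_B
# Hypothetical helper: succeeds only if both files have the same line count.
check_same_length() {
    # $(( ... )) strips the padding that some wc implementations add.
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    if [ "$a" -eq "$b" ]; then
        echo "OK: $a lines in each file"
    else
        echo "MISMATCH: $1 has $a lines, $2 has $b lines" >&2
        return 1
    fi
}
```

It would be called as `check_same_length europarl.da europarl.en`. A mismatch means the two sides are no longer aligned sentence by sentence and should be fixed before tagging.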
Analyse and tag the corpus:
$ cat europarl.da | apertium-destxt | lt-proc -w da-sv.automorf.bin | \
    apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da
$ cat europarl.en | apertium-destxt | lt-proc -w en-ca.automorf.bin | \
    apertium-tagger -g -p en-ca.prob | apertium-retxt > tagged.en
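Since the same pipeline is run once per language, it could be wrapped in a function. The sketch below uses exactly the commands and flags shown above; the function name `tag_corpus` and its argument order are assumptions made for this example. It also checks that the tagged file has as many lines as the input, because the two sides of the corpus must stay line-aligned for training.

```shell
# tag_corpus INPUT ANALYSER.bin TAGGER.prob OUTPUT
# Hypothetical wrapper around the pipeline above, not an Apertium command.
tag_corpus() {
    apertium-destxt < "$1" \
        | lt-proc -w "$2" \
        | apertium-tagger -g -p "$3" \
        | apertium-retxt > "$4"
    # Fail if tagging changed the number of lines.
    [ "$(( $(wc -l < "$1") ))" -eq "$(( $(wc -l < "$4") ))" ]
}
```

With the files from this page, the calls would be `tag_corpus europarl.da da-sv.automorf.bin da-sv.prob tagged.da` and `tag_corpus europarl.en en-ca.automorf.bin en-ca.prob tagged.en`.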
Convert to Moses factored format
Download the tagger-to-factored script:
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py
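As background for this step, the tagger writes each token in the Apertium stream format, `^surface/lemma<tag>...$`, while Moses expects the factors of a token to be separated by `|`. The following sketch only illustrates that mapping on a single line. It is not the `tagger-to-factored.py` script, whose exact output may differ, and `to_factored` is a hypothetical name.

```shell
# to_factored: rewrite ^surface/lemma<tags>$ as surface|lemma|<tags>.
# Illustration only; unknown words (marked with "*") and tokens without
# tags are not given any special treatment here.
to_factored() {
    sed -E 's#\^([^/$]*)/([^<$]*)((<[^>]*>)*)\$#\1|\2|\3#g'
}

echo '^Houses/house<n><pl>$ ^are/be<vbser><pres>$' | to_factored
# prints: Houses|house|<n><pl> are|be|<vbser><pres>
```

Moses expects every token to carry the same number of factors, so a real conversion would also need a placeholder for tokens that come out of the tagger without any tags.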