Preparing data for Moses factored training using Apertium

From Apertium
Latest revision as of 07:45, 8 October 2014

This page describes how to preprocess a corpus with Apertium so that it can be used to train a factored model with Moses.

Requirements

* lttoolbox
* apertium
* tagger-to-factored.py script from apertium SVN

Steps

Download and compile data

For the parallel corpus we are going to use Europarl; the corpora page lists some others:

$ wget http://www.statmt.org/europarl/v5/da-en.tgz
$ tar -xzf da-en.tgz

For the morphological analyser and tagger, we are going to use apertium-sv-da and apertium-is-en. Others can be found on the list of language pairs and list of dictionaries pages.

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/apertium-sv-da.da.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sv-da/da-sv.prob

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/apertium-is-en.en.dix
$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-is-en/en-is.prob

Compile each dictionary into a morphological analyser with lt-comp; the lr argument selects the left-to-right (analysis) direction:

$ lt-comp lr apertium-sv-da.da.dix da-sv.automorf.bin
final@inconditional 20 105
main@standard 9121 18055
unchecked@standard 4411 8130

$ lt-comp lr apertium-is-en.en.dix en-is.automorf.bin
final@inconditional 97 2809
main@standard 22284 47423

Clean and tag both sides of the corpus

Remove any lines that start with tags:

$ cat da-en/da/*  | grep -v '^<' > europarl.da
$ cat da-en/en/*  | grep -v '^<' > europarl.en
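
The grep commands above filter each side independently, so a markup line present on only one side would shift the alignment. As a hedged alternative, the sketch below filters both sides in lockstep. It assumes the two inputs are already line-aligned before cleaning; the function name drop_markup_pairs is hypothetical and not part of Apertium or Moses.

```python
def drop_markup_pairs(src_lines, tgt_lines):
    """Keep only line pairs where neither side starts with '<'.

    Assumes src_lines and tgt_lines are aligned line by line.
    """
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        # Drop the whole pair if either side is a markup line,
        # so the two outputs stay the same length.
        if src.startswith("<") or tgt.startswith("<"):
            continue
        kept_src.append(src)
        kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

Dropping the pair rather than a single line trades a little data for guaranteed alignment.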

Check that both files have the same number of lines:

$ wc -l europarl.*
  1687533 europarl.da
  1687533 europarl.en
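
The same check can be scripted so that a later step fails early when the two sides drift apart. This is a minimal sketch; the helper names count_lines and same_length are assumptions made for illustration.

```python
def count_lines(path):
    """Count lines in a file; a final line without a newline still counts."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def same_length(*paths):
    """Return (all_equal, {path: line_count}) for the given files."""
    counts = {path: count_lines(path) for path in paths}
    return len(set(counts.values())) == 1, counts
```

For example, same_length("europarl.da", "europarl.en") returns a boolean together with the per-file counts, which can be printed when the check fails.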

Analyse and tag the corpus:

$ cat europarl.da | apertium-destxt | lt-proc -e -w da-sv.automorf.bin | \
  apertium-tagger -g -p da-sv.prob | apertium-retxt > tagged.da

$ cat europarl.en | apertium-destxt | lt-proc -w en-is.automorf.bin | \
  apertium-tagger -g -p en-is.prob | apertium-retxt > tagged.en

Note that the -w option makes lt-proc use dictionary case: the lemma is printed with the case it has in the dictionary, not the case of the superficial form. The -e option, used here only for Danish, enables dynamic decompounding of unknown words.
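
When several languages need tagging, the pipeline above can be assembled programmatically. The sketch below only builds the shell command string and does not run it. The helper name tagging_pipeline and its parameters are hypothetical; the flags are exactly those used on this page.

```python
import shlex


def tagging_pipeline(corpus, automorf, prob, output, decompound=False):
    """Build the analyse-and-tag shell pipeline as a single string."""
    # -w: dictionary case for lemmas; -e: decompound unknown words.
    lt_proc_flags = "-e -w" if decompound else "-w"
    q = shlex.quote  # protect file names containing spaces etc.
    return (
        f"cat {q(corpus)} | apertium-destxt | "
        f"lt-proc {lt_proc_flags} {q(automorf)} | "
        f"apertium-tagger -g -p {q(prob)} | "
        f"apertium-retxt > {q(output)}"
    )
```

The resulting string can be passed to subprocess.run(command, shell=True, check=True) on a machine where the Apertium tools are installed.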

Convert to Moses factored format

Download the tagger-to-factored script:

$ wget http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/tagger-to-factored.py

Then convert to Moses factored format:

$ cat tagged.da | python tagger-to-factored.py 2 > factored.da
$ cat tagged.en | python tagger-to-factored.py 2 > factored.en

Note: The number option to tagger-to-factored.py specifies how many tags you want to output. A value of 0 outputs only the superficial form and the lemma. A value of 1 outputs the superficial form, the lemma and the first tag (this is almost always the POS tag). Any higher value outputs those three factors plus a fourth factor that holds the morphological information: the first N tags joined with dots. If the number is omitted, the fourth factor contains all the tags.

For example:

$ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 0
der|der blev|blive ramt|ramme .|. 

$ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 1
der|der|adv blev|blive|vblex ramt|ramme|vblex .|.|sent 

$ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 2
der|der|adv|adv blev|blive|vblex|vblex.past ramt|ramme|vblex|vblex.pp .|.|sent|sent 

$ echo "^der/der<adv>$ ^blev/blive<vblex><past><actv>$ ^ramt/ramme<vblex><pp>$^./.<sent>$" | python tagger-to-factored.py 
der|der|adv|adv blev|blive|vblex|vblex.past.actv ramt|ramme|vblex|vblex.pp .|.|sent|sent 
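
The behaviour shown in these examples can be sketched in Python as follows. This is not the actual tagger-to-factored.py script, only an illustrative reimplementation inferred from the examples above. The function name to_factored, the regular expressions and the unk placeholder for words without tags are all assumptions, and escaped characters in the Apertium stream are not handled.

```python
import re

# One lexical unit in tagger output: ^surface/lemma<tag1><tag2>...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^<$]*)((?:<[^>]+>)*)\$")
TAG = re.compile(r"<([^>]+)>")


def to_factored(line, max_tags=None, unknown_tag="unk"):
    """Convert one line of tagger output to Moses factored tokens.

    max_tags mirrors the number option: 0 gives form|lemma, 1 adds the
    first tag, higher values add a fourth factor with the first max_tags
    tags joined by dots, and None puts all tags in that fourth factor.
    """
    tokens = []
    for surface, lemma, tag_string in LEXICAL_UNIT.findall(line):
        # Assumption: untagged (unknown) words get a placeholder tag,
        # because a factor must not be empty.
        tags = TAG.findall(tag_string) or [unknown_tag]
        factors = [surface, lemma]
        if max_tags is None or max_tags >= 1:
            factors.append(tags[0])
        if max_tags is None or max_tags >= 2:
            kept = tags if max_tags is None else tags[:max_tags]
            factors.append(".".join(kept))
        tokens.append("|".join(factors))
    return " ".join(tokens)
```

Calling to_factored on the example sentence with max_tags set to 0, 1, 2 and None yields the four output lines shown above, without the trailing space.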

After this, check again that the files are the same length:

$ wc -l factored.*
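
Besides the line counts, it can be worth checking that every token carries the same number of non-empty factors, since Moses expects a consistent factor count per token. A minimal sketch, assuming a hypothetical helper named check_factored_line:

```python
def check_factored_line(line, n_factors):
    """Return the tokens that lack exactly n_factors non-empty factors."""
    bad = []
    for token in line.split():
        factors = token.split("|")
        # Flag tokens with too few or too many factors, or an empty one
        # (for example when a superficial form itself contains '|').
        if len(factors) != n_factors or any(f == "" for f in factors):
            bad.append(token)
    return bad
```

With the option 2 used above, each token should have four factors, so check_factored_line(line, 4) should return an empty list for every line of factored.da and factored.en.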

Train a factored phrase-based model

Now that you have the data, you can train a factored phrase-based model with Moses by following the factored tutorial at http://www.statmt.org/moses/?n=Moses.FactoredTutorial.

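See also

* List of dictionaries
* List of language pairs
* Compiling dictionaries
* lttoolbox

External links

* Tutorial for Using Factored Models: http://www.statmt.org/moses/?n=Moses.FactoredTutorial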