Difference between revisions of "Generating lexical-selection rules from a parallel corpus"

Revision as of 11:59, 17 January 2012

You will need

Here is a list of software that you will need installed:

Giza++ (or some other word aligner)
Moses (for making Giza++ less human hostile)
All the Moses scripts
lttoolbox
Apertium
apertium-lex-tools

Furthermore you'll need:

an Apertium language pair
a parallel corpus

Getting started

This example uses the Europarl corpus and the English to Spanish pair in Apertium.

Assuming everything listed above is installed, the steps are as follows:

Prepare corpus

To generate the rules, we need three files:

The tagged source corpus
The tagged target corpus
The output of the lexical transfer module in the source→target direction

These three files should be sentence aligned.
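A quick way to sanity-check sentence alignment is to confirm that the files have the same number of lines. The following is a minimal sketch, not part of Apertium or Moses: the helper name check_aligned and the file names in the example call are placeholders. Matching line counts are a necessary condition only; they do not prove that line N in one file corresponds to line N in the others.

```shell
# Hypothetical helper (not part of Apertium or Moses): succeeds only if every
# file given as an argument has the same number of lines.
check_aligned() {
    expected=""
    for f in "$@"; do
        # Count lines with awk so the result has no padding to strip.
        n=$(awk 'END { print NR }' "$f")
        echo "$f: $n lines"
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" -ne "$expected" ]; then
            echo "line count mismatch in $f" >&2
            return 1
        fi
    done
}

# Example call; the file names are placeholders for your own three files.
# check_aligned source.tagged target.tagged source.lextransfer
```

If the function returns a non-zero status, at least one file has drifted out of step and should be regenerated from the cleaned corpus before continuing.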

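The first step is to clean the corpus, removing sentence pairs in which either side is empty or longer than 40 tokens (the final two arguments, 1 and 40, are the minimum and maximum sentence lengths).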

$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl-v6.es-en es en europarl.clean 1 40
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40
.....

(Replace the path /home/fran/local/bin/scripts-20120109-1229/training/ with the directory where your copy of the Moses scripts is installed.)
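After cleaning, it can be worth confirming that the length cutoff was applied. The sketch below assumes the cleaning script wrote its output to europarl.clean.es and europarl.clean.en (the output prefix followed by each language code); the helper name max_tokens is hypothetical. It counts whitespace-separated tokens, which may differ slightly from how the Moses script counts them.

```shell
# Hypothetical helper: print the largest number of whitespace-separated tokens
# found on any single line of the given file (0 for an empty file).
max_tokens() {
    awk 'NF > max { max = NF } END { print max + 0 }' "$1"
}

# Example calls, assuming the output prefix from the command above.
# Each should print a number no larger than 40.
# max_tokens europarl.clean.es
# max_tokens europarl.clean.en
```

A value above the cutoff suggests the wrong file is being inspected or the cleaning step did not run as intended.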

Align corpus

Extract sentences

Generate rules

