Generating lexical-selection rules from a parallel corpus

You will need

Here is a list of software that you will need installed:

Giza++ (or some other word aligner)
Moses (for making Giza++ less human hostile)
All the Moses scripts
lttoolbox
Apertium
apertium-lex-tools
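
Before starting, it can save time to check that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of any of the packages above: `missing_tools` is a hypothetical helper, and the binary names passed to it are assumptions based on typical installs (the Moses scripts are normally called by full path, so they are not checked here).

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper written for this page.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names; adjust them to match your installation.
missing_tools GIZA++ lt-proc apertium apertium-destxt lrx-comp lrx-proc
```

If the last command prints nothing, everything it looked for was found.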

Furthermore you'll need:

an Apertium language pair
a parallel corpus

Getting started

We're going to work through an example using Europarl and the English to Spanish pair in Apertium.

Assuming that you've got everything above installed, the steps are as follows:

Prepare corpus

To generate the rules, we need three files:

The tagged source corpus
The tagged target corpus
The output of the lexical transfer module in the source→target direction

These three files should be sentence aligned.
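
A cheap sanity check for this is that all the files have the same number of lines. It is only a necessary condition (equal counts do not prove that line N in one file matches line N in another), but it catches the common case where a tool dropped or split a line. The sketch below is an illustration on throwaway files; `same_line_count` is a hypothetical helper, not an Apertium tool.

```shell
# Succeed only if every file given has the same number of lines.
# (same_line_count is a hypothetical helper written for this page.)
same_line_count() {
    expected=$(wc -l < "$1")
    for f in "$@"; do
        [ "$(wc -l < "$f")" -eq "$expected" ] || return 1
    done
}

# Toy demonstration: two files of two lines each, and one that is too short.
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/src"
printf 'x\ny\n' > "$demo/trg"
printf 'only one\n' > "$demo/short"
same_line_count "$demo/src" "$demo/trg" && echo "line counts match"
same_line_count "$demo/src" "$demo/short" || echo "line counts differ"
```

On the real data you would pass the two tagged files and the lexical transfer output instead of the toy files.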

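The first thing that you need to do is clean the corpus, to remove empty and overly long sentences. Here we keep only sentence pairs where both sides are between 1 and 40 tokens long: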

$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl-v6.es-en es en europarl.clean 1 40
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40
..........(100000)...

Input sentences: 1786594  Output sentences:  1467708

(Replace the path /home/fran/local/bin/scripts-20120109-1229/training/ with the path to where you put the Moses scripts)
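
To make the meaning of the two numbers at the end of the command concrete, here is a rough sketch of a length filter over a pair of sentence-aligned files. It is not the Moses script and should not replace it: `filter_pairs` is a hypothetical helper, it assumes that the sentences contain no tab characters, and the real script may apply further checks.

```shell
# Keep only sentence pairs where both sides have between MIN and MAX tokens.
# Usage: filter_pairs SRC TRG OUT_SRC OUT_TRG MIN MAX
# (hypothetical helper; assumes no tab characters inside sentences)
filter_pairs() {
    : > "$3"
    : > "$4"
    paste "$1" "$2" | awk -F '\t' -v min="$5" -v max="$6" -v os="$3" -v ot="$4" '
        {
            ns = split($1, a, " ")   # whitespace-separated tokens, source side
            nt = split($2, b, " ")   # whitespace-separated tokens, target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) {
                print $1 > os
                print $2 > ot
            }
        }'
}

# Toy demonstration with a cutoff of 1-3 tokens.
demo=$(mktemp -d)
printf 'one two three\n\nshort\n' > "$demo/c.en"
printf 'uno dos tres\nvacio\nuna frase bastante larga aqui\n' > "$demo/c.es"
filter_pairs "$demo/c.en" "$demo/c.es" "$demo/out.en" "$demo/out.es" 1 3
cat "$demo/out.en" "$demo/out.es"
```

Only the first pair survives: the second has an empty English side and the third has a Spanish side that is too long. Because pairs are dropped together, the two output files stay sentence aligned.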

We're going to set aside the last 67,708 sentence pairs for testing, which leaves 1,400,000 for training (also because Giza++ segfaults somewhere around there).

$ mkdir testing
$ tail -n 67708 europarl.clean.en > testing/europarl.clean.67708.en
$ tail -n 67708 europarl.clean.es > testing/europarl.clean.67708.es

$ head -n 1400000 europarl.clean.en > europarl.clean.en~
$ head -n 1400000 europarl.clean.es > europarl.clean.es~
$ mv europarl.clean.en~ europarl.clean.en
$ mv europarl.clean.es~ europarl.clean.es

$ wc -l europarl.clean.e*
  1400000 europarl.clean.en
  1400000 europarl.clean.es
  2800000 total
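
If your corpus is a different size, the same split can be done without hard-coding the numbers. The sketch below is one way to do it, shown on a toy file; `split_corpus` is a hypothetical helper, and you need to call it with the same test size on both language sides so that the two halves stay sentence aligned.

```shell
# Split FILE into a training part (the first lines) and a test part
# (the last N_TEST lines).
# Usage: split_corpus FILE N_TEST TRAIN_OUT TEST_OUT
# (split_corpus is a hypothetical helper written for this page.)
split_corpus() {
    total=$(wc -l < "$1")
    n_train=$((total - $2))
    head -n "$n_train" "$1" > "$3"
    tail -n "$2" "$1" > "$4"
}

# Toy demonstration: five sentences, the last two held out for testing.
demo=$(mktemp -d)
printf '%s\n' s1 s2 s3 s4 s5 > "$demo/corpus.en"
split_corpus "$demo/corpus.en" 2 "$demo/train.en" "$demo/test.en"
wc -l "$demo/train.en" "$demo/test.en"
```

Unlike the commands above, this writes the training part to a new file rather than overwriting the cleaned corpus.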

The next thing that we need to do is tag both sides of the corpus (replace /home/fran/source/apertium-en-es with the directory where your language pair lives):

$ nohup cat europarl.clean.en | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
$ nohup cat europarl.clean.es | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &

Align corpus

Extract sentences

Generate rules
