Generating lexical-selection rules from a parallel corpus

From Apertium

If you have a parallel corpus, you can use it to generate lexical selection rules, which improve the translation of words that have more than one possible translation.

You will need

Here is a list of software that you will need installed:

  • Giza++ (or some other word aligner)
  • Moses (for making Giza++ less human hostile)
  • All the Moses scripts
  • lttoolbox
  • Apertium
  • apertium-lex-tools

Furthermore you'll need:

  • an Apertium language pair
  • a parallel corpus

Getting started

We're going to do the example with Europarl and the English to Spanish pair in Apertium.

Assuming you have everything listed above installed, the steps are as follows:

Prepare corpus

To generate the rules, we need three files:

  • The tagged source corpus
  • The tagged target corpus
  • The output of the lexical transfer module in the source→target direction

These three files should be sentence aligned.
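
One quick sanity check for sentence alignment is that all the files have the same number of lines. A minimal sketch follows; the function name same_line_count is made up for illustration, and equal line counts are only a necessary condition, not proof that the sentences correspond.

```shell
# same_line_count FILE...
# Hypothetical helper: succeeds only if every file given has the
# same number of lines as the first one.
same_line_count() {
    first=$(wc -l < "$1" | tr -d ' ')
    for f in "$@"; do
        n=$(wc -l < "$f" | tr -d ' ')
        [ "$n" -eq "$first" ] || return 1
    done
}
```

You would call it with the three files once they exist, for example `same_line_count tagged.src tagged.trg biltrans.out && echo ok` (these file names are placeholders).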

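The first thing that you need to do is clean the corpus, to remove sentence pairs where either side is empty or longer than 40 tokens (the 1 and 40 at the end of the command are the minimum and maximum lengths).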

$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl europarl-v6.es-en es en europarl.clean 1 40
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40
..........(100000)...

Input sentences: 1786594  Output sentences:  1467708

(Replace the path /home/fran/local/bin/scripts-20120109-1229/training/ with the path to where you put the Moses scripts)
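
If you are curious what the length cutoff does, here is a rough sketch of the same idea using paste and awk. It is not the Moses script: the function name filter_by_length is made up, the argument order merely imitates the command above, and it leaves out the other checks the real script performs. It also assumes that the corpus lines contain no tab characters.

```shell
# filter_by_length CORPUS L1 L2 OUT MIN MAX
# Hypothetical helper: reads CORPUS.L1 and CORPUS.L2 line by line and
# keeps a pair only if both sides have between MIN and MAX
# whitespace-separated tokens. Writes OUT.L1 and OUT.L2.
filter_by_length() {
    corpus=$1 l1=$2 l2=$3 out=$4 min=$5 max=$6
    paste "$corpus.$l1" "$corpus.$l2" |
    awk -F '\t' -v min="$min" -v max="$max" \
        -v out1="$out.$l1" -v out2="$out.$l2" '{
        n1 = split($1, a, " ")   # token count, first language
        n2 = split($2, b, " ")   # token count, second language
        if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
            print $1 > out1
            print $2 > out2
        }
    }'
}
```

For real work, stick with clean-corpus-n.perl; the sketch is only meant to show why the output has fewer sentences than the input.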

We're going to cut off the last 67,708 lines for testing, leaving 1,400,000 lines for training (also because Giza++ segfaults somewhere around there).

$ mkdir testing
$ tail -n 67708 europarl.clean.en > testing/europarl.clean.67708.en
$ tail -n 67708 europarl.clean.es > testing/europarl.clean.67708.es
$ head -n 1400000 europarl.clean.en > europarl.clean.en~
$ head -n 1400000 europarl.clean.es > europarl.clean.es~
$ mv europarl.clean.en~ europarl.clean.en
$ mv europarl.clean.es~ europarl.clean.es
$ wc -l europarl.clean.e*
  1400000 europarl.clean.en
  1400000 europarl.clean.es
  2800000 total
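
The split above hard-codes the line counts for this particular corpus. If your corpus is a different size, the same head/tail split can be written with the training size as a parameter. This is only a sketch: the function name split_corpus and the europarl.train.* file names in the usage comment are made up for illustration, and unlike the commands above it writes the training part to a new file instead of overwriting the input.

```shell
# split_corpus FILE TRAIN_LINES TRAIN_OUT TEST_OUT
# Hypothetical helper: the first TRAIN_LINES lines of FILE go to
# TRAIN_OUT, everything after them goes to TEST_OUT.
split_corpus() {
    head -n "$2" "$1" > "$3"
    tail -n +"$(($2 + 1))" "$1" > "$4"
}

# Possible usage for the corpus in this example:
#   for lang in en es; do
#       split_corpus europarl.clean.$lang 1400000 \
#           europarl.train.$lang testing/europarl.clean.67708.$lang
#   done
```

Using the same TRAIN_LINES value for both languages keeps the two sides sentence aligned.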

The next thing that we need to do is tag both sides of the corpus (replace /home/fran/source/apertium-en-es with the path to your copy of the language pair):

$ nohup cat europarl.clean.en | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
$ nohup cat europarl.clean.es | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &

Align corpus

Extract sentences

Generate rules