Generating lexical-selection rules from a parallel corpus

If you have a parallel corpus, you can use it to generate lexical selection rules, which improve the translation of words that have more than one possible translation.

You will need

Here is a list of software that you will need installed:

  • Giza++ (or some other word aligner)
  • Moses (for making Giza++ less human hostile)
  • All the Moses scripts
  • lttoolbox
  • Apertium
  • apertium-lex-tools

Furthermore, you'll need:

  • an Apertium language pair
  • a parallel corpus

Getting started

We're going to do the example with Europarl and the English to Spanish pair in Apertium.

Assuming you've got everything installed, the steps are as follows:

Prepare corpus

To generate the rules, we need three files:

  • The tagged and tokenised source corpus
  • The tagged and tokenised target corpus
  • The output of the lexical transfer module in the source→target direction, tokenised

These three files should be sentence aligned.

The first thing that you need to do is clean the corpus, to remove long sentences.

$ perl /home/fran/local/bin/scripts-20120109-1229/training/clean-corpus-n.perl es en europarl.clean 1 40
clean-corpus.perl: processing & .en to europarl.clean, cutoff 1-40

Input sentences: 1786594  Output sentences:  1467708

(Replace the path /home/fran/local/bin/scripts-20120109-1229/training/ with the path to where you put the Moses scripts)
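
To make the cleaning step concrete, here is a minimal Python sketch of the length filter, assuming whitespace tokenisation. The name `clean_pairs` is made up for illustration; it is not part of Moses, and the real clean-corpus-n.perl does more checks than this.

```python
def clean_pairs(src_lines, tgt_lines, min_len=1, max_len=40):
    """Keep sentence pairs where both sides have min_len..max_len tokens.

    Illustrative only: tokens are counted by splitting on whitespace,
    and a pair is dropped as a whole so the two sides stay aligned.
    """
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

The point to notice is that a pair is removed whenever either side falls outside the cutoff, which is why the output still has the same number of lines on both sides.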

The next thing that we need to do is tag both sides of the corpus:

$ nohup cat europarl.clean.en | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
$ nohup cat | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > &

Then we need to remove the lines that have no analyses, while keeping track of which lines of the original corpus we have kept.

$ seq 1 1467708 > europarl.lines
$ paste europarl.lines europarl.tagged.en | grep '<' | cut -f1 >
$ paste europarl.lines europarl.tagged.en | grep '<' | cut -f2 >
$ paste europarl.lines europarl.tagged.en | grep '<' | cut -f3 >
$ mv europarl.lines
$ mv europarl.tagged.en
$ mv
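
The idea behind the paste / grep / cut commands above can be sketched in Python as follows. This is only an illustration, assuming that each pasted row holds the line number plus both tagged sides; `keep_analysed` is a hypothetical name, not a script from apertium-lex-tools.

```python
def keep_analysed(src_tagged, tgt_tagged):
    """Return (line_numbers, src, tgt) for the rows that contain a '<'.

    Line numbers are 1-based, like the output of `seq`, and refer to
    positions in the cleaned corpus.
    """
    numbers, src_kept, tgt_kept = [], [], []
    rows = zip(src_tagged, tgt_tagged)
    for number, (src, tgt) in enumerate(rows, start=1):
        # Like grep '<' on the pasted row: a tag anywhere in the row
        # is enough for the whole row to be kept.
        if "<" in src or "<" in tgt:
            numbers.append(number)
            src_kept.append(src)
            tgt_kept.append(tgt)
    return numbers, src_kept, tgt_kept
```

Because all three columns are filtered together, the kept line numbers always match the kept sentences one to one.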

Then run the English side through the lexical transfer:

$ nohup cat europarl.tagged.en | lt-proc -b ~/source/apertium-en-es/en-es.autobil.bin > europarl.biltrans.en-es &

We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).

$ mkdir testing
$ tail -67658 europarl.lines > testing/europarl.67658.lines
$ tail -67658 europarl.tagged.en > testing/europarl.tagged.67658.en
$ tail -67658 > testing/
$ head -1400000 europarl.lines >
$ head -1400000 europarl.tagged.en >
$ head -1400000 >
$ head -1400000 europarl.biltrans.en-es >
$ mv europarl.lines
$ mv europarl.tagged.en
$ mv
$ mv europarl.biltrans.en-es
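
For reference, the split itself amounts to something like the following Python sketch. `split_holdout` is a made-up helper, and it assumes the training part is simply everything before the held-out lines, whereas the commands above take a fixed 1,400,000 lines with head.

```python
def split_holdout(lines, n_test):
    """Split lines into (training, testing), testing being the last n_test."""
    if not 0 < n_test < len(lines):
        raise ValueError("n_test must be between 1 and len(lines) - 1")
    return lines[:-n_test], lines[-n_test:]
```

The same split has to be applied to every file, with the same number of lines, or the files will no longer be sentence aligned.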

These files are:

  • europarl.lines: The list of lines included in the corpus from the original cleaned corpus.
  • europarl.tagged.en: The tagged source language side of the corpus
  • The tagged target language side of the corpus
  • europarl.biltrans.en-es: The output of the lexical transfer SL→TL

Check that they have the same length:

$ wc -l europarl.*
   1400000 europarl.biltrans.en-es
   1400000 europarl.lines
   1400000 europarl.tagged.en
   5600000 total
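
If you would rather have this check in a script than read the wc output by eye, a minimal Python sketch could look like this. The function names are made up for illustration; note that it counts lines by iterating over the file, so a final line without a trailing newline is counted here but not by wc -l.

```python
def line_counts(paths):
    """Map each path to the number of lines in that file."""
    counts = {}
    for path in paths:
        # Binary mode avoids any encoding problems in the corpus.
        with open(path, "rb") as handle:
            counts[path] = sum(1 for _ in handle)
    return counts


def same_length(paths):
    """True if all the given files have the same number of lines."""
    return len(set(line_counts(paths).values())) == 1
```

You would call `same_length` with the list of corpus files before moving on to tokenisation.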

The next step is to tokenise these into a format appropriate for Moses. We can also do some tag replacements at the same time. There are a couple of scripts in the apertium-lex-tools folder that will do this. Note: If you are not working with Spanish and English, you will need to edit the script to include a translation table of all of the tag combinations that are found in your corpus.

$ nohup cat europarl.tagged.en | python ~/source/apertium-lex-tools/scripts/ en > europarl.tag-tok.en &
$ nohup cat | python ~/source/apertium-lex-tools/scripts/ es >
$ nohup cat europarl.biltrans.en-es | python ~/source/apertium-lex-tools/scripts/ > europarl.biltrans-tok.en-es &

Align corpus

Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:

$ nohup perl ~/local/bin/scripts-20120109-1229/training/train-model.perl -scripts-root-dir \
 /home/fran/local/bin/scripts-20120109-1229/ -root-dir . -corpus europarl.tag-tok \
 -f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

You'll need an LM file, but you can copy one from a previous Moses installation. If you don't have one, create a file and put a few words in it. We won't be using the LM anyway.

This takes a while, probably about a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract sentences

Generate rules