Generating lexical-selection rules from a parallel corpus

You will need

Here is a list of software that you will need installed:

Giza++ (or some other word aligner)
Moses (for making Giza++ less human hostile)
All the Moses scripts
lttoolbox
Apertium
apertium-lex-tools

Furthermore you'll need:

an Apertium language pair
a parallel corpus (see Corpora)

Installing prerequisites

See Minimal installation from SVN for apertium/lttoolbox.

See Constraint-based lexical selection module for apertium-lex-tools.

For Giza++, the Moses decoder and the Moses scripts you can do the following:

$ mkdir ~/smt
$ cd ~/smt
$ mkdir local # our "install prefix"
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
$ tar xzvf giza-pp-v1.0.7.tar.gz
$ cd giza-pp
$ make
$ mkdir ../local/bin
$ cp GIZA++-v2/snt2cooc.out ../local/bin/
$ cp GIZA++-v2/snt2plain.out ../local/bin/
$ cp GIZA++-v2/GIZA++ ../local/bin/
$ cp mkcls-v2/mkcls ../local/bin/
$ cd .. # back to ~/smt, so that Moses ends up in ~/smt/mosesdecoder
$ git clone https://github.com/moses-smt/mosesdecoder
$ cd mosesdecoder/
$ ./bjam

The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/ (for example ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl). See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.

Getting started

We're going to do the example with EuroParl and the English to Spanish pair in Apertium.

Given that you've got all the stuff installed, the work will be as follows:

Prepare corpus

To generate the rules, we need three files:

The tagged and tokenised source corpus
The tagged and tokenised target corpus
The output of the lexical transfer module in the source→target direction, tokenised

These three files should be sentence aligned.

The first thing to do is clean the corpus to remove long sentences. Make sure you are in the directory that contains your EuroParl corpus.

$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl-v7.es-en es en europarl.clean 1 40
clean-corpus.perl: processing europarl-v7.es-en.es & .en to europarl.clean, cutoff 1-40
..........(100000)...

Input sentences: 1786594  Output sentences:  1467708

The next thing that we need to do is tag both sides of the corpus:

$ nohup cat europarl.clean.en | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &
$ nohup cat europarl.clean.es | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &

Then we need to remove the lines that have no analyses, while keeping track of which lines of the original corpus we have kept.

$ seq 1 1467708 > europarl.lines
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new
$ mv europarl.lines.new europarl.lines
$ mv europarl.tagged.en.new europarl.tagged.en
$ mv europarl.tagged.es.new europarl.tagged.es
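
The three paste/grep/cut passes above can also be done in a single pass. The following is a minimal sketch, not part of apertium-lex-tools: the function name filter_analysed and the sample lines are made up for illustration. It mirrors the grep '<' test, so a sentence pair is kept when either side contains at least one tag.

```python
def filter_analysed(en_lines, es_lines):
    """Keep sentence pairs where at least one side contains a tag ('<').

    Returns three parallel lists: the 1-based line numbers that were kept
    (the role played by europarl.lines), the English lines and the
    Spanish lines.
    """
    kept_numbers, kept_en, kept_es = [], [], []
    for number, (en, es) in enumerate(zip(en_lines, es_lines), start=1):
        # Same test as `paste ... | grep '<'`: a '<' anywhere in the pair.
        if "<" in en or "<" in es:
            kept_numbers.append(number)
            kept_en.append(en)
            kept_es.append(es)
    return kept_numbers, kept_en, kept_es


if __name__ == "__main__":
    # Illustrative data only; the second pair has no analyses at all.
    en = ["^the<det><def>$ ^house<n><sg>$", "", "^yes<adv>$"]
    es = ["^el<det><def>$ ^casa<n><sg>$", "", "^sí<adv>$"]
    numbers, en_kept, es_kept = filter_analysed(en, es)
    print(numbers)  # prints [1, 3]
```

Doing all three columns in one loop avoids any chance of the three output files drifting out of step with each other.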

Then run the English side through the lexical transfer:

$ nohup cat europarl.tagged.en | lt-proc -b ~/source/apertium-en-es/en-es.autobil.bin > europarl.biltrans.en-es &

We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).

$ mkdir testing
$ tail -67658 europarl.lines > testing/europarl.67658.lines
$ tail -67658 europarl.tagged.en > testing/europarl.tagged.67658.en
$ tail -67658 europarl.tagged.es > testing/europarl.tagged.67658.es

$ head -1400000 europarl.lines > europarl.lines.new
$ head -1400000 europarl.tagged.en > europarl.tagged.en.new
$ head -1400000 europarl.tagged.es > europarl.tagged.es.new
$ head -1400000 europarl.biltrans.en-es > europarl.biltrans.en-es.new
$ mv europarl.lines.new europarl.lines
$ mv europarl.tagged.en.new europarl.tagged.en
$ mv europarl.tagged.es.new europarl.tagged.es
$ mv europarl.biltrans.en-es.new europarl.biltrans.en-es

These files are:

europarl.lines: The list of lines included in the corpus from the original cleaned corpus.
europarl.tagged.en: The tagged source language side of the corpus
europarl.tagged.es: The tagged target language side of the corpus
europarl.biltrans.en-es: The output of the lexical transfer SL→TL

Check that they have the same length:

$ wc -l europarl.*
   1400000 europarl.biltrans.en-es
   1400000 europarl.lines
   1400000 europarl.tagged.en
   1400000 europarl.tagged.es
   5600000 total
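
If you prefer a check that fails loudly instead of eyeballing the wc output, something like the following sketch would do. The function names count_lines and check_same_length are hypothetical, not part of any Apertium tool.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file in binary mode.

    Unlike `wc -l`, a final line without a trailing newline is counted too.
    """
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_same_length(paths):
    """Return {path: line count}; raise ValueError if the counts differ."""
    counts = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"line counts differ: {counts}")
    return counts


if __name__ == "__main__":
    # Small self-contained demo on two temporary files.
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "a.txt")
        second = os.path.join(tmp, "b.txt")
        for path in (first, second):
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("one\ntwo\n")
        print(check_same_length([first, second]))
```

For the corpus in this walkthrough you would pass the four file names europarl.lines, europarl.tagged.en, europarl.tagged.es and europarl.biltrans.en-es.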

The next step is to tokenise these into a format appropriate for Moses. We also do some tag trimming here so that we can use the correct tags when generating lexical rules and bidix entries. For this we can use process-tagger-output from the apertium-lex-tools directory.

$ nohup cat europarl.tagged.en | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/en-es.autobil.bin -p -t > europarl.tag-tok.en &
$ nohup cat europarl.tagged.es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/es-en.autobil.bin -p -t > europarl.tag-tok.es &
$ nohup cat europarl.biltrans.en-es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/es-en.autobil.bin -b -t > europarl.biltrans-tok.en-es &

Align corpus

Now that the corpus files are ready, we can align the corpus using the Moses scripts:

$ nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \
 ~/smt/local/bin -corpus europarl.tag-tok \
 -f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.

This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract sentences

The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:

$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es

Then we want to make sure again that our file has the right number of lines:

$ wc -l europarl.phrasetable.en-es
1400000 europarl.phrasetable.en-es

Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:

$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \
  > europarl.candidates.en-es

These are basically sentences that we can hope that Apertium might be able to generate.

Extract frequency lexicon

The next step is to extract the frequency lexicon.

$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es

This file should look like:

$ cat europarl.lex.en-es  | head 
31381 union<n> unión<n> @
101 union<n> sindicato<n>
1 union<n> situación<n>
1 union<n> monetario<adj>
4 slope<n> pendiente<n> @
1 slope<n> ladera<n>

Where the highest frequency translation is marked with an @.

Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.
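
To use the frequency lexicon from your own scripts, it helps to read it into a dictionary. The following is a minimal sketch that assumes the format shown above: whitespace-separated fields, an integer count first, and an optional @ in the fourth field. The function name parse_freq_lexicon is hypothetical.

```python
def parse_freq_lexicon(lines):
    """Map each source word to its default translation and all counts.

    The result has the shape
    {source: {"default": target or None, "counts": {target: count}}}.
    """
    lexicon = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        count, source, target = int(fields[0]), fields[1], fields[2]
        entry = lexicon.setdefault(source, {"default": None, "counts": {}})
        entry["counts"][target] = count
        # The most frequent translation is marked with a trailing '@'.
        if len(fields) > 3 and fields[3] == "@":
            entry["default"] = target
    return lexicon


if __name__ == "__main__":
    sample = [
        "31381 union<n> unión<n> @",
        "101 union<n> sindicato<n>",
        "4 slope<n> pendiente<n> @",
        "1 slope<n> ladera<n>",
    ]
    lexicon = parse_freq_lexicon(sample)
    print(lexicon["union<n>"]["default"])  # prints unión<n>
```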

Generate patterns

Now we generate the n-grams from which the rules will be generated.

$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es

This script outputs lines in the following format:

-language<n>	and<cnjcoo> language<n> ,<cm>	lengua<n>	2
+language<n>	plain<adj> language<n> ,<cm>	lenguaje<n>	3
-language<n>	language<n> knowledge<n>	lengua<n>	4
-language<n>	language<n> of<pr> communication<n>	lengua<n>	3
-language<n>	Community<adj> language<n> .<sent>	lengua<n>	5
-language<n>	language<n> in~addition~to<pr> their<det><pos>	lengua<n>	2
-language<n>	every<det><ind> language<n>	lengua<n>	2
+language<n>	and<cnjcoo> *understandable language<n>	lenguaje<n>	2
-language<n>	two<num> language<n>	lengua<n>	8
-language<n>	only<adj> official<adj> language<n>	lengua<n>	2

The + and - indicate whether the line chooses the most frequent translation (-) or a translation which is not the most frequent (+). The pattern selecting the translation is then shown, followed by the translation and then the frequency.
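
As an illustration of this format, here is a small sketch that splits one output line into its parts. It assumes the four fields are separated by tabs and the pattern tokens by single spaces, as in the listing above. The names NgramPattern and parse_ngram_line are hypothetical.

```python
from typing import List, NamedTuple


class NgramPattern(NamedTuple):
    is_default: bool    # True for '-', i.e. the most frequent translation
    source: str         # source word, e.g. language<n>
    pattern: List[str]  # context tokens, including the source word itself
    translation: str    # translation selected by this pattern
    frequency: int      # how often the pattern was seen


def parse_ngram_line(line):
    """Parse one tab-separated line of n-gram pattern output."""
    marked_source, pattern, translation, frequency = line.rstrip("\n").split("\t")
    return NgramPattern(
        is_default=marked_source.startswith("-"),
        source=marked_source[1:],
        pattern=pattern.split(" "),
        translation=translation,
        frequency=int(frequency),
    )


if __name__ == "__main__":
    example = "+language<n>\tplain<adj> language<n> ,<cm>\tlenguaje<n>\t3"
    parsed = parse_ngram_line(example)
    print(parsed.translation, parsed.frequency)  # prints lenguaje<n> 3
```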

Filter rules

Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.
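
There is no dedicated script for this step in this walkthrough, so the following is only a sketch of what such a filter might look like. It assumes that unknown words are the tokens starting with * (as in *understandable above), that coordinating conjunctions carry the <cnjcoo> tag, and that the n-gram lines are tab-separated with the pattern in the second field. The names keep_pattern and filter_patterns are hypothetical.

```python
def keep_pattern(line, banned_tags=("<cnjcoo>",)):
    """Return True if the pattern has no unknown words and no banned tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return False  # not a well-formed n-gram line
    for token in fields[1].split(" "):
        if token.startswith("*"):
            return False  # unknown word
        if any(tag in token for tag in banned_tags):
            return False  # e.g. a coordinating conjunction
    return True


def filter_patterns(lines, banned_tags=("<cnjcoo>",)):
    """Keep only the lines accepted by keep_pattern, in their original order."""
    return [line for line in lines if keep_pattern(line, banned_tags)]


if __name__ == "__main__":
    sample = [
        "-language<n>\tand<cnjcoo> language<n> ,<cm>\tlengua<n>\t2",
        "+language<n>\tand<cnjcoo> *understandable language<n>\tlenguaje<n>\t2",
        "-language<n>\ttwo<num> language<n>\tlengua<n>\t8",
    ]
    for line in filter_patterns(sample):
        print(line)
```

The filtered lines can be written to a new file and passed to the rule generation step in place of the unfiltered n-gram file. Other tags to exclude can be passed in banned_tags.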

Generate rules

The final stage is to generate the rules:

$ python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx
