Difference between revisions of "Generating lexical-selection rules from a parallel corpus"

Latest revision as of 13:58, 7 October 2014

En français

  This page is deprecated. For further information see: Learning rules from parallel and non-parallel corpora

If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.

You will need

Here is a list of software that you will need installed:

  • Giza++ (or some other word aligner)
  • Moses (for making Giza++ less human hostile)
  • All the Moses scripts
  • lttoolbox
  • Apertium
  • apertium-lex-tools

Furthermore you'll need:

  • an Apertium language pair
  • a parallel corpus (see Corpora)

Installing prerequisites

See Minimal installation from SVN for apertium/lttoolbox.

See Constraint-based lexical selection module for apertium-lex-tools.

For Giza++, the Moses decoder and its scripts, you can do:

$ mkdir ~/smt
$ cd ~/smt
$ mkdir local # our "install prefix"
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
$ tar xzvf giza-pp-v1.0.7.tar.gz
$ cd giza-pp
$ make
$ mkdir ../local/bin
$ cp GIZA++-v2/snt2cooc.out ../local/bin/
$ cp GIZA++-v2/snt2plain.out ../local/bin/
$ cp GIZA++-v2/GIZA++ ../local/bin/
$ cp mkcls-v2/mkcls ../local/bin/
$ cd ..
$ git clone https://github.com/moses-smt/mosesdecoder
$ cd mosesdecoder/
$ ./bjam 

Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (for example, ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl). See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.
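
Before going further it can be worth checking that the four binaries copied above really ended up in the install prefix. The following is only a sketch: check_bins is a hypothetical helper name, and the list of files is simply the one used in the cp commands above.

```shell
# Hypothetical helper: report any of the copied Giza++/mkcls binaries
# that is missing (or not executable) in the given directory.
check_bins() {
    dir=$1
    missing=0
    for f in GIZA++ mkcls snt2cooc.out snt2plain.out; do
        if [ ! -x "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the install prefix from this page.
check_bins "$HOME/smt/local/bin" || echo "some binaries are missing, re-run the cp steps"
```

The same directory is what you later pass to train-model.perl with -external-bin-dir.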

Getting started

Important: If you don't want to go through the whole process step by step, you can use the Makefile provided in the last section of this page.

We're going to do the example with EuroParl and the English to Spanish pair in Apertium.

Given that you've got all the stuff installed, the work will be as follows:

Prepare corpus

To generate the rules, we need three files:

  • The tagged and tokenised source corpus
  • The tagged and tokenised target corpus
  • The output of the lexical transfer module in the source→target direction, tokenised

These three files should be sentence aligned.

Make a folder called data-en-es. We are going to keep all the generated files there.

The first thing that we need to do is tag both sides of the corpus:

$ nohup cat europarl.en-es.en | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &
$ nohup cat europarl.en-es.es | apertium-destxt |\
 apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &

Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use `~`). The paste output has the English side in field 1 and the Spanish side in field 2:

$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en
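
To see what the two sed expressions in the commands above do, here is a small sketch that runs them on a single line. The sample input is made up for illustration (it is not taken from EuroParl), and retokenise is a hypothetical helper name.

```shell
# Hypothetical helper: the same two sed expressions as in the commands above.
retokenise() {
    # 1) turn every blank into '~' (this also joins the parts of multiword lemmas)
    # 2) after each '$', drop everything up to the next '^' and leave one blank
    sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g'
}

# Made-up sample line in Apertium stream format.
printf '%s\n' '^in addition to<pr>$ ^house<n><sg>$^.<sent>$' | retokenise
```

After this step every lexical unit is separated from the next by exactly one blank, and the only blanks left are the ones between units, which is what the Moses scripts expect from a tokenised corpus.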

Next, we need to clean the corpus and remove long sentences. (Make sure you are in the same directory as the one where you have your europarl corpus)

$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40

Input sentences: 1786594  Output sentences:  1467708

We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).

$ mkdir testing
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es

These files are:

  • data-en-es/europarl-en-es.tag-clean.en: The tagged source language side of the corpus
  • data-en-es/europarl-en-es.tag-clean.es: The tagged target language side of the corpus

Check that they have the same length:

$ wc -l data-en-es/europarl-en-es.tag-clean.*
   1400000 data-en-es/europarl-en-es.tag-clean.en
   1400000 data-en-es/europarl-en-es.tag-clean.es
   2800000 total
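
If you would rather have the shell do the comparison than read the numbers yourself, a check along these lines could be used. It is a sketch only, and same_length is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if both files have the same number of lines.
same_length() {
    a=$(wc -l < "$1")
    b=$(wc -l < "$2")
    [ "$a" -eq "$b" ]
}

# Usage with the files from this page:
#   same_length data-en-es/europarl-en-es.tag-clean.en \
#               data-en-es/europarl-en-es.tag-clean.es \
#     || echo "the two sides are no longer sentence aligned"
```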

Align corpus

Now we've got the corpus files ready, we can align the corpus using the Moses scripts:

nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \
 ~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \
 -f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
 -lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &

Note: Remember to change all the paths in the above command!

You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.
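
A placeholder LM of the kind described above could be made like this. The file name dummy.lm is an assumption, and the sketch prints an absolute path only because the command above also passes one to -lm.

```shell
# Placeholder LM: it only has to exist, its content is not used in this process.
# The name dummy.lm is an assumption for illustration.
lm="$PWD/dummy.lm"
printf 'a few words\n' > "$lm"

# Print the option in the same form as in the train-model.perl call above.
echo "-lm 0:5:$lm:0"
```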

This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.

Extract sentences

After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.

zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \
    | sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \
  | sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \
  | sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es
rm tmp1 tmp2 tmp3

Then we want to make sure again that our file has the right number of lines:

$ wc -l data-en-es/europarl.phrasetable.en-es
1400000 data-en-es/europarl.phrasetable.en-es

Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:

$ ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \
  > data-en-es/europarl-en-es.candidates.en-es

These are basically sentences that we can hope that Apertium might be able to generate.

Extract bilingual dictionary candidates

Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.

python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es

where data-en-es/europarl-en-es.biltrans-candidates.en-es contains the generated candidates for the bilingual dictionary.

Extract frequency lexicon

The next step is to extract the frequency lexicon.

$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es

This file should look like:

$ head data-en-es/europarl-en-es.lex.en-es
31381 union<n> unión<n> @
101 union<n> sindicato<n>
1 union<n> situación<n>
1 union<n> monetario<adj>
4 slope<n> pendiente<n> @
1 slope<n> ladera<n>

Where the highest frequency translation is marked with an @.

Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.
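
As a sketch of how the lexicon could be used for that purpose, the following pulls out just the default pairs, i.e. the lines marked with @. default_translations is a hypothetical helper name, and the awk field numbers assume that lemmas contain no blanks, as in the sample above.

```shell
# Hypothetical helper: print "source target" for every line marked with '@',
# i.e. the most frequent translation of each source word.
# Assumes lemmas contain no blanks, so the fields are: count, source, target, marker.
default_translations() {
    awk '$NF == "@" { print $2, $3 }' "$@"
}

# Usage on three of the sample lines shown above.
printf '%s\n' \
    '31381 union<n> unión<n> @' \
    '101 union<n> sindicato<n>' \
    '4 slope<n> pendiente<n> @' | default_translations
```

The helper also accepts file names, for example default_translations data-en-es/europarl-en-es.lex.en-es.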

Generate patterns

Now we generate the ngrams that we are going to generate the rules from.

$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es

This script outputs lines in the following format:

-language<n>	and<cnjcoo> language<n> ,<cm>	lengua<n>	2
+language<n>	plain<adj> language<n> ,<cm>	lenguaje<n>	3
-language<n>	language<n> knowledge<n>	lengua<n>	4
-language<n>	language<n> of<pr> communication<n>	lengua<n>	3
-language<n>	Community<adj> language<n> .<sent>	lengua<n>	5
-language<n>	language<n> in~addition~to<pr> their<det><pos>	lengua<n>	2
-language<n>	every<det><ind> language<n>	lengua<n>	2
+language<n>	and<cnjcoo> *understandable language<n>	lenguaje<n>	2
-language<n>	two<num> language<n>	lengua<n>	8
-language<n>	only<adj> official<adj> language<n>	lengua<n>	2

The + and - indicate whether the line chooses the most frequent translation (-) or a translation which is not the most frequent (+). Each line then gives the source word, the pattern that selects the translation, the translation itself and the frequency.

Filter rules

Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.
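
One possible filter, as a sketch only: it drops every pattern line that contains a conjunction or an unknown word. It assumes that conjunction tags start with cnj (like the cnjcoo tag in the sample output) and that unknown words are marked with a leading *, as in the *understandable example. filter_patterns is a hypothetical helper name.

```shell
# Hypothetical helper: remove pattern lines containing a conjunction tag
# (assumed to start with "<cnj") or an unknown word (marked with '*').
filter_patterns() {
    grep -v '<cnj' | grep -v '\*'
}

# Usage (the name of the filtered file is an assumption):
#   filter_patterns < data-en-es/europarl-en-es.ngrams.en-es \
#       > data-en-es/europarl-en-es.ngrams.filtered.en-es
```

If you filter into a new file like this, remember to give that file to ngrams-to-rules.py in the next step.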

Generate rules

The final stage is to generate the rules:

python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx


Makefile

For the whole process you can run the following Makefile:




all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)

data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)
	if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi
	cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \
	| apertium-destxt \
	| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \
	| apertium-pretransfer > $@;

data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)
	if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi
	cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \
	| apertium-destxt \
	| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \
	| apertium-pretransfer > $@;

data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)
	paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \
	| grep '<' \
	| cut -f1 \
	| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@

data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)
	paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \
	| grep '<' \
	| cut -f2 \
	| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@

data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)
	perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;

model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)
	-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \
	 -f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
	-lm 0:5:$(LM):0 2>&1

data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model
	zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@
	cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \
	| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1
	cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2
	cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3
	cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)
	paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)
	rm tmp1 tmp2 tmp3

data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)
	python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \
			data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null

data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)
	python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null

data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)
	python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \
	> $@ 2>/dev/null

data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)
	python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@

data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)
	python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null
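
The Makefile above does not define the variables it uses, so they have to come from somewhere, for example from the make command line. The following sketch shows one way to do that for the English to Spanish example. Every value is an assumption based on the paths used elsewhere on this page, so change them to match your own setup.

```shell
# All values below are assumptions for the en-es example on this page.
SL=en                                   # source language
TL=es                                   # target language
PAIR=en-es                              # as used in the corpus file names
CORPUS=europarl                         # inputs are then europarl.en-es.en and europarl.en-es.es
TRAINING_LINES=1400000                  # how many lines of the corpus to use
crisphold=1.5                           # same ratio as in "Generate patterns"
DATA="$HOME/source/apertium-en-es"      # the language pair directory
LEX_TOOLS="$HOME/source/apertium-lex-tools"
SCRIPTS="$LEX_TOOLS/scripts"
MOSESDECODER="$HOME/smt/mosesdecoder/scripts/training"
BIN_DIR="$HOME/smt/local/bin"           # where the Giza++ binaries were copied
LM="$HOME/corpora/europarl/europarl.lm" # any LM file, see "Align corpus"

# Print the command first; remove 'echo' to actually run make.
echo make SL="$SL" TL="$TL" PAIR="$PAIR" CORPUS="$CORPUS" \
    TRAINING_LINES="$TRAINING_LINES" crisphold="$crisphold" \
    DATA="$DATA" LEX_TOOLS="$LEX_TOOLS" SCRIPTS="$SCRIPTS" \
    MOSESDECODER="$MOSESDECODER" BIN_DIR="$BIN_DIR" LM="$LM"
```

Variables given on the command line like this are visible inside the Makefile as $(SL), $(TL) and so on.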