Learning rules from parallel and non-parallel corpora

Estimating rules using parallel corpora

Using a parallel corpus is recommended for any type of machine translation training whenever such a resource is available. This section describes three methods for estimating lexical selection rules from a parallel corpus. It first covers the part of the training process shared by all three methods, then describes each method separately.

Prerequisites

The training methods depend on several software packages. First, download and install:

lttoolbox
Apertium
apertium-lex-tools
Moses and its training scripts, which you can install using the commands given under Installing prerequisites

You will also need:

an Apertium language pair
a parallel corpus (see Corpora)

Installing prerequisites

See Minimal installation from SVN for apertium/lttoolbox.

See Constraint-based lexical selection module for apertium-lex-tools.

For the Moses decoder, you can run:

git clone https://github.com/moses-smt/mosesdecoder
cd mosesdecoder/
./bjam

If you ran these commands in ~/smt, the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (for example, ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl). See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.

Preparing the training files

The training files (the source and target sides of the parallel corpus) are first analysed and tagged.

Next, lines with no analysis are removed, and blanks within tokens are replaced with a placeholder character (~), since Giza tokenizes a sentence by splitting on whitespace.

Finally, both files are cleaned using a Moses training script so that Giza will not crash during training.

All of this can be achieved using the following script:

CORPUS="Europarl3"
PAIR="es-pt"
SL="pt"
TL="es"
DATA="/home/philip/Apertium/apertium-es-pt"

LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"
SCRIPTS="$LEX_TOOLS/scripts"
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"
TRAINING_LINES=100000


# Create the output directory if it does not exist yet
mkdir -p "data-$SL-$TL"

# TAG CORPUS
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;

cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;

# Total number of lines in the source corpus (not used further in this script)
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`


# REMOVE LINES WITH NO ANALYSES
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f1 > data-$SL-$TL/$CORPUS.lines.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL
rm data-$SL-$TL/*.new


# CLEAN CORPUS
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;

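Set the variables as follows:

CORPUS: base name of the corpus files (Europarl3 in the example)
PAIR: the language pair as it appears in the corpus file names (es-pt in the example); the script reads $CORPUS.$PAIR.$SL and $CORPUS.$PAIR.$TL
SL: the source language
TL: the target language
DATA: path to the language resources for the language pair
LEX_TOOLS: path to apertium-lex-tools
SCRIPTS: path to the apertium-lex-tools scripts (derived from LEX_TOOLS)
MOSESDECODER: path to the Moses training scripts (mosesdecoder/scripts/training)
TRAINING_LINES: number of corpus lines to use for training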
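
The two text-level steps in the script (dropping sentence pairs that contain no analysis, and protecting blanks inside tokens) can be tried on toy data without Apertium installed. The following is a minimal sketch under that assumption: the function names filter_analysed and normalize_tokens and the sample lines are invented for illustration, while the grep and sed expressions are the ones used in the script.

```shell
#!/usr/bin/env bash

# Paste the two sides together, keep only pairs in which at least one
# side contains a tag ('<'), and print the requested side (1 or 2).
filter_analysed() {   # usage: filter_analysed SL_FILE TL_FILE SIDE
    paste "$1" "$2" | grep '<' | cut -f "$3"
}

# Replace every blank with '~', then put exactly one blank after each
# token-final '$' so that only token boundaries are split on by Giza.
normalize_tokens() {
    sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g'
}

# Invented sample: one analysed sentence pair and one empty pair.
tmp=$(mktemp -d)
printf '%s\n' '^a fin de<cnjadv>$ ^casa<n><f><sg>$' '' > "$tmp/sl"
printf '%s\n' '^a fim de<cnjadv>$ ^casa<n><f><sg>$' '' > "$tmp/tl"

filter_analysed "$tmp/sl" "$tmp/tl" 1 | normalize_tokens
rm -r "$tmp"
```

The empty pair is dropped, and in the remaining line the blanks inside the multiword token become ~ while the two tokens stay separated by a single blank.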

Learning rules with Giza

Installing Giza

You can download and install Giza in the following way:


$ mkdir ~/smt
$ cd ~/smt
$ mkdir local # our "install prefix"
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
$ tar xzvf giza-pp-v1.0.7.tar.gz
$ cd giza-pp
$ make
$ mkdir ../local/bin
$ cp GIZA++-v2/snt2cooc.out ../local/bin/
$ cp GIZA++-v2/snt2plain.out ../local/bin/
$ cp GIZA++-v2/GIZA++ ../local/bin/
$ cp mkcls-v2/mkcls ../local/bin/

Note that all of the binaries are placed in a single directory (~/smt/local/bin). The Moses training script requires this, because it is given one directory in which to look for them (the -external-bin-dir option).
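
Before starting the alignment it can save time to confirm that the binaries really are in the install directory. A small sketch, assuming the install prefix ~/smt/local/bin from the commands above; the function name missing_giza_bins is hypothetical.

```shell
#!/usr/bin/env bash

# Print the name of every binary copied in the steps above that is
# not present (or not executable) in the given directory.
missing_giza_bins() {   # usage: missing_giza_bins DIRECTORY
    local dir=$1 name
    for name in GIZA++ snt2cooc.out snt2plain.out mkcls; do
        [ -x "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Check the install prefix used in the commands above.
missing=$(missing_giza_bins "$HOME/smt/local/bin")
if [ -n "$missing" ]; then
    echo "Missing from ~/smt/local/bin:"
    echo "$missing"
else
    echo "All expected binaries found."
fi
```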

Alignment

Giza++ is used to align the tokens of the parallel sentences in the two corpora. Alignment can be slow, depending on the size of the corpora.

After alignment, the tokens' tags can be trimmed so that they match the set of tags found in the bilingual dictionary of the language pair.

Finally, a bilingual transfer output is obtained from the source language side so that ambiguous sentences and missing bidix candidates can be extracted.

# Directory containing the GIZA++, snt2cooc.out and mkcls binaries
BIN_DIR="/home/philip/Apertium/smt/bin"
# ALIGN
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \
 -f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1

# EXTRACT
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL

# TRIM TAGS
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR

paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL
rm tmp1 tmp2 tmp3
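
The trimming step relies on splitting each phrase-table line on " ||| ", processing the fields separately, and pasting them back together. The sketch below shows only that split-and-rejoin mechanism on an invented line. The helper names split_field and join_fields are hypothetical, and the real script additionally pipes the first two fields through process-tagger-output. A literal tab held in a variable is used here instead of the \t escape, which is an assumption made for portability across sed implementations.

```shell
#!/usr/bin/env bash
TAB=$(printf '\t')

# Print field N of a ' ||| '-separated stream read from stdin.
split_field() {   # usage: split_field N < phrasetable
    sed "s/ ||| /$TAB/g" | cut -f "$1"
}

# Paste per-field files back together into ' ||| '-separated lines.
join_fields() {   # usage: join_fields FILE1 FILE2 FILE3
    paste "$@" | sed "s/$TAB/ ||| /g"
}

# Invented example line: two token sequences and an alignment column.
line='^casa<n><f><sg>$ ^grande<adj><mf><sg>$ ||| ^casa<n><f><sg>$ ^grande<adj><mf><sg>$ ||| 0-0 1-1'

tmp=$(mktemp -d)
for i in 1 2 3; do
    printf '%s\n' "$line" | split_field "$i" > "$tmp/f$i"
done
join_fields "$tmp/f1" "$tmp/f2" "$tmp/f3"   # prints the original line again
rm -r "$tmp"
```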

Estimating rules using non-parallel corpora

Prerequisites:

Install apertium-lex-tools
Install IRSTLM (http://sourceforge.net/projects/irstlm/)
Estimate a binary target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual).
The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml as a reference on how to add these modes if they do not exist.

Place the following Makefile in the folder where you want to run your training process:

CORPUS=setimes
DIR=sh-mk
DATA=/home/philip/Apertium/apertium-sh-mk/
AUTOBIL=sh-mk.autobil.bin
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools
THR=0

#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(DIR).txt | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@
 
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@

data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger
	cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked
	paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@
 
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py  data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).freq.lrx:  data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@  
 
data/$(CORPUS).$(DIR).patterns.lrx:  data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@

In the same folder, also place your source-side corpus file. The corpus file must be named basename.language-pair.txt, matching $(CORPUS).$(DIR).txt in the Makefile.
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.
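
The first command of the tagger rule passes every corpus line through sed so that it ends in a period before it is analysed. A minimal sketch of that step in isolation, assuming the intent is to append a period to lines that lack one; the function name ensure_final_period and the sample lines are invented. The & in the replacement re-inserts the matched last character so that it is kept.

```shell
#!/usr/bin/env bash

# Append a period to every non-empty line that does not already end in one.
# '&' stands for the matched last character, so it is kept, not overwritten.
# Inside a Makefile recipe the '$' has to be written as '$$'.
ensure_final_period() {
    sed 's/[^.]$/&./'
}

printf '%s\n' 'First sentence' 'Second sentence.' | ensure_final_period
```

Empty lines are left untouched, because the pattern needs at least one character to match.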

Set the Makefile variables as follows:

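CORPUS is the base name of your corpus file
DIR is the language pair
DATA is the path to the language resources for the language pair (with a trailing slash, since the rules concatenate it directly with file names)
AUTOBIL is the file name of the binary bilingual dictionary for the language pair (the rules above locate it as $(DATA)$(DIR).autobil.bin)
SCRIPTS is the path to the apertium-lex-tools scripts
MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
LEX_TOOLS is the path to apertium-lex-tools
THR is the threshold passed to ngrams-to-rules.py when the pattern rules are generated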

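Finally, running make in that folder generates the lexical selection rules for the specified language pair. The all target builds data/$(CORPUS).$(DIR).freq.lrx.bin and data/$(CORPUS).$(DIR).patterns.lrx.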
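
Since the whole run depends on the corpus file name and the language model path being right, a quick check before invoking make can catch mistakes early. The following is only a sketch: the function name missing_inputs is hypothetical, and the values are the ones from the example Makefile.

```shell
#!/usr/bin/env bash

# Print the input files that the Makefile needs but that do not exist.
# Arguments mirror the Makefile variables CORPUS, DIR and MODEL.
missing_inputs() {   # usage: missing_inputs CORPUS DIR MODEL
    local corpus=$1 dir=$2 model=$3 f
    for f in "$corpus.$dir.txt" "$model"; do
        [ -f "$f" ] || printf '%s\n' "$f"
    done
}

# Values taken from the example Makefile; adjust to your own setup.
missing=$(missing_inputs setimes sh-mk \
    /home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm)
if [ -n "$missing" ]; then
    echo "Missing inputs:"
    echo "$missing"
else
    echo "Inputs found; run make to start training."
fi
```

Running make -n in the same folder prints the commands that make would execute without running them, which is another cheap way to check the variable settings.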
