Learning rules from parallel and non-parallel corpora


Estimating rules using parallel corpora

It is always recommended to use a parallel corpus for any type of machine translation training when such a resource is available. This section describes three methods for estimating lexical selection rules from a parallel corpus. We first describe the part of the training process shared by all three methods, and then describe each method separately.


Overview

Assumptions

Your language pair should be fully set up in the direction that you're training for, and at least up until pretransfer in the opposite direction. Of course, you also need a parallel corpus for this method (see Running the monolingual rule learning if you only have monolingual corpora).
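For example, you can check that analysis, tagging and pretransfer work in the training direction with something like the following (a sketch, assuming the es-pt pair and the pt→es direction used in the examples below; the exact mode names depend on the pair's modes.xml):

echo "um exemplo" | apertium -d /path/to/apertium-es-pt pt-es-tagger | apertium-pretransfer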

Method
  • First we run both sides of our corpus through the pipeline up to the pretransfer stage
  • Then we use GIZA++/Moses to create an alignment
  • Then we translate the aligned sentences from pretransfer to bidix in the source→target direction
  • We run extract-sentences.py on the alignment and the bidix-translated file to create candidates
  • We run extract-freq-lexicon.py on the candidates to create a .lex file
  • We turn the .lex file and the candidates list into lexical selection rules, using one of the following methods:
    • Maximum Likelihood rule extraction
    • Maximum Entropy rule extraction

Prerequisites

The training methods use several software packages that need to be installed. First you will need to download and install:

  • apertium and lttoolbox
  • apertium-lex-tools
  • GIZA++ (or mgiza)
  • Moses
  • IRSTLM (or KenLM)

Furthermore you will also need:

  • an Apertium language pair
  • a parallel corpus (see Corpora)

Installing prerequisites

See Installation for apertium/lttoolbox/apertium-lex-tools, though if you install them all from the nightly repo you'll still have to do

svn co https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools/
cd apertium-lex-tools/scripts
make

to get process-tagger-output and the lexical selection scripts. (The program multitrans should be installed by apertium-lex-tools to your PATH.)
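A quick sanity check (assuming apertium-lex-tools put lrx-comp and multitrans on your PATH, and that the make above built process-tagger-output in the scripts directory):

command -v multitrans lrx-comp
ls apertium-lex-tools/scripts/process-tagger-output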


See IRSTLM, GIZA++, Moses for how to install those.


The clean-corpus and train-model scripts referred to below will then be found under the Moses installation, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl and ~/smt/mosesdecoder/scripts/training/train-model.perl.

See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.

Preparing the training files

The parallel corpus is processed so that the training files (the source- and target-side corpora) are first analysed, tagged and run up to the pretransfer step.

Next, lines with no analyses are removed, and blanks within tokens are replaced with a different character (~), since Giza tokenizes a sentence by splitting on whitespace.
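For instance, the sed commands in the script below turn the spaces inside each lexical unit into ~ while keeping a single space between units (a hypothetical example with a multiword lexical unit):

echo '^hoy en día<adv>$ ^de<pr>$' | sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g'
# gives: ^hoy~en~día<adv>$ ^de<pr>$   (plus a trailing space)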

Finally, both files are cleaned using a Moses training script (clean-corpus-n.perl) so that Giza will not crash during training.

All of this can be achieved using the following script:

CORPUS="Europarl3"
PAIR="es-pt"
SL="pt"
TL="es"
DATA="/home/philip/Apertium/apertium-es-pt"

LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"
SCRIPTS="$LEX_TOOLS/scripts"
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"
TRAINING_LINES=100000


if [ ! -d data-$SL-$TL ]; then 
	mkdir data-$SL-$TL;
fi

# TAG CORPUS
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;

cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;

N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`


# REMOVE LINES WITH NO ANALYSES
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f1 > data-$SL-$TL/$CORPUS.lines.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL
rm data-$SL-$TL/*.new


# CLEAN CORPUS
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;

Make sure you set the variables as follows:

  • CORPUS: the base name of the corpus you're using (here Europarl3)
  • PAIR: the language pair of the corpus, as it appears in the corpus file names
  • SL: the source language
  • TL: the target language
  • DATA: path to the language resources for the language pair
  • LEX_TOOLS: path to apertium-lex-tools
  • MOSESDECODER: path to the Moses training scripts
  • TRAINING_LINES: the number of corpus lines to use for training
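Once the variables are set, save the script under any name (prepare-corpus.sh here is just an illustrative name) and run it from the directory containing the corpus files, which with the settings above would be Europarl3.es-pt.pt and Europarl3.es-pt.es:

bash prepare-corpus.sh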

Learning rules with Giza

Make a language model

See IRSTLM for how to do this.

I’ve used kenlm:

lmplz -o 5 < data-$SL-$TL/$CORPUS.tag-clean.$SL > data-$SL-$TL/$CORPUS.tag-clean.$SL.arpa
lmplz -o 5 < data-$SL-$TL/$CORPUS.tag-clean.$TL > data-$SL-$TL/$CORPUS.tag-clean.$TL.arpa

Then pass LM type :8: in the -lm argument to train-model.perl instead of :2:, as sketched below.
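In practice this only changes the LM path and the final type field of the -lm argument used in the Alignment section below, for example (a sketch, pointing at one of the ARPA models built above):

LM=$PWD/data-$SL-$TL/$CORPUS.tag-clean.$SL.arpa
# then in train-model.perl, put type 8 at the end of the -lm argument:
#   -lm 0:5:${LM}:8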

Alignment

Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.

After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.

Next, a bilingual transfer output is obtained from the source language side so that ambiguous sentences and missing bidix candidates can be extracted.

Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.

BIN_DIR="/home/philip/Apertium/smt/bin"
# *Absolute path* to the language model you created (with IRSTLM or KenLM):
LM=/home/philip/Apertium/gsoc2013/giza/europarl.lm

# ALIGN
PYTHONIOENCODING=utf-8 perl $MOSESDECODER/train-model.perl \
  -external-bin-dir "$BIN_DIR" \
  -corpus data-$SL-$TL/$CORPUS.tag-clean \
  -f $TL -e $SL \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:5:${LM}:0 2>&1

# (if you use mgiza, add the -mgiza argument)


# EXTRACT
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL

# TRIM TAGS
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3

cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \
	| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR

paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL
rm tmp1 tmp2 tmp3

# SENTENCES
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \
  > data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null

# FREQUENCY LEXICON
python3 $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null

Make sure you set the BIN_DIR variable to the path of the binary folder generated by the Giza installation process, and LM to the absolute path of the language model you created above.

Maximum likelihood rule extraction

The maximum likelihood (ML) method counts how many times each translation occurs in a given context, and compares that count with the default translation from the frequency lexicon. It then decides whether to create a rule selecting the given translation, or to leave the default translation.

The rule generation process is done with the following script:

crisphold=1.5
# NGRAM PATTERNS
python3 $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL

# NGRAMS TO RULES
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null

Here 'crisphold' determines how many times more frequently than the default translation an individual translation must occur in a given context for a rule to be created.
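The resulting rule file can then be compiled with lrx-comp, in the same way as in the Makefile for the non-parallel method below:

lrx-comp data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx.bin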

Maximum entropy rule extraction

The maximum entropy (ME) method learns a discriminative model that assigns each individual ngram a weight indicating how strongly it contributes to a certain translation.

The rule extraction process is done in the following way:

MIN=1
YASMET=$LEX_TOOLS/yasmet
python3 $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events

echo -n "" > all-lambdas
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do

	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet.$i;
	cat events.trimmed | grep "^$i" | cut -f3  >> tmp.yasmet.$i;
	echo "$i"
	cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; 
	cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i
	cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;
done

rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög

The MIN variable denotes how many times a certain context should occur for it to be taken into account.
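As with the maximum likelihood rules, the generated rule file is compiled with lrx-comp before use:

lrx-comp $PAIR.ngrams-lm-$MIN.xml $PAIR.ngrams-lm-$MIN.bin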

Poor man's alignment

When using a large corpus, aligning tokens with Giza can be very slow. For that reason, we can estimate pairwise and ngram counts directly by relaxing the co-occurrence criteria used by Giza.

For each possible translation of an ambiguous word, we add one to its count whenever that translation occurs anywhere in the corresponding target sentence of the parallel corpus.

A script for learning rules with maximum likelihood is given below:
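The following is only a minimal, hypothetical sketch of the relaxed counting, not the full training script: it assumes a line-aligned file of ambiguous biltrans output for the source side (here called $CORPUS.ambig, an assumed name) next to the tagged target-language file from the preparation step, and it ignores the tag trimming that the Giza pipeline performs:

paste data-$SL-$TL/$CORPUS.ambig data-$SL-$TL/$CORPUS.tagged.$TL |
awk -F'\t' '{
  nlu = split($1, lus, "\\$")               # source-side LUs of the form ^sl<tags>/tl1<tags>/tl2<tags>$
  for (i = 1; i <= nlu; i++) {
    ntr = split(lus[i], parts, "/")
    if (ntr > 2) {                          # more than one translation: the word is ambiguous
      sl = parts[1]; sub(/.*\^/, "", sl)    # strip everything up to the opening ^
      for (j = 2; j <= ntr; j++)
        if (index($2, parts[j]) > 0)        # translation occurs anywhere in the target sentence
          count[sl "\t" parts[j]]++
    }
  }
}
END { for (k in count) print count[k] "\t" k }' | sort -rn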


Estimating rules using non-parallel corpora

Prerequisites:

  • apertium-lex-tools (for multitrans and the training scripts)
  • a binary target-language model for irstlm-ranker-frac (the MODEL variable below)
  • a language pair with the usual tagger mode and a <pair>-multi mode; you can use the modes.xml of an existing pair as a reference on how to add these modes if they do not exist.

Place the following Makefile in the folder where you want to run your training process:

CORPUS=setimes
DIR=sh-mk
DATA=/home/philip/Apertium/apertium-sh-mk/
AUTOBIL=sh-mk.autobil.bin
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools
THR=0

#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@
 
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@

data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger
	cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked
	paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@
 
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py  data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).freq.lrx:  data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@  
 
data/$(CORPUS).$(DIR).patterns.lrx:  data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@

In the same folder, also place your source-side corpus file. The corpus file needs to be named "basename"."language-pair".txt.
As an illustration, in the Makefile above the corpus file is named setimes.sh-mk.txt.

Set the Makefile variables as follows:

  • CORPUS denotes the base name of your corpus file
  • DIR stands for the language pair
  • DATA is the path to the language resources for the language pair
  • AUTOBIL is the path to the binary bilingual dictionary for the language pair
  • SCRIPTS denotes the path to the lex-tools scripts
  • LEX_TOOLS is the path to apertium-lex-tools (which provides multitrans)
  • MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
  • THR is the threshold passed to ngrams-to-rules.py (cf. crisphold above)

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
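For example, with the variable values from the Makefile above, running the training and compiling the context rules looks like this:

make
# main outputs:
#   data/setimes.sh-mk.freq.lrx.bin  - compiled rules selecting the most frequent translation
#   data/setimes.sh-mk.patterns.lrx  - context-based rules, compiled with:
lrx-comp data/setimes.sh-mk.patterns.lrx data/setimes.sh-mk.patterns.lrx.bin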