Learning rules from parallel and non-parallel corpora
Estimating rules using parallel corpora
It is always recommended to use a parallel corpus for any type of machine translation training when such a resource is available. This section describes three methods for estimating lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then describe each of the individual methods separately.
Prerequisites
The training methods use several software packages that need to be installed. First you will need to download and install:
- lttoolbox
- Apertium
- apertium-lex-tools
- Moses and its training scripts (installation instructions are given below)
Furthermore you will also need:
- an Apertium language pair
- a parallel corpus (see Corpora)
Installing prerequisites
See Minimal installation from SVN for apertium/lttoolbox.
See Constraint-based lexical selection module for apertium-lex-tools.
For the Moses decoder you can do:
git clone https://github.com/moses-smt/mosesdecoder
cd mosesdecoder/
./bjam
The clean-corpus and train-model scripts referred to below will then be under ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl), assuming you cloned into ~/smt. See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.
Preparing the training files
The parallel corpus is processed as follows. First, the training files (the source and target side of the corpus) are analysed and tagged.
Next, lines with no analysis are removed, and blanks within tokens are replaced with a new character ('~'), since Giza tokenises a sentence by splitting on white space.
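The blank replacement can be sketched in isolation. The two sed expressions below are the ones used in the preparation script; the lexical units are invented examples of tagger output:

```shell
# A multiword unit such as ^echar de menos<vblex>$ would otherwise be split
# into three tokens by Giza, so all blanks are rewritten as '~'; the second
# sed then restores a single blank after each '$' token terminator.
echo '^echar de menos<vblex>$ ^pan<n>$' | sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g'
# → ^echar~de~menos<vblex>$ ^pan<n>$ (plus a trailing blank)
```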
Finally, both files are cleaned using a Moses training script (clean-corpus-n.perl), which drops sentence pairs that are empty or longer than 40 tokens, so that Giza will not crash during the training process.
All of this can be achieved using the following script:
CORPUS="Europarl3"
PAIR="es-pt"
SL="pt"
TL="es"
DATA="/home/philip/Apertium/apertium-es-pt"
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"
SCRIPTS="$LEX_TOOLS/scripts"
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"
TRAINING_LINES=100000

if [ ! -d data-$SL-$TL ]; then mkdir data-$SL-$TL; fi

# TAG CORPUS
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \
  | apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;

cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \
  | apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;

N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`

# REMOVE LINES WITH NO ANALYSES
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
  | cut -f1 > data-$SL-$TL/$CORPUS.lines.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
  | cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
  | cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines

cat data-$SL-$TL/$CORPUS.tagged.$SL.new \
  | sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \
  | sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL
rm data-$SL-$TL/*.new

# CLEAN CORPUS
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;
Make sure you set the variables as follows:
- CORPUS="Europarl3": name of the corpus that you're using
- PAIR: the language pair, as used in the corpus file names (e.g. "es-pt")
- SL: the source language
- TL: the target language
- DATA: path to the language resources for the language pair
- LEX_TOOLS: path to apertium-lex-tools
- MOSESDECODER: the path to the mosesdecoder training scripts (scripts/training)
- TRAINING_LINES: the number of corpus lines to use for training
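The "REMOVE LINES WITH NO ANALYSES" step of the script can be illustrated on a toy example (file names and contents invented): rows of the pasted file that contain no tag marker '<' are dropped, and the surviving columns are then split out again with cut.

```shell
# Hypothetical two-line tagged corpora: the second line has no analysis.
printf '^casa<n><f><sg>$\njunk line\n' > toy.tagged.sl
printf '^house<n><sg>$\nmore junk\n' > toy.tagged.tl
seq 1 2 > toy.lines

# Keep only rows containing a tag, then recover the source-language column:
paste toy.lines toy.tagged.sl toy.tagged.tl | grep '<' | cut -f2
# → ^casa<n><f><sg>$
```

Note that, as in the real script, a row is kept whenever either side contains an analysis; here line 2 has no tags on either side, so it is removed.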
Learning rules with Giza
Installing Giza
$ mkdir ~/smt
$ cd ~/smt
$ mkdir local # our "install prefix"
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
$ tar xzvf giza-pp-v1.0.7.tar.gz
$ cd giza-pp
$ make
$ mkdir ../local/bin
$ cp GIZA++-v2/snt2cooc.out ../local/bin/
$ cp GIZA++-v2/snt2plain.out ../local/bin/
$ cp GIZA++-v2/GIZA++ ../local/bin/
$ cp mkcls-v2/mkcls ../local/bin/
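The copied binaries are not on your PATH by default, and the Moses training scripts need to find them. A minimal sketch, assuming the ~/smt/local install prefix chosen above (add the line to ~/.bashrc to make it permanent):

```shell
# Make GIZA++, snt2cooc.out, snt2plain.out and mkcls visible to the
# Moses training scripts (assumes the ~/smt/local prefix used above).
export PATH="$HOME/smt/local/bin:$PATH"
```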
Estimating rules using non-parallel corpora
Prerequisites:
- Install apertium-lex-tools
- Install IRSTLM (http://sourceforge.net/projects/irstlm/)
- Estimate a binary target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual).
- The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml as a reference on how to add these modes if they do not exist.

Place the following Makefile in the folder where you want to run your training process:
CORPUS=setimes
DIR=sh-mk
DATA=/home/philip/Apertium/apertium-sh-mk/
AUTOBIL=sh-mk.autobil.bin
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools
THR=0

#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@

data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger
	cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked
	paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@

data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@
In the same folder, also place your source side corpus file. The corpus file must be named "basename"."language-pair".txt; in the example Makefile above, it is named setimes.sh-mk.txt.
Set the Makefile variables as follows:
- CORPUS denotes the base name of your corpus file
- DIR stands for the language pair
- DATA is the path to the language resources for the language pair
- AUTOBIL is the path to the binary bilingual dictionary for the language pair
- SCRIPTS denotes the path to the lex-tools scripts
- MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words
Finally, running make will generate the lexical selection rules for the specified language pair.
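The generated .lrx files use the ordinary lexical selection rule format of apertium-lex-tools, so they can be inspected, edited by hand, and recompiled with lrx-comp. A hypothetical rule of the general shape the extraction produces (lemmas and tags invented for illustration), selecting one translation of an ambiguous noun based on its context:

```xml
<rules>
  <!-- choose the translation "season" for the ambiguous noun "estación"
       when it is followed by "de" + "año" (as in "estación del año") -->
  <rule>
    <match lemma="estación" tags="n.*"><select lemma="season"/></match>
    <match lemma="de"/>
    <match lemma="año"/>
  </rule>
</rules>
```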