Learning rules from parallel and non-parallel corpora

Estimating rules using parallel corpora

It is always recommended to use a parallel corpus for any kind of machine translation training when such a resource is available. This section describes three methods for estimating lexical selection rules from a parallel corpus. We first describe the part of the training process that all three methods share, and then describe each method separately.

Prerequisites

The training methods rely on several software packages. First you will need to download and install (see the "Installing prerequisites" section below):

  • apertium and lttoolbox
  • apertium-lex-tools
  • GIZA++
  • the Moses decoder scripts

Furthermore you will also need:

  • an Apertium language pair
  • a parallel corpus (see Corpora)

Preparing the training files

CORPUS="Europarl3"
PAIR="es-pt"
SL="pt"
TL="es"
DATA="/home/philip/Apertium/apertium-es-pt"

LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"
SCRIPTS="$LEX_TOOLS/scripts"
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"
TRAINING_LINES=100000
BIN_DIR="/home/philip/Apertium/smt/bin"
crisphold=1

if [ ! -d data-$SL-$TL ]; then 
	mkdir data-$SL-$TL;
fi

# TAG CORPUS
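# Analyse and disambiguate each side of the corpus with the pair's tagger
# mode; apertium-pretransfer then prepares multiword lexical units for the
# steps that follow.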
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;

cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \
	| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;

N=$(wc -l "$CORPUS.$PAIR.$SL" | cut -d ' ' -f 1)  # total number of corpus lines (not used further in this snippet)


# REMOVE LINES WITH NO ANALYSES
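# paste joins the line numbers and both tagged sides into one tab-separated
# table; grep '<' keeps only rows containing at least one analysed lexical
# unit, and cut splits the surviving rows back into three parallel files.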
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f1 > data-$SL-$TL/$CORPUS.lines.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \
	| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \
	| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL
rm data-$SL-$TL/*.new


# CLEAN CORPUS
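# Moses' clean-corpus-n.perl drops sentence pairs with fewer than 1 or more
# than 40 tokens and writes the filtered pair to $CORPUS.tag-clean.$SL/$TL.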
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;

# ALIGN
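# train-model.perl runs GIZA++ in both directions and writes the symmetrised
# word alignments under model/ in the current directory (e.g.
# model/aligned.grow-diag-final-and); the -lm argument points at a prebuilt
# target-side language model, so adjust the path to your own setup.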
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \
 -f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1
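
After this step, each line of the .tagged files holds one sentence as a sequence of Apertium lexical units: the sed commands above replace spaces inside lexical units with ~ and reduce the material between units to a single space, so the aligner sees each lexical unit as one token. An illustrative (hypothetical) target-side line:

^casa<n><f><sg>$ ^blanco<adj><f><sg>$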

Installing prerequisites

See Minimal installation from SVN for apertium/lttoolbox.

See Constraint-based lexical selection module for apertium-lex-tools.

For GIZA++, the Moses decoder, etc., you can do:

$ mkdir ~/smt
$ cd ~/smt
$ mkdir local # our "install prefix"
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz
$ tar xzvf giza-pp-v1.0.7.tar.gz
$ cd giza-pp
$ make
$ mkdir ../local/bin
$ cp GIZA++-v2/snt2cooc.out ../local/bin/
$ cp GIZA++-v2/snt2plain.out ../local/bin/
$ cp GIZA++-v2/GIZA++ ../local/bin/
$ cp mkcls-v2/mkcls ../local/bin/
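$ # The Moses training scripts used above (clean-corpus-n.perl,
$ # train-model.perl) live in the mosesdecoder repository; cloning it into
$ # ~/smt matches the paths used on this page (URL as per the Moses site):
$ cd ~/smt
$ git clone https://github.com/moses-smt/mosesdecoder.git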

The clean-corpus-n.perl and train-model.perl scripts referred to above will then be in ~/smt/mosesdecoder/scripts/training/. See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.
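With this layout, the corresponding variables in the training script above would be set along these lines (an illustration assuming the ~/smt prefix used here):

MOSESDECODER="$HOME/smt/mosesdecoder/scripts/training"
BIN_DIR="$HOME/smt/local/bin"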

Estimating rules using non-parallel corpora

Prerequisites:

  • apertium-lex-tools (the Makefile below uses its multitrans tool and training scripts)
  • irstlm-ranker-frac and a binary target-side language model (the MODEL variable below)
  • a language pair whose modes include a tagger mode and a -multi mode; use a pair that already provides them as a reference on how to add these modes if they do not exist

Place the following Makefile in the folder where you want to run your training process:

CORPUS=setimes
DIR=sh-mk
DATA=/home/philip/Apertium/apertium-sh-mk/
AUTOBIL=sh-mk.autobil.bin
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools
THR=0
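
# Pipeline sketch: tag the corpus (.tagger), generate the ambiguous and the
# trimmed multi-translations with multitrans (.ambig, .multi-trimmed), score
# the alternatives with the target-side language model (.ranked), extract
# fractional counts (.freq), and turn the surviving n-gram patterns into
# lexical selection rules (.freq.lrx.bin, .patterns.lrx).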

#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@
 
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@

data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger
	cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked
	paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@
 
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py  data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).freq.lrx:  data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@
 
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@  
 
data/$(CORPUS).$(DIR).patterns.lrx:  data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@

In the same folder, place your source-side corpus file. The corpus file needs to be named "basename"."language-pair".txt; in the Makefile example above, the corpus file is named setimes.sh-mk.txt.

Set the Makefile variables as follows:

  • CORPUS denotes the base name of your corpus file
  • DIR stands for the language pair
  • DATA is the path to the language resources for the language pair
  • AUTOBIL is the path to the binary bilingual dictionary for the language pair
  • SCRIPTS denotes the path to the apertium-lex-tools scripts
  • MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
  • THR is the score threshold passed to ngrams-to-rules.py when the extracted patterns are converted into rules (0 in this example)

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
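
With the Makefile and corpus file in place, a training session might look like this (a sketch; the data/ file names follow the Makefile targets above):

$ ls
Makefile  setimes.sh-mk.txt
$ make
$ ls data/
setimes.sh-mk.ambig      setimes.sh-mk.freq.lrx.bin   setimes.sh-mk.patterns.lrx
setimes.sh-mk.annotated  setimes.sh-mk.multi-trimmed  setimes.sh-mk.ranked
setimes.sh-mk.freq       setimes.sh-mk.ngrams         setimes.sh-mk.tagger
setimes.sh-mk.freq.lrx   setimes.sh-mk.patterns

The compiled rule file (setimes.sh-mk.freq.lrx.bin) can then be loaded with lrx-proc; the .lrx files contain the rules in human-readable XML form.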