Generating lexical-selection rules from monolingual corpora

{{TOCD}}
This page describes how to generate lexical selection rules without relying on a parallel corpus.

==Prerequisites==

* [[apertium-lex-tools]]
* [[IRSTLM]]
* A language pair (e.g. apertium-br-fr)
** The language pair should have the following two modes (see apertium-mk-en/modes.xml for an example; a quick check is sketched below):
*** <code>-multi</code>, which runs all the modules after lexical transfer
*** <code>-pretransfer</code>, which runs all the modules up to lexical transfer
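
A minimal way to confirm that a pair actually defines these modes is to look for them in its modes.xml (the pair directory below is illustrative):

<pre>
grep '<mode name=' ~/source/apertium-en-es/modes.xml
# expect entries such as name="en-es-multi" and name="en-es-pretransfer"
</pre>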

==Annotation==

'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the [[#Makefiles|last section]] of this page.

This example uses the EuroParl corpus and the English to Spanish pair in Apertium.

Once everything is installed, the work proceeds as follows.

First, make sure that you have trained a language model for the target language; see [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM the IRSTLM manual].
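
A rough sketch of such a training run with IRSTLM, over the target-language side of the corpus (file names are illustrative; the exact options are described in the manual):

<pre>
add-start-end.sh < europarl.en > europarl.sb.en
build-lm.sh -i europarl.sb.en -n 5 -o europarl.ilm.gz
compile-lm europarl.ilm.gz en.blm
</pre>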

Take your corpus and make a tagged version of it:

<pre>
cat europarl.en-es.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged
</pre>
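
The tagged file is in the Apertium stream format after pretransfer, with one chosen analysis per word; a line looks something like this (output is illustrative):

<pre>
head -1 europarl.en-es.es.tagged
# e.g. ^el<det><def><m><sg>$ ^presidente<n><m><sg>$ ...
</pre>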

Make an ambiguous version of your corpus and trim redundant tags:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -f -t -n > europarl.en-es.es.ambig
</pre>
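
Each line now carries the source word together with every translation the bilingual dictionary offers for it, so an ambiguous word shows up along these lines (illustrative):

<pre>
head -1 europarl.en-es.es.ambig
# e.g. ^estación<n><f><sg>/station<n><sg>/season<n><sg>$
</pre>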

Next, generate all the possible disambiguation paths, again trimming redundant tags:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -m -f -t -n > europarl.en-es.es.multi-trimmed
</pre>

Translate and score all possible disambiguation paths:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -m -f -n |
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker \
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated
</pre>
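
As a rough sanity check (assuming irstlm-ranker emits one scored line per disambiguation path), the annotated file should line up with the multi-trimmed one:

<pre>
wc -l europarl.en-es.es.multi-trimmed europarl.en-es.es.annotated
</pre>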

Now we have a pseudo-parallel corpus where each possible translation is scored.
We start by extracting a frequency lexicon:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq
</pre>

Then we turn it into a set of default-translation rules:

<pre>
python3 ~/source/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx
</pre>

and compile them:

<pre>
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin
</pre>
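
The compiled file can presumably already be tried out over fresh biltrans output with lrx-proc (the -biltrans mode name is illustrative and depends on the pair):

<pre>
echo 'an example sentence' | apertium -d ~/source/apertium-en-es en-es-biltrans | lrx-proc europarl.en-es.freq.lrx.bin
</pre>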

From here on, there are two paths to choose from: we can extract rules using a maximum entropy classifier, or we can extract rules based only on the scores provided by irstlm-ranker.

== Direct rule extraction ==
With this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams
</pre>

Next, we prune the generated ngrams:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns
</pre>

Finally, we generate lexical selection rules, keeping only patterns whose irstlm-ranker score passes the threshold:

<pre>
crisphold=1
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx
</pre>

and compile them:

<pre>
lrx-comp patterns.lrx patterns.lrx.bin
</pre>

== Maximum entropy rule extraction ==
When extracting rules using a maximum entropy criterion, we first extract the features that will be fed to the classifier:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams
</pre>

We then train the classifiers, which, as a side effect, score how much each ngram contributes to a given translation. First we drop the events with zero scores, then feed the rest through yasmet (shipped with apertium-lex-tools):

<pre>
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium-lex-tools/yasmet > all-lambdas
</pre>

and merge the ngrams with the resulting weights:

<pre>
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all
</pre>

Next, we convert the weighted ngrams into rule candidates:

<pre>
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all
</pre>

trim them:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed
</pre>

and generate the lexical selection rules:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx
</pre>
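
As in the Makefile version below (which compiles its lm.xml output the same way), the generated rules can then be compiled with lrx-comp:

<pre>
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin
</pre>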

== Makefiles ==

=== Direct rule extraction ===
You can use this Makefile to generate rules using direct rule extraction.
Your corpus needs to be placed in the same folder as your Makefile.

<pre>
CORPUS=setimes
PAIR=mk-en
DATA=/home/philip/Apertium/apertium-mk-en
SL=mk
TL=en
TRAINING_LINES=10000
THR=1
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools

AUTOBIL=$(SL)-$(TL).autobil.bin
DIR=$(SL)-$(TL)

all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@

data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@
</pre>

The corpus file needs to be named "basename"."language-pair"."source side"; in this Makefile the corpus file is therefore named setimes.mk-en.mk.

Set the Makefile variables as follows:
* CORPUS denotes the base name of your corpus file
* PAIR stands for the language pair
* SL and TL stand for source language and target language
* DATA is the path to the language resources for the language pair
* SCRIPTS denotes the path to the lex-tools scripts
* LEX_TOOLS is the path to apertium-lex-tools
* MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
* TRAINING_LINES is the number of corpus lines used for training
* THR is the score threshold (the crisphold above) used when generating rules

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
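
Assuming the Makefile above is saved as Makefile next to the corpus file setimes.mk-en.mk, a run is then simply:

<pre>
make
# intermediate files are created under data/; the rules end up in
# data/setimes.mk-en.freq.lrx.bin and data/setimes.mk-en.patterns.lrx
</pre>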

=== Maximum entropy ===
You can use this Makefile to generate rules using maximum entropy classifiers.
Your corpus needs to be placed in the same folder as your Makefile.

<pre>
CORPUS=europarl
PAIR=en-es
DATA=/home/philip/source/apertium-en-es
SL=es
TL=en
MODEL=/home/philip/lm/en.blm
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts
LEX_TOOLS=/home/philip/source/apertium-lex-tools
THR=1
TRAINING_LINES=10000

AUTOBIL=$(SL)-$(TL).autobil.bin
DIR=$(SL)-$(TL)
YASMET=$(LEX_TOOLS)/yasmet

all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams

data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events
	cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed
	cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET) > $@

data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas
	python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@

data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all
	python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@

data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all
	cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@

data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed
	python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@

data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml
	lrx-comp $< $@
</pre>
The corpus file needs to be named "basename"."language-pair"."source side"; in this Makefile the corpus file is therefore named europarl.en-es.es.

Set the Makefile variables as follows:
* CORPUS denotes the base name of your corpus file
* PAIR stands for the language pair
* SL and TL stand for source language and target language
* DATA is the path to the language resources for the language pair
* SCRIPTS denotes the path to the lex-tools scripts
* LEX_TOOLS is the path to apertium-lex-tools
* MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
* TRAINING_LINES is the number of corpus lines used for training

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
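
As with the direct-extraction Makefile, running <code>make</code> builds everything; you can also build a single intermediate target, for example just the frequency lexicon (with the variables as configured above):

<pre>
make data/europarl.es-en.freq
</pre>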

[[Category:Lexical selection]]
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43867
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:19:38Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.'''<br />
<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43866
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:19:20Z
<p>Fpetkovski: </p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
'''<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43865
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:18:45Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
'''<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43864
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:14:55Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
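<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />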
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
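<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />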
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
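<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />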
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43863
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:14:28Z
<p>Fpetkovski: /* Maximum entropy */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
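<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />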
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
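<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />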
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
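<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />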
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43862
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:13:51Z
<p>Fpetkovski: /* Maximum entropy */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
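<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />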
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
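<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />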
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
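<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />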
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43861
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:12:35Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
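<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />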
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
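<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />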
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
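<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />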
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43860
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:12:10Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
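To try the compiled rules out, you can run them over the output of lexical transfer (a sketch only: it assumes the pair ships a biltrans debug mode, and lrx-proc is the rule processor from apertium-lex-tools):<br />
<br />
<pre><br />
echo "this is a test" | apertium -f none -d ~/source/apertium/apertium-en-es en-es-biltrans | lrx-proc patterns.lrx.bin<br />
</pre><br />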
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
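The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />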
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this Makefile to generate rules directly from the scores produced by irstlm-ranker.<br />
Your corpus needs to be placed in the same folder as your Makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named "basename"."language-pair"."source side".<br />
With the variables set as above, the corpus file would be named setimes.mk-en.mk.<br />
Set the Makefile variables as follows:<br />
* CORPUS denotes the base name of your corpus file<br />
* PAIR stands for the language pair<br />
* SL and TL stand for the source language and the target language<br />
* TRAINING_LINES is the number of corpus lines used for training<br />
* DATA is the path to the language resources for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* THR is the threshold passed to ngrams-to-rules.py when generating rules<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
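For example, with the variables set as above (a minimal sketch; it assumes GNU make and a corpus file setimes.mk-en.mk in the current directory):<br />
<br />
<pre><br />
make<br />
# the generated rules end up in data/setimes.mk-en.patterns.lrx;<br />
# compile them as before if a binary rule file is needed:<br />
lrx-comp data/setimes.mk-en.patterns.lrx data/setimes.mk-en.patterns.lrx.bin<br />
</pre><br />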
<br />
=== Maximum entropy ===<br />
You can use this Makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your Makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET) > $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
</pre><br />
The corpus file needs to be named "basename"."language-pair"."source side".<br />
As an illustration, in the Makefile example the corpus file is named europarl.en-es.es.<br />
Set the Makefile variables as follows:<br />
* CORPUS denotes the base name of your corpus file<br />
* PAIR stands for the language pair<br />
* SL and TL stand for the source language and the target language<br />
* TRAINING_LINES is the number of corpus lines used for training<br />
* DATA is the path to the language resources for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
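For example, with the variables set as above (a minimal sketch; it assumes GNU make and a corpus file europarl.en-es.es in the current directory):<br />
<br />
<pre><br />
make<br />
# DIR is $(SL)-$(TL), i.e. es-en, so the generated rules end up in data/europarl.es-en.lm.xml<br />
</pre><br />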
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43859
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:04:42Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
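The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />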
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43858
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:04:13Z
<p>Fpetkovski: /* Rule-extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
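The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />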
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43857
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:47:23Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43856
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:45:28Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43855
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:42:49Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43854
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:29:12Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
'''Important:'''<br />
If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses, and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
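For example, a dummy LM file can be made like this (a sketch; the path just has to match the -lm argument in the command above):<br />
<br />
<pre><br />
$ mkdir -p /home/fran/corpora/europarl<br />
$ echo "this is a dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />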
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where data-en-es/europarl.biltrans-candidates.en-es contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which the rules will be extracted.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether this line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
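The resulting .lrx file contains roughly one rule per pattern. As an illustrative sketch of the format (the element names follow apertium-lex-tools; the lemmas are taken from the example patterns above), a rule selecting lenguaje after plain might look like:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />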
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43853
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:28:39Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
'''Important:'''<br />
'''If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.'''<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses, and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
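For instance, to see which translation will be picked as the default for a given source word, you can pull its entries out of the lexicon (a quick sketch using the format shown above):<br />
<br />
<pre><br />
$ grep ' union<n> ' data-en-es/europarl-en-es.lex.en-es | sort -nr | head -1<br />
31381 union<n> unión<n> @<br />
</pre><br />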
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
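When these patterns are later turned into rules (see [[#Generate rules|Generate rules]] below), a <code>+</code> line such as the ''plain language'' one above corresponds roughly to an lrx rule of this shape (a sketch; the exact output of <code>ngrams-to-rules.py</code> may differ):<br />
<br />
<pre><br />
<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />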
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
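To put the rules to use, compile them with <code>lrx-comp</code> from [[apertium-lex-tools]] and point the lexical-selection step of your pair at the resulting binary (a sketch; the file name your pair's modes expect may differ):<br />
<br />
<pre><br />
$ lrx-comp data-en-es/europarl-en-es.ngrams.en-es.lrx data-en-es/europarl-en-es.ngrams.en-es.lrx.bin<br />
</pre><br />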
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
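Save it as <code>Makefile</code>, adjust the paths at the top, and run <code>make</code> from the directory holding the corpus files. Since these are ordinary make variables, they can also be overridden on the command line, e.g.:<br />
<br />
<pre><br />
$ make TRAINING_LINES=500000 crisphold=1.5<br />
</pre><br />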
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43852
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:28:24Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
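If everything built correctly, the aligner binaries should now be in the install prefix:<br />
<br />
<pre><br />
$ ls ~/smt/local/bin<br />
GIZA++  mkcls  snt2cooc.out  snt2plain.out<br />
</pre><br />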
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
Important:<br />
'''If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.'''<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
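The <code>~</code> replacement just keeps multiword units together as single tokens for the aligner; for example (an illustration):<br />
<br />
<pre><br />
^in addition to<pr>$   →   ^in~addition~to<pr>$<br />
</pre><br />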
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43851
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:24:48Z
<p>Fpetkovski: /* Prerequisites */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
Take your corpus and run it through the lexical transfer:<br />
<br />
<pre><br />
cat $(CORPUS).$(DIR).txt | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-pretransfer | lt-proc -b $(DATA)/$(AUTOBIL) > $@<br />
</pre><br />
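The <code>-b</code> flag makes <code>lt-proc</code> output ambiguous lexical transfer, i.e. each word carries all of its possible translations (schematically):<br />
<br />
<pre><br />
^language<n>/lengua<n>/lenguaje<n>$<br />
</pre><br />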
<br />
Then select only the lines which have more than one and fewer than 10,000 translations, which contain an ambiguous noun/verb/adjective, and which have >= 90% coverage of the morphology.<br />
<br />
<pre><br />
cat $< | python3 $(SCRIPTS)/trim-fertile-lines.py | python3 $(SCRIPTS)/biltrans-line-only-pos-ambig.py | python3 $(SCRIPTS)/biltrans-trim-uncovered.py > $@<br />
</pre><br />
<br />
Generate all the possible disambiguation paths:<br />
<br />
<pre><br />
cat $< | python $(SCRIPTS)/biltrans-to-multitrans-line-recursive.py > $@<br />
</pre><br />
<br />
Translate all possible disambiguation paths:<br />
<br />
<pre><br />
cat $< | apertium -f none -d $(DATA) $(DIR)-multi > $@<br />
</pre><br />
<br />
Score all the possible disambiguation paths with IRSTLM.<br />
<br />
<pre><br />
<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43795
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:19:40Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43794
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:15:44Z
<p>Fpetkovski: /* Process script */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency. A small parsing sketch follows.<br />
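<br />
For readers who want to post-process this output, here is a minimal parser (the field layout is inferred from the example lines above; the real scripts may differ in details):<br />
<br />
<pre><br />
# Illustrative parser for one pattern line of the format shown above.<br />
def parse_pattern(line):<br />
    fields = line.split()<br />
    sign, sl_word = fields[0][0], fields[0][1:]  # '+' or '-', then the SL word<br />
    freq = int(fields[-1])                       # trailing frequency count<br />
    translation = fields[-2]                     # the TL lemma being selected<br />
    context = fields[1:-2]                       # the n-gram context itself<br />
    return sign, sl_word, context, translation, freq<br />
</pre><br />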
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words; a minimal sketch follows.<br />
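<br />
For example (a minimal sketch; it assumes, as in the sample output above, that conjunctions carry the <cnjcoo> tag and unknown words are prefixed with *; the filtered file name is just a suggestion):<br />
<br />
<pre><br />
# Minimal filtering sketch: drop pattern lines with conjunctions or unknowns.<br />
def keep_pattern(line):<br />
    return '<cnjcoo>' not in line and '*' not in line<br />
<br />
with open('data-en-es/europarl.ngrams.en-es') as src, \<br />
     open('data-en-es/europarl.ngrams.filtered.en-es', 'w') as dst:<br />
    dst.writelines(line for line in src if keep_pattern(line))<br />
</pre><br />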
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
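<br />
To give an idea of the result (an illustrative sketch only: the rule below follows the general shape of lexical-selection rules, but the exact output of ngrams-to-rules.py may differ), a <code>+</code> pattern such as <code>+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3</code> would become a rule along these lines:<br />
<br />
<pre><br />
# Hypothetical example of the kind of rule generated from a '+' pattern.<br />
rule = '''<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
  <match lemma="," tags="cm"/><br />
</rule>'''<br />
print(rule)<br />
</pre><br />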
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43793
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:14:46Z
<p>Fpetkovski: /* Generate rules */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43792
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:14:22Z
<p>Fpetkovski: /* Generate patterns */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43791
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:13:51Z
<p>Fpetkovski: /* Extract frequency lexicon */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43790
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:12:21Z
<p>Fpetkovski: /* Extract bilingual dictionary candidates */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, just create a file with a few words in it. We won't be using the LM anyway.<br />
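<br />
Following that advice, something as simple as the following is enough; the path only has to match the <code>-lm</code> argument above:<br />
<br />
<pre><br />
$ echo "dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />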
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
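<br />
Each line of the resulting phrasetable file holds the two tokenised sides of a sentence pair plus their word alignment, separated by <code>|||</code>; the alignment field is the usual Moses format, pairs of 0-based token indices. An illustrative (made-up) line:<br />
<br />
<pre><br />
two<num> language<n> ||| dos<num> lengua<n> ||| 0-0 1-1<br />
</pre><br />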
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentence pairs that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidate entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
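<br />
Since the defaults are exactly the <code>@</code>-marked lines, a quick way to inspect them is to pull those lines out and sort by the count in the first column (this assumes the <code>@</code> is the last character on the line):<br />
<br />
<pre><br />
$ grep '@$' data-en-es/europarl.lex.en-es | sort -nr | head<br />
</pre><br />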
<br />
===Generate patterns===<br />
<br />
Now we extract the n-grams from which we are going to generate the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python3 ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
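<br />
There is no fixed recipe for this step. As a minimal sketch: conjunctions carry the <code>cnjcoo</code> tag and unknown words are prefixed with <code>*</code> in the patterns above, so two <code>grep -v</code> passes are enough to drop both kinds of pattern:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' data-en-es/europarl.ngrams.en-es | grep -v '\*' > data-en-es/europarl.ngrams.filtered.en-es<br />
</pre><br />
<br />
If you do filter, feed the filtered file to the next step in place of the unfiltered one.<br />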
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
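<br />
The output is a rule file for the [[Constraint-based lexical selection module]]. As a rough, schematic illustration (not verbatim script output), the <code>plain language → lenguaje</code> pattern above would come out as something like:<br />
<br />
<pre><br />
<rules><br />
  <rule><br />
    <match lemma="plain" tags="adj"/><br />
    <match lemma="language" tags="n"><br />
      <select lemma="lenguaje" tags="n"/><br />
    </match><br />
    <match lemma="," tags="cm"/><br />
  </rule><br />
</rules><br />
</pre><br />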
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python3 $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
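# (optional; as sketched in the "Filter rules" section above, e.g.<br />
#  grep -v '<cnjcoo>' $CORPUS.ngrams.$SL-$TL > $CORPUS.ngrams.filtered.$SL-$TL)<br />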
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
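<br />
To use it, adjust the variables at the top to your own paths, corpus and language pair, save it under any name (say <code>extract-rules.sh</code>; the filename is just an example) and run it from the directory containing the corpus:<br />
<br />
<pre><br />
$ bash extract-rules.sh > extract-rules.log 2>&1<br />
</pre><br />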
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43789
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:10:09Z
<p>Fpetkovski: /* Extract sentences */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl-en-es.phrasetable.en-es data-en-es/europarl-en-es.biltrans-tok.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43788
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:02:54Z
<p>Fpetkovski: /* Align corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43787
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:00:06Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43786
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:58:39Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing data-en-es/europarl-en-es.tagged.new.es & .en to data-en-es/europarl-en-es.tag-clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
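<br />
For example, to create a dummy LM at the path used in the command above:<br />
<br />
<pre><br />
$ mkdir -p /home/fran/corpora/europarl<br />
$ echo "this is a dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />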
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, in essence, the sentences that we can hope Apertium will be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
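<br />
Since the default translations all carry the <code>@</code> marker, you can pull them out with a one-liner. A minimal sketch, assuming the marker is always the last character on the line, as in the sample above:<br />
<br />
<pre><br />
$ grep '@$' europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
4 slope<n> pendiente<n> @<br />
</pre><br />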
<br />
===Generate patterns===<br />
<br />
Now we extract the ngrams that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions or rules with unknown words, as in the sketch below.<br />
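<br />
A minimal sketch: drop every pattern that contains a coordinating conjunction tag or an unknown word (unknown words are prefixed with <code>*</code>, as in the sample above), and write the result to a new file. Adapt the expressions to whatever else you want to filter out:<br />
<br />
<pre><br />
$ grep -v -e '<cnjcoo>' -e '\*' europarl.ngrams.en-es > europarl.ngrams-filtered.en-es<br />
</pre><br />
<br />
If you filter, pass the filtered file to <code>ngrams-to-rules.py</code> in the next step instead of <code>europarl.ngrams.en-es</code>.<br />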
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43785
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:58:19Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g'> data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43784
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:57:42Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g'> data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43783
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:52:00Z
<p>Fpetkovski: /* Extract sentences */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g'> europarl.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43782
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:44:47Z
<p>Fpetkovski: /* Align corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
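<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />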
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
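<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />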
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
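<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />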
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43781
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:43:45Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
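<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />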
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
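<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />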
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
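<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />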
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43780
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:40:51Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses:<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
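<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />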
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
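<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />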
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
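<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />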
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43779
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:40:04Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses:<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
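<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />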
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
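<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />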
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
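<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />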
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43753
Generating lexical-selection rules from a parallel corpus
2013-09-19T14:35:57Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to number the lines and remove those with no analyses (afterwards, move each <code>.new</code> file over the original, e.g. <code>mv europarl.tagged.en.new europarl.tagged.en</code>):<br />
<br />
<pre><br />
$ seq 1 `wc -l < europarl.tagged.en` > europarl.lines<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new<br />
</pre><br />
<br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl-v7.es-en es en europarl.clean 1 40<br />
clean-corpus.perl: processing europarl-v7.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
Then run the English side through the lexical transfer:<br />
<br />
<pre><br />
$ nohup cat europarl.tagged.en | lt-proc -b ~/source/apertium-en-es/en-es.autobil.bin > europarl.biltrans.en-es &<br />
</pre><br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.lines > testing/europarl.67658.lines<br />
$ tail -67658 europarl.tagged.en > testing/europarl.tagged.67658.en<br />
$ tail -67658 europarl.tagged.es > testing/europarl.tagged.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.lines > europarl.lines.new<br />
$ head -1400000 europarl.tagged.en > europarl.tagged.en.new<br />
$ head -1400000 europarl.tagged.es > europarl.tagged.es.new<br />
$ head -1400000 europarl.biltrans.en-es > europarl.biltrans.en-es.new<br />
$ mv europarl.lines.new europarl.lines<br />
$ mv europarl.tagged.en.new europarl.tagged.en<br />
$ mv europarl.tagged.es.new europarl.tagged.es<br />
$ mv europarl.biltrans.en-es.new europarl.biltrans.en-es<br />
</pre><br />
<br />
These files are:<br />
<br />
* <code>europarl.lines</code>: The list of lines included in the corpus from the original cleaned corpus.<br />
* <code>europarl.tagged.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tagged.es</code>: The tagged target language side of the corpus<br />
* <code>europarl.biltrans.en-es</code>: The output of the lexical transfer SL→TL<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.biltrans.en-es<br />
1400000 europarl.lines<br />
1400000 europarl.tagged.en<br />
1400000 europarl.tagged.es<br />
5600000 total<br />
</pre><br />
<br />
The next step is to tokenise these into a format appropriate for Moses. We also do some tag trimming here so<br />
that we can use the correct tags when generating lexical rules and bidix entries. For this we can use <code>process-tagger-output</code> from the apertium-lex-tools directory.<br />
<br />
<br />
<pre><br />
$ nohup cat europarl.tagged.es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/es-en.autobil.bin -p -t > europarl.tag-tok.es &<br />
$ nohup cat europarl.tagged.en | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/en-es.autobil.bin -p -t > europarl.tag-tok.en &<br />
$ nohup cat europarl.biltrans.en-es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/en-es.autobil.bin -b -t > europarl.biltrans-tok.en-es &<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-tok \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could be expected to generate.<br />
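<br />
For illustration, here is a minimal Python sketch of the idea (not the actual <code>extract-sentences.py</code>; the data structures and names are invented for the example):<br />
<br />
<pre><br />
# Keep a sentence pair only if some ambiguous source word is aligned<br />
# to a target word that the bilingual dictionary offers as a translation.<br />
def keep_sentence(alignments, bidix):<br />
    # alignments: list of (source word, aligned target word) pairs<br />
    # bidix: source word -> set of possible translations<br />
    for source_word, target_word in alignments:<br />
        options = bidix.get(source_word, set())<br />
        if len(options) > 1 and target_word in options:<br />
            return True<br />
    return False<br />
<br />
bidix = {'union<n>': {'unión<n>', 'sindicato<n>'}}<br />
print(keep_sentence([('union<n>', 'sindicato<n>')], bidix))  # True<br />
</pre><br />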
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
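<br />
For example, a small Python sketch (illustrative only; the filename and whitespace-separated field layout assume the format shown above) that collects the default translations:<br />
<br />
<pre><br />
# Build a map from each source word to its @-marked default translation.<br />
defaults = {}<br />
with open('europarl.lex.en-es') as f:<br />
    for line in f:<br />
        fields = line.split()<br />
        if len(fields) >= 4 and fields[-1] == '@':<br />
            count, source, target = fields[0], fields[1], fields[2]<br />
            defaults[source] = target<br />
<br />
print(defaults.get('union<n>'))  # expected: unión<n><br />
</pre><br />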
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and its frequency.<br />
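<br />
As an illustration, a short Python sketch (assuming the whitespace-separated layout shown above) that decomposes one such line:<br />
<br />
<pre><br />
line = "-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2"<br />
sign = line[0]                # '-' default, '+' non-default translation<br />
fields = line[1:].split()<br />
source_word = fields[0]       # the ambiguous source word<br />
pattern = fields[1:-2]        # the context n-gram<br />
translation = fields[-2]<br />
frequency = int(fields[-1])<br />
print(sign, source_word, pattern, translation, frequency)<br />
</pre><br />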
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules that contain conjunctions or unknown words, as in the sketch below.<br />
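<br />
A minimal Python sketch of such a filter (the output filename is invented; unknown words are the tokens marked with <code>*</code> in the patterns above):<br />
<br />
<pre><br />
# Drop candidate patterns containing a conjunction tag or an unknown word.<br />
with open('europarl.ngrams.en-es') as inp, \<br />
     open('europarl.ngrams.filtered.en-es', 'w') as out:<br />
    for line in inp:<br />
        if '<cnjcoo>' in line or '*' in line:<br />
            continue<br />
        out.write(line)<br />
</pre><br />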
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=User:Fpetkovski/GSOC_2013_Application_-_Improving_the_lexical_selection_module&diff=43637
User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module
2013-09-08T10:09:29Z
<p>Fpetkovski: /* Work to do */</p>
<hr />
<div>The lexical selection module in Apertium is currently a prototype. There are many optimisations that could be made to make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but the scripts are not particularly well written. Part of the task will be to rewrite the scripts taking into account all possible corner cases.<br />
<br />
The project idea is located [[Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module|here]].<br />
<br />
== Personal Info ==<br />
<br />
First name: Filip <br /><br />
Last name: Petkovski <br /><br />
Email: filip.petkovsky@gmail.com <br /><br />
IRC: fpetkovski on #apertium <br /><br />
<br />
== Why are you interested in machine translation? ==<br />
<br />
Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of numerous techniques from both computer science and linguistics.<br />
<br />
== Why is it that you are interested in the Apertium project? ==<br />
<br />
Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend<br />
of rule-based and corpus-based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.<br />
<br />
== Why should Google and Apertium sponsor it? ==<br />
<br />
Lexical selection is the task of deciding which word to use in a given context.<br />
A good lexical selection module can significantly increase translation quality, <br />
and give machine translation a more human-like feel.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I'm interested in [[Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module|Improving the lexical selection module]].<br />
<br />
I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.<br />
<br />
== Work already done ==<br />
<br />
* Generate lexical selection rules from a parallel corpus for the sh-mk language pair (submitted to SVN)<br />
* Generate additional bidix entries from a parallel corpus for the sh-mk language pair (submitted to SVN)<br />
* Participant in last year's GSoC (corpus-based feature transfer)<br />
<br />
== Work to do ==<br />
<br />
'''Community bonding period:'''<br />
* <s>go through the training process for monolingual rule extraction</s><br />
* <s>go through the training process for MaxEnt rule extraction (monolingual/parallel)</s><br />
* <s>document the results</s><br />
<br />
'''Week 1:'''<br /><br />
* <s>Update the instructions on the wiki</s><br />
* <s>Remove unused and redundant scripts.</s> (prefixed with unused. )<br />
* <s>Do proper processing of tags in all scripts.</s> (fixed with FSTProcessor::biltransWithoutQueue)<br />
* <s>Fix tokenization</s>. (fixed in scripts/common.py with tokenize_biltrans_line)<br />
* <s>Make sure that capitalisation, any tag and any character work as expected (fixed in tokenization).</s><br />
* <s>Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < ></s> (fixed with tokenization)<br />
'''Week 2:'''<br /><br />
* <s>Script/program for finding possibly missing bidix entries from an aligned parallel corpus. </s><br />
* <s>Make sure that <match lemma="*" tags="*"/> works the same as <match/> </s><br />
* <s> <match/> doesn't match an LU when the lemma is , </s><br />
* Fix bug10 in the testing dir.<br />
'''Week 3:'''<br /><br />
* <s> Merge the four different implementations of irstlm_ranker into a single implementation </s><br />
* <s> add option to the ranker which marks translations which fall outside of xx% of the probability mass for a given sentence <code>|@| |+| |-|</code> </s><br />
* <s> Move lex-learner to lex-tools </s><br />
* <s>Run through and document new training process with a language pair (mk-en, br-fr, or en-es) </s><br />
* <s> Demonstrate bidix extraction script with a language pair (e.g. es-pt) </s><br />
'''Week 4-6:'''<br /><br />
* Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modular. Having a 650-line method is not the right thing. <br />
* Work on a way to trim non-significant features from the maximum-entropy models.<br />
** probability mass: discard features which fall outside of xx% of the probability mass, e.g. 80%, should be configurable<br />
** outcome pruning: discard features that select a translation which can never win: e.g. the sum of the weights of all the contexts where it appears never adds up to more than the sum of the weights of all the other translations<br />
* <s> Implement poor-man's alignment: instead of using giza++, use tagged corpora and look up to see if the equivalent word appears. </s><br />
'''Weeks 7-9:'''<br />
* ...<br />
'''Week 9-10:'''<br />
* Apply the model to different language pairs and generate lexical selection rules and bidix entries.<br />
** eu-es, es-fr, es-pt, mk-en, br-fr, en-es<br />
'''Week 11-12:''' <br />
* Wrap up / writing paper<br />
<br />
== Skills, qualifications and field of study ==<br />
<br />
I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.<br />
<br />
<br />
Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named-entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising a model, and with feature selection and feature extraction for classification.<br />
I did my bachelor thesis in the field of computer vision, and my master thesis is in the field of natural language processing.<br />
<br />
<br />
I have also taken part in last year's GSoC, and have in addition worked on the sh-mk and sh-en language pairs.<br />
<br />
== Non-GSoC activities ==<br />
<br />
My Master's thesis is due 28.6 but I intend to focus on it intensively before the coding period starts (27.5).<br />
<br />
I also might be moving to the United States for a few months as a part of a work and travel programme, so I might be offline for a couple of days around 10.6.<br />
<br />
[[Category:GSoC 2013 Student proposals|Fpetkovski]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43322
Learning rules from parallel and non-parallel corpora
2013-08-19T13:47:09Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that number<br />
with the default translation from the frequency lexicon. <br />
It then decides whether to create a rule with the given translation, or to leave the default<br />
translation.<br />
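<br />
As a sketch of this decision (with invented counts; the real logic lives in the scripts below):<br />
<br />
<pre><br />
# Create a rule for a context only if some translation occurs at least<br />
# 'crisphold' times as often as the default translation in that context.<br />
crisphold = 1.5<br />
default = 'lengua<n>'                                 # from the frequency lexicon<br />
context_counts = {'lengua<n>': 2, 'lenguaje<n>': 4}   # counts in one context<br />
<br />
for translation, count in context_counts.items():<br />
    if translation != default and count >= crisphold * context_counts.get(default, 0):<br />
        print('create rule selecting', translation)<br />
</pre><br />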
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Where ''''crisphold'''' is a variable which determines how many times more often an individual translation must occur than the default translation for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual ngram<br />
a weight with which it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes how many times a certain context should occur for it to be taken into account.<br />
<br />
=== Poor man's alignment ===<br />
<br />
When using a large corpus, aligning tokens with Giza can be very slow.<br />
For that reason, we can estimate pairwise and ngram counts directly by relaxing the co-occurrence criteria used by Giza.<br />
<br />
For each possible translation of an ambiguous word, we add one if the translation occurs anywhere in the target sentence of the parallel corpus.<br />
<br />
A script for learning rules with maximum likelihood is given below:<br />
<br />
<pre><br />
<br />
</pre><br />
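<br />
Until that script is filled in, the counting idea can be sketched in a few lines of Python (illustrative only; the function and data structures are invented):<br />
<br />
<pre><br />
from collections import Counter<br />
<br />
counts = Counter()<br />
<br />
def update_counts(candidates, target_sentence):<br />
    # candidates: (source word, possible translation) pairs for one sentence<br />
    # add one whenever the translation occurs anywhere in the target sentence<br />
    target_tokens = set(target_sentence.split())<br />
    for source_word, translation in candidates:<br />
        if translation in target_tokens:<br />
            counts[(source_word, translation)] += 1<br />
<br />
update_counts([('language<n>', 'lengua<n>'), ('language<n>', 'lenguaje<n>')],<br />
              'el<det> parlamento<n> lengua<n>')<br />
print(counts)<br />
</pre><br />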
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder also place your source side corpus file. The corpus file needs to be named as "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* THR is the threshold passed to ngrams-to-rules.py (the ''crisphold'' value described earlier)<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43312
Learning rules from parallel and non-parallel corpora
2013-08-19T07:58:09Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that number<br />
with the default translation from the frequency lexicon. <br />
It then decides whether to create a rule with the given translation, or to leave the default<br />
translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Where ''''crisphold'''' is a variable which determines how many times more often an individual translation must occur than the default translation for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual ngram<br />
a weight with which it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes how many times a certain context should occur for it to be taken into account.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder also place your source side corpus file. The corpus file needs to be named as "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* THR is the threshold passed to ngrams-to-rules.py (the ''crisphold'' value described earlier)<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43311
Learning rules from parallel and non-parallel corpora
2013-08-19T07:57:18Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
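<br />
As a rough sketch, the decision amounts to a count-ratio test like the following (a toy illustration with hypothetical names, not the actual ngram-count-patterns.py internals):<br />
<br />
<pre><br />
# Toy sketch of the maximum-likelihood decision for one ambiguous word.<br />
def keep_or_rule(context_counts, default_translation, crisphold=1.5):<br />
    # context_counts: {translation: count} observed in one context<br />
    best = max(context_counts, key=context_counts.get)<br />
    default_count = context_counts.get(default_translation, 0)<br />
    # create a rule only if the best translation beats the default by the threshold<br />
    if best != default_translation and context_counts[best] >= crisphold * max(default_count, 1):<br />
        return best<br />
    return default_translation<br />
<br />
print(keep_or_rule({'banco': 3, 'orilla': 9}, 'banco'))  # -> 'orilla'<br />
</pre><br />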
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes the minimum number of times a context must occur for it to be taken into account.<br />
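<br />
Conceptually, the learned lambda weights are then applied as follows (a toy sketch with made-up n-grams and weights, not the actual lrx-proc runtime):<br />
<br />
<pre><br />
# Toy sketch of maximum-entropy scoring: each matching context n-gram<br />
# contributes its weight to the candidate translation it supports.<br />
lambdas = {<br />
    ('river', 'bank'): {'orilla': 1.3, 'banco': -0.2},<br />
    ('bank', 'account'): {'banco': 1.7, 'orilla': -0.9},<br />
}<br />
<br />
def choose(context_ngrams, candidates):<br />
    scores = {c: 0.0 for c in candidates}<br />
    for ng in context_ngrams:<br />
        for translation, weight in lambdas.get(ng, {}).items():<br />
            if translation in scores:<br />
                scores[translation] += weight<br />
    return max(scores, key=scores.get)<br />
<br />
print(choose([('river', 'bank')], ['banco', 'orilla']))  # -> 'orilla'<br />
</pre><br />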
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
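# translate every disambiguation path and score it with the target-side language model<br />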
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43310
Learning rules from parallel and non-parallel corpora
2013-08-19T07:56:56Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes the minimum number of times a context must occur for it to be taken into account.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43309
Learning rules from parallel and non-parallel corpora
2013-08-19T07:55:02Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43308
Learning rules from parallel and non-parallel corpora
2013-08-19T07:54:12Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43307
Learning rules from parallel and non-parallel corpora
2013-08-19T07:41:11Z
<p>Fpetkovski: /* Training */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
# replace blanks inside tokens with '~' and re-separate tokens with single spaces<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS (drop sentence pairs shorter than 1 or longer than 40 tokens)<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS: the base name of the corpus files (here "Europarl3")<br />
* PAIR: the language pair of the corpus, as used in the corpus file names<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to the Moses training scripts<br />
* TRAINING_LINES: the number of corpus lines used for training (see the example run below)<br />
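<br />
For example, if you save the script above as prepare-training.sh (a file name chosen here purely for illustration) and run it with the variable values shown, you should end up with roughly the following files:<br />
<pre><br />
$ bash prepare-training.sh<br />
$ ls data-pt-es/<br />
Europarl3.lines         Europarl3.tagged.es     Europarl3.tagged.pt<br />
Europarl3.tag-clean.es  Europarl3.tag-clean.pt<br />
</pre><br />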
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ # if the googlecode URL is unavailable, giza-pp is also mirrored at https://github.com/moses-smt/giza-pp<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries end up in a single folder, which is what the Moses training script expects (it is passed as -external-bin-dir below).<br />
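<br />
You can verify the result; these are exactly the four binaries copied above:<br />
<pre><br />
$ ls ~/smt/local/bin<br />
GIZA++  mkcls  snt2cooc.out  snt2plain.out<br />
</pre><br />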
<br />
==== Alignment ====<br />
Giza++ is used to obtain a word alignment between the tokens of the parallel sentences in the two corpora. Depending on the size of the corpora, the alignment process can be slow.<br />
<br />
Once alignment is done, the tokens' tags are trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Finally, bilingual transfer output is generated from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
<pre><br />
BIN_DIR="$HOME/smt/local/bin" # the folder where the Giza binaries were installed above<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1 # adjust this hard-coded -lm path to a language model file on your machine<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS (restrict each token's tags to those known to the bilingual dictionary)<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
# ambiguous biltrans output (-b) for the source side, used to spot ambiguous sentences and missing bidix entries<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
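<br />
Each line of the resulting phrasetable file holds one sentence pair and its word alignment as three fields separated by " ||| " (a schematic line, not real output):<br />
<pre><br />
^token<tags>$ ^token<tags>$ ... ||| ^token<tags>$ ... ||| 0-0 1-1 2-3<br />
</pre><br />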
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual); a sketch of the commands is given after this list<br />
* The language pair must support the pretransfer and multi modes; see apertium-sh-mk/modes.xml<br />
as a reference on how to add these modes if they do not exist.<br />
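<br />
A rough sketch of how such a binary model can be built with IRSTLM (flag names differ between IRSTLM versions, so check them against the manual; the corpus and model file names here follow the Makefile below):<br />
<pre><br />
add-start-end.sh < setimes.mk.txt > setimes.mk.se<br />
build-lm.sh -i setimes.mk.se -n 5 -o setimes.mk.ilm.gz<br />
compile-lm setimes.mk.ilm.gz setimes.mk.5.blm<br />
</pre><br />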
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
# tag the corpus; the sed expression makes sure every line ends in a full stop<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
# ambiguous version of the corpus (-b), with redundant tags trimmed (-t)<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -b -t > $@<br />
<br />
# all possible disambiguation paths (-m), with redundant tags trimmed<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m -t > $@<br />
<br />
# translate every disambiguation path and score it with the target-side language model<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
# extract a fractional frequency lexicon from the scored translations<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
# turn the frequency lexicon into rules for the most frequent translations, and compile them<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
# count context n-grams around ambiguous words, prune them, and turn them into rules<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source side corpus file. The corpus file needs to be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, that is setimes.sh-mk.txt. Note that if you copy the Makefile from this page, each recipe line must be indented with a real tab character.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the file name of the binary bilingual dictionary for the language pair, resolved relative to DATA<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools itself<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* THR is the score threshold passed to ngrams-to-rules.py when generating the final rules (see the example run below)<br />
<br />
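To run the whole training pipeline, invoke make in that folder. With the example variable values, a successful run should leave roughly the following in data/, including the two rule files targeted by all:<br />
<pre><br />
$ make<br />
$ ls data/<br />
setimes.sh-mk.ambig          setimes.sh-mk.freq.lrx.bin    setimes.sh-mk.patterns.lrx<br />
setimes.sh-mk.annotated      setimes.sh-mk.multi-trimmed   setimes.sh-mk.ranked<br />
setimes.sh-mk.freq           setimes.sh-mk.ngrams          setimes.sh-mk.tagger<br />
setimes.sh-mk.freq.lrx       setimes.sh-mk.patterns<br />
</pre><br />
<br />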
Finally, executing the Makefile in this way generates the lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43306
Learning rules from parallel and non-parallel corpora
2013-08-19T07:39:22Z
<p>Fpetkovski: /* Training */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43305
Learning rules from parallel and non-parallel corpora
2013-08-19T07:38:00Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43304
Learning rules from parallel and non-parallel corpora
2013-08-19T07:27:38Z
<p>Fpetkovski: /* Learning rules with Giza */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43303
Learning rules from parallel and non-parallel corpora
2013-08-19T07:27:03Z
<p>Fpetkovski: /* Preparing the training files */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43302
Learning rules from parallel and non-parallel corpora
2013-08-19T07:22:41Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes several (3) methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts which you can install using the following script<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
The parallel corpus is processed in such a way that <br />
the training files (the source and target side corpus) are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed and blank within tokens are replaced with a new character<br />
since Giza tokenizes a sentence by splitting on white space.<br />
<br />
Finally, both files are cleaned using a moses training script so that Giza will not <br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
=== Installing Giza ===<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, the corpus file is therefore named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the binary bilingual dictionary (relative to DATA) for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, running the Makefile generates the lexical selection rules for the specified language pair.<br />
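<br />
The resulting .patterns.lrx file can be compiled for use with the pair in the same way as the frequency rules above:<br />
<pre><br />
lrx-comp data/setimes.sh-mk.patterns.lrx data/setimes.sh-mk.patterns.lrx.bin<br />
</pre></div>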
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43301
Learning rules from parallel and non-parallel corpora
2013-08-19T07:09:31Z
<p>Fpetkovski: /* Preparing the training files */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules from a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then describe each of the<br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
=== Preparing the training files ===<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
crisphold=1<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
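# (paste the line numbers against both tagged files; grep '<' keeps only rows<br />
# where at least one analysis survived, and cut recovers each column)<br />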
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
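# normalise the surviving lines: turn every space into '~' so spaces inside<br />
# multiword units are protected, then rewrite each '$'-to-'^' stretch as a<br />
# single space, discarding unanalysed material between lexical units<br />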
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
# ALIGN<br />
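# note: -lm takes factor:order:file:type; for rule extraction only the word<br />
# alignments produced by this step are used, so any existing target-side ARPA<br />
# LM should do here<br />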
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
</pre><br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For GIZA++, the Moses decoder, etc., you can do:<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to above will then be found under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual).<br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml for a reference on how to add these modes if they do not exist; a quick check is sketched after this list.<br />
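<br />
A quick way to check that both modes exist once the pair is compiled (a sketch, assuming the usual layout where compiled modes live under modes/):<br />
<pre><br />
ls /home/philip/Apertium/apertium-sh-mk/modes | grep -E 'pretransfer|multi'<br />
</pre><br />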
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
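<br />
# Targets: tagger -> ambig & multi-trimmed -> ranked (scored by the target-side<br />
# LM) -> annotated -> freq -> freq.lrx(.bin) / ngrams -> patterns -> patterns.lrx;<br />
# THR is the score threshold handed to ngrams-to-rules.py<br />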
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/\([^\.]\)$$/\1./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, the corpus file is therefore named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the binary bilingual dictionary (relative to DATA) for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, running the Makefile generates the lexical selection rules for the specified language pair.<br />
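<br />
Once compiled, the rules can be tried directly on bilingual-transfer style input with lrx-proc (a sketch: the input below is a made-up ambiguous entry, and lrx-proc is assumed to be on your PATH from apertium-lex-tools):<br />
<pre><br />
lrx-comp data/setimes.sh-mk.patterns.lrx data/setimes.sh-mk.patterns.lrx.bin<br />
echo '^dummy<n>/one<n>/two<n>$' | lrx-proc data/setimes.sh-mk.patterns.lrx.bin<br />
</pre></div>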
Fpetkovski