Generating lexical-selection rules from monolingual corpora

{{TOCD}}
This page describes how to generate lexical selection rules without relying on a parallel corpus.

==Prerequisites==

* [[apertium-lex-tools]]
* [[IRSTLM]]
* A language pair (e.g. apertium-br-fr)
** The language pair should have the following two modes (see apertium-mk-en/modes.xml for an example; a quick check is sketched below):
*** <code>-multi</code>, which runs all the modules after lexical transfer
*** <code>-pretransfer</code>, which runs all the modules up to lexical transfer
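
A minimal way to confirm that a pair actually defines these modes is to look for them in its modes.xml (the pair directory below is illustrative):

<pre>
grep '<mode name=' ~/source/apertium-en-es/modes.xml
# expect entries such as name="en-es-multi" and name="en-es-pretransfer"
</pre>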

==Annotation==

'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the [[#Makefiles|last section]] of this page.

This example uses the EuroParl corpus and the English to Spanish pair in Apertium.

Once everything is installed, the work proceeds as follows.

First, make sure that you have trained a language model for the target language; see [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM the IRSTLM manual].
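
A rough sketch of such a training run with IRSTLM, over the target-language side of the corpus (file names are illustrative; the exact options are described in the manual):

<pre>
add-start-end.sh < europarl.en > europarl.sb.en
build-lm.sh -i europarl.sb.en -n 5 -o europarl.ilm.gz
compile-lm europarl.ilm.gz en.blm
</pre>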

Take your corpus and make a tagged version of it:

<pre>
cat europarl.en-es.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged
</pre>
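
The tagged file is in the Apertium stream format after pretransfer, with one chosen analysis per word; a line looks something like this (output is illustrative):

<pre>
head -1 europarl.en-es.es.tagged
# e.g. ^el<det><def><m><sg>$ ^presidente<n><m><sg>$ ...
</pre>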

Make an ambiguous version of your corpus and trim redundant tags:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -f -t -n > europarl.en-es.es.ambig
</pre>
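
Each line now carries the source word together with every translation the bilingual dictionary offers for it, so an ambiguous word shows up along these lines (illustrative):

<pre>
head -1 europarl.en-es.es.ambig
# e.g. ^estación<n><f><sg>/station<n><sg>/season<n><sg>$
</pre>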

Next, generate all the possible disambiguation paths, again trimming redundant tags:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -m -f -t -n > europarl.en-es.es.multi-trimmed
</pre>

Translate and score all possible disambiguation paths:

<pre>
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -m -f -n |
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker \
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated
</pre>
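
As a rough sanity check (assuming irstlm-ranker emits one scored line per disambiguation path), the annotated file should line up with the multi-trimmed one:

<pre>
wc -l europarl.en-es.es.multi-trimmed europarl.en-es.es.annotated
</pre>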

Now we have a pseudo-parallel corpus where each possible translation is scored.
We start by extracting a frequency lexicon:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq
</pre>

Then we turn it into a set of default-translation rules:

<pre>
python3 ~/source/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx
</pre>

and compile them:

<pre>
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin
</pre>
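
The compiled file can presumably already be tried out over fresh biltrans output with lrx-proc (the -biltrans mode name is illustrative and depends on the pair):

<pre>
echo 'an example sentence' | apertium -d ~/source/apertium-en-es en-es-biltrans | lrx-proc europarl.en-es.freq.lrx.bin
</pre>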

From here on, there are two paths to choose from: we can extract rules using a maximum entropy classifier, or we can extract rules based only on the scores provided by irstlm-ranker.

== Direct rule extraction ==
With this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams
</pre>

Next, we prune the generated ngrams:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns
</pre>

Finally, we generate lexical selection rules, keeping only patterns whose irstlm-ranker score passes the threshold:

<pre>
crisphold=1
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx
</pre>

and compile them:

<pre>
lrx-comp patterns.lrx patterns.lrx.bin
</pre>

== Maximum entropy rule extraction ==
When extracting rules using a maximum entropy criterion, we first extract the features that will be fed to the classifier:

<pre>
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams
</pre>

We then train the classifiers, which, as a side effect, score how much each ngram contributes to a given translation. First we drop the events with zero scores, then feed the rest through yasmet (shipped with apertium-lex-tools):

<pre>
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium-lex-tools/yasmet > all-lambdas
</pre>

and merge the ngrams with the resulting weights:

<pre>
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all
</pre>

Next, we convert the weighted ngrams into rule candidates:

<pre>
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all
</pre>

trim them:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed
</pre>

and generate the lexical selection rules:

<pre>
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx
</pre>
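
As in the Makefile version below (which compiles its lm.xml output the same way), the generated rules can then be compiled with lrx-comp:

<pre>
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin
</pre>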

== Makefiles ==

=== Direct rule extraction ===
You can use this Makefile to generate rules using direct rule extraction.
Your corpus needs to be placed in the same folder as your Makefile.

<pre>
CORPUS=setimes
PAIR=mk-en
DATA=/home/philip/Apertium/apertium-mk-en
SL=mk
TL=en
TRAINING_LINES=10000
THR=1
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools

AUTOBIL=$(SL)-$(TL).autobil.bin
DIR=$(SL)-$(TL)

all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@

data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns
	python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@
</pre>

The corpus file needs to be named "basename"."language-pair"."source side"; in this Makefile the corpus file is therefore named setimes.mk-en.mk.

Set the Makefile variables as follows:
* CORPUS denotes the base name of your corpus file
* PAIR stands for the language pair
* SL and TL stand for source language and target language
* DATA is the path to the language resources for the language pair
* SCRIPTS denotes the path to the lex-tools scripts
* LEX_TOOLS is the path to apertium-lex-tools
* MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
* TRAINING_LINES is the number of corpus lines used for training
* THR is the score threshold (the crisphold above) used when generating rules

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
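
Assuming the Makefile above is saved as Makefile next to the corpus file setimes.mk-en.mk, a run is then simply:

<pre>
make
# intermediate files are created under data/; the rules end up in
# data/setimes.mk-en.freq.lrx.bin and data/setimes.mk-en.patterns.lrx
</pre>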

=== Maximum entropy ===
You can use this Makefile to generate rules using maximum entropy classifiers.
Your corpus needs to be placed in the same folder as your Makefile.

<pre>
CORPUS=europarl
PAIR=en-es
DATA=/home/philip/source/apertium-en-es
SL=es
TL=en
MODEL=/home/philip/lm/en.blm
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts
LEX_TOOLS=/home/philip/source/apertium-lex-tools
THR=1
TRAINING_LINES=10000

AUTOBIL=$(SL)-$(TL).autobil.bin
DIR=$(SL)-$(TL)
YASMET=$(LEX_TOOLS)/yasmet

all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml

data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@

data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed
	cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	lrx-comp $< $@

data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams

data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events
	cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed
	cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET) > $@

data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas
	python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@

data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all
	python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@

data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all
	cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@

data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed
	python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@

data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml
	lrx-comp $< $@
</pre>
The corpus file needs to be named "basename"."language-pair"."source side"; in this Makefile the corpus file is therefore named europarl.en-es.es.

Set the Makefile variables as follows:
* CORPUS denotes the base name of your corpus file
* PAIR stands for the language pair
* SL and TL stand for source language and target language
* DATA is the path to the language resources for the language pair
* SCRIPTS denotes the path to the lex-tools scripts
* LEX_TOOLS is the path to apertium-lex-tools
* MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
* TRAINING_LINES is the number of corpus lines used for training

Finally, executing the Makefile will generate lexical selection rules for the specified language pair.
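
As with the direct-extraction Makefile, running <code>make</code> builds everything; you can also build a single intermediate target, for example just the frequency lexicon (with the variables as configured above):

<pre>
make data/europarl.es-en.freq
</pre>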

[[Category:Lexical selection]]
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43867
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:19:38Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.'''<br />
<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43866
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:19:20Z
<p>Fpetkovski: </p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
'''<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43865
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:18:45Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
'''Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
'''<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Make sure that you have trained a language model for the target language. See [http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual#Training_your_first_LM]<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium-en-es en-es-multi | ~/source/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we directly continue with extracting ngrams from the pseudo parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules while thresholding their irstlm-score<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts//ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl-en-es.freq <br />
europarl.en-es.ambig europarl.en-es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which as a side effect score how much each ngram contributes to a certain translation:<br />
<pre><br />
cat events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py $(YASMET) > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Finally, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/lambdas-to-rules.py europarl-en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx.bin<br />
</pre><br />
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers. <br />
Your corpus needs to be placed in the same folder with your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file needs to be named as "basename"."language-pair"."source side". <br/><br />
As an illustration, in the Makefile example, the corpus file is named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43864
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:14:55Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
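<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />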
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
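<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />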
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
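<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />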
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43863
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:14:28Z
<p>Fpetkovski: /* Maximum entropy */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
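<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />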
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
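<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />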
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
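<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />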
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43862
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:13:51Z
<p>Fpetkovski: /* Maximum entropy */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
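<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />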
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
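<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />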
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.bin: data/$(CORPUS).$(DIR).lm.xml<br />
lrx-comp $< $@<br />
<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
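<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />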
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43861
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:12:35Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq <br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation (yasmet is the maximum entropy trainer bundled with apertium-lex-tools):<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
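<br />
The generated rule file can then be compiled with lrx-comp, just as in the direct path:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />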
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this makefile to generate rules using direct rule extraction.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
# threshold passed to ngrams-to-rules.py (see crisphold above)<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | sed 's/[^.]$$/&./' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named setimes.mk-en.mk <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
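<br />
For example, a run could look like this (assuming GNU make; the output names follow the <code>all</code> target above):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls setimes.mk-en.mk Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/setimes.mk-en.freq.lrx.bin data/setimes.mk-en.patterns.lrx<br />
</pre><br />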
<br />
=== Maximum entropy ===<br />
You can use this makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET)> $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
</pre><br />
The corpus file must be named "basename"."language-pair"."source side". <br/><br />
With the variable values in the Makefile above, the corpus file is thus named europarl.en-es.es <br/><br />
<br />
Set the Makefile variables as follows: <br/><br />
CORPUS denotes the base name of your corpus file <br/><br />
PAIR stands for the language pair <br/><br />
SL and TL stand for source language and target language <br/><br />
DATA is the path to the language resources for the language pair <br/><br />
SCRIPTS denotes the path to the lex-tools scripts <br/><br />
MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words <br/><br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
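<br />
For example, a run could look like this (assuming GNU make; note that the outputs use the $(SL)-$(TL) infix es-en, while the corpus file itself uses the pair name en-es):<br />
<br />
<pre><br />
# the corpus file sits next to the Makefile<br />
ls europarl.en-es.es Makefile<br />
make all<br />
# generated rules end up under data/<br />
ls data/europarl.es-en.freq.lrx.bin data/europarl.es-en.lm.xml<br />
</pre><br />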
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43860
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:12:10Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
'''Important:''' If you don't want to go through the whole process step by step, you can use the Makefile scripts provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Once you've got everything installed, the workflow is as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker <br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
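To try the compiled rules out, you can run them over the output of lexical transfer (a sketch only: it assumes the pair ships a biltrans debug mode, and lrx-proc is the rule processor from apertium-lex-tools):<br />
<br />
<pre><br />
echo "this is a test" | apertium -f none -d ~/source/apertium/apertium-en-es en-es-biltrans | lrx-proc patterns.lrx.bin<br />
</pre><br />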
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
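The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />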
<br />
<br />
== Makefiles ==<br />
<br />
=== Direct rule extraction ===<br />
You can use this Makefile to generate rules directly from the scores produced by irstlm-ranker.<br />
Your corpus needs to be placed in the same folder as your Makefile.<br />
<br />
<pre><br />
CORPUS=setimes<br />
PAIR=mk-en<br />
DATA=/home/philip/Apertium/apertium-mk-en<br />
SL=mk<br />
TL=en<br />
TRAINING_LINES=10000<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/en/setimes.en.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=1<br />
<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES)| sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/process-tagger-output $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null | grep "|@|" | cut -f 1-3 > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
<br />
<br />
</pre><br />
<br />
The corpus file needs to be named "basename"."language-pair"."source side".<br />
With the variables set as above, the corpus file would be named setimes.mk-en.mk.<br />
Set the Makefile variables as follows:<br />
* CORPUS denotes the base name of your corpus file<br />
* PAIR stands for the language pair<br />
* SL and TL stand for the source language and the target language<br />
* TRAINING_LINES is the number of corpus lines used for training<br />
* DATA is the path to the language resources for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* THR is the threshold passed to ngrams-to-rules.py when generating rules<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
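For example, with the variables set as above (a minimal sketch; it assumes GNU make and a corpus file setimes.mk-en.mk in the current directory):<br />
<br />
<pre><br />
make<br />
# the generated rules end up in data/setimes.mk-en.patterns.lrx;<br />
# compile them as before if a binary rule file is needed:<br />
lrx-comp data/setimes.mk-en.patterns.lrx data/setimes.mk-en.patterns.lrx.bin<br />
</pre><br />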
<br />
=== Maximum entropy ===<br />
You can use this Makefile to generate rules using maximum entropy classifiers.<br />
Your corpus needs to be placed in the same folder as your Makefile.<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
DATA=/home/philip/source/apertium-en-es<br />
SL=es<br />
TL=en<br />
MODEL=/home/philip/lm/en.blm<br />
SCRIPTS=/home/philip/source/apertium-lex-tools/scripts<br />
LEX_TOOLS=/home/philip/source/apertium-lex-tools<br />
THR=1<br />
TRAINING_LINES=10000<br />
<br />
<br />
AUTOBIL=$(SL)-$(TL).autobil.bin<br />
DIR=$(SL)-$(TL)<br />
YASMET=$(LEX_TOOLS)/yasmet<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).lm.xml<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -b -t -f -n > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -t -f > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).tagger data/$(CORPUS).$(DIR).multi-trimmed<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)/$(AUTOBIL) -m -f | apertium -f none -d $(DATA) $(DIR)-multi | $(LEX_TOOLS)/irstlm-ranker $(MODEL) data/$(CORPUS).$(DIR).multi-trimmed -f 2>/dev/null > $@ <br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).events data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).annotated data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/biltrans-count-patterns-frac-maxent.py data/$(CORPUS).$(SL)-$(TL).freq data/$(CORPUS).$(SL)-$(TL).ambig data/$(CORPUS).$(SL)-$(TL).annotated > data/$(CORPUS).$(DIR).events 2>data/$(CORPUS).$(DIR).ngrams<br />
<br />
data/$(CORPUS).$(DIR).all-lambdas: data/$(CORPUS).$(DIR).events<br />
cat data/$(CORPUS).$(DIR).events | grep -v -e '\$$ 0\.0 #' -e '\$$ 0 #' > data/$(CORPUS).$(DIR).events.trimmed<br />
cat data/$(CORPUS).$(DIR).events.trimmed | python $(SCRIPTS)/merge-all-lambdas.py $(YASMET) > $@<br />
<br />
data/$(CORPUS).$(DIR).rules-all: data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas<br />
python3 $(SCRIPTS)/merge-ngrams-lambdas.py data/$(CORPUS).$(DIR).ngrams data/$(CORPUS).$(DIR).all-lambdas > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-all: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all<br />
python3 $(SCRIPTS)/lambdas-to-rules.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).rules-all > $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams-trimmed: data/$(CORPUS).$(DIR).ngrams-all<br />
cat $< | python3 $(SCRIPTS)/ngram-pareto-trim.py > $@<br />
<br />
data/$(CORPUS).$(DIR).lm.xml: data/$(CORPUS).$(DIR).ngrams-trimmed<br />
python3 $(SCRIPTS)/ngrams-to-rules-me.py data/$(CORPUS).$(DIR).ngrams-trimmed > $@<br />
<br />
</pre><br />
The corpus file needs to be named "basename"."language-pair"."source side".<br />
As an illustration, in the Makefile example the corpus file is named europarl.en-es.es.<br />
Set the Makefile variables as follows:<br />
* CORPUS denotes the base name of your corpus file<br />
* PAIR stands for the language pair<br />
* SL and TL stand for the source language and the target language<br />
* TRAINING_LINES is the number of corpus lines used for training<br />
* DATA is the path to the language resources for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.<br />
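For example, with the variables set as above (a minimal sketch; it assumes GNU make and a corpus file europarl.en-es.es in the current directory):<br />
<br />
<pre><br />
make<br />
# DIR is $(SL)-$(TL), i.e. es-en, so the generated rules end up in data/europarl.es-en.lm.xml<br />
</pre><br />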
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43859
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:04:42Z
<p>Fpetkovski: /* Direct rule extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
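The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />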
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43858
Generating lexical-selection rules from monolingual corpora
2013-09-23T06:04:13Z
<p>Fpetkovski: /* Rule-extraction */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
== Direct rule extraction ==<br />
When using this method, we proceed directly to extracting ngrams from the pseudo-parallel corpus:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams<br />
</pre><br />
<br />
Next, we prune the generated ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns<br />
</pre><br />
<br />
Finally, we generate and compile lexical selection rules, thresholding them on their irstlm-ranker score:<br />
<br />
<pre><br />
crisphold=1;<br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp patterns.lrx patterns.lrx.bin<br />
</pre><br />
<br />
== Maximum entropy rule extraction ==<br />
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq \<br />
europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams<br />
</pre><br />
<br />
We then train classifiers which, as a side effect, score how much each ngram contributes to a given translation:<br />
<pre><br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py ~/source/apertium/apertium-lex-tools/yasmet > all-lambdas<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all<br />
</pre><br />
<br />
Next, we extract ngrams:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all<br />
</pre><br />
<br />
we trim them:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed<br />
</pre><br />
<br />
and generate lexical selection rules:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx<br />
</pre><br />
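The output of ngrams-to-rules-me.py is an XML rule file, so, as in the direct method, it still needs to be compiled before use:<br />
<br />
<pre><br />
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin<br />
</pre><br />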
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43857
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:47:23Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
From here on, there are two paths we can take: we can extract rules using a maximum entropy classifier, or we can extract<br />
rules based only on the scores provided by irstlm-ranker.<br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43856
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:45:28Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker \<br />
~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx<br />
</pre><br />
<br />
<br />
<pre><br />
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43855
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:42:49Z
<p>Fpetkovski: /* Annotation */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
<br />
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.<br />
<br />
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
Take your corpus and make a tagged version of it:<br />
<br />
<pre><br />
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged<br />
</pre><br />
<br />
Make an ambiguous version of your corpus and trim redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig<br />
</pre><br />
<br />
Next, generate all the possible disambiguation paths while trimming redundant tags:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed<br />
</pre><br />
<br />
Translate and score all possible disambiguation paths:<br />
<br />
<pre><br />
cat europarl.en-es.es.tagged | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |<br />
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated<br />
</pre><br />
<br />
Now we have a pseudo-parallel corpus where each possible translation is scored.<br />
We start by extracting a frequency lexicon:<br />
<br />
<pre><br />
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43854
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:29:12Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
'''Important:'''<br />
If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses, and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
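For example, a dummy LM file can be made like this (a sketch; the path just has to match the -lm argument in the command above):<br />
<br />
<pre><br />
$ mkdir -p /home/fran/corpora/europarl<br />
$ echo "this is a dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />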
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where data-en-es/europarl.biltrans-candidates.en-es contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which the rules will be extracted.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether this line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
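The resulting .lrx file contains roughly one rule per pattern. As an illustrative sketch of the format (the element names follow apertium-lex-tools; the lemmas are taken from the example patterns above), a rule selecting lenguaje after plain might look like:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />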
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43853
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:28:39Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
'''Important:'''<br />
'''If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.'''<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses, and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
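For instance, to see which translation will be picked as the default for a given source word, you can pull its entries out of the lexicon (a quick sketch using the format shown above):<br />
<br />
<pre><br />
$ grep ' union<n> ' data-en-es/europarl-en-es.lex.en-es | sort -nr | head -1<br />
31381 union<n> unión<n> @<br />
</pre><br />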
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
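When these patterns are later turned into rules (see [[#Generate rules|Generate rules]] below), a <code>+</code> line such as the ''plain language'' one above corresponds roughly to an lrx rule of this shape (a sketch; the exact output of <code>ngrams-to-rules.py</code> may differ):<br />
<br />
<pre><br />
<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />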
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
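To put the rules to use, compile them with <code>lrx-comp</code> from [[apertium-lex-tools]] and point the lexical-selection step of your pair at the resulting binary (a sketch; the file name your pair's modes expect may differ):<br />
<br />
<pre><br />
$ lrx-comp data-en-es/europarl-en-es.ngrams.en-es.lrx data-en-es/europarl-en-es.ngrams.en-es.lrx.bin<br />
</pre><br />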
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
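Save it as <code>Makefile</code>, adjust the paths at the top, and run <code>make</code> from the directory holding the corpus files. Since these are ordinary make variables, they can also be overridden on the command line, e.g.:<br />
<br />
<pre><br />
$ make TRAINING_LINES=500000 crisphold=1.5<br />
</pre><br />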
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43852
Generating lexical-selection rules from a parallel corpus
2013-09-23T05:28:24Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
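If everything built correctly, the aligner binaries should now be in the install prefix:<br />
<br />
<pre><br />
$ ls ~/smt/local/bin<br />
GIZA++  mkcls  snt2cooc.out  snt2plain.out<br />
</pre><br />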
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
Important:<br />
'''If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.'''<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
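The <code>~</code> replacement just keeps multiword units together as single tokens for the aligner; for example (an illustration):<br />
<br />
<pre><br />
^in addition to<pr>$   →   ^in~addition~to<pr>$<br />
</pre><br />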
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_monolingual_corpora&diff=43851
Generating lexical-selection rules from monolingual corpora
2013-09-23T05:24:48Z
<p>Fpetkovski: /* Prerequisites */</p>
<hr />
<div>{{TOCD}}<br />
This page describes how to generate lexical selection rules without relying on a parallel corpus.<br />
<br />
==Prerequisites==<br />
<br />
* [[apertium-lex-tools]]<br />
* [[IRSTLM]]<br />
* A language pair (e.g. apertium-br-fr)<br />
** The language pair should have the following two modes:<br />
*** <code>-multi</code> which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)<br />
*** <code>-pretransfer</code> which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)<br />
<br />
==Annotation==<br />
<br />
Take your corpus and run it through the lexical transfer:<br />
<br />
<pre><br />
cat $(CORPUS).$(DIR).txt | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-pretransfer | lt-proc -b $(DATA)/$(AUTOBIL) > $@<br />
</pre><br />
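The <code>-b</code> flag makes <code>lt-proc</code> output ambiguous lexical transfer, i.e. each word carries all of its possible translations (schematically):<br />
<br />
<pre><br />
^language<n>/lengua<n>/lenguaje<n>$<br />
</pre><br />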
<br />
Then select only the lines which have more than one and fewer than 10,000 translations, which contain an ambiguous noun/verb/adjective, and which have >= 90% coverage of the morphology.<br />
<br />
<pre><br />
cat $< | python3 $(SCRIPTS)/trim-fertile-lines.py | python3 $(SCRIPTS)/biltrans-line-only-pos-ambig.py | python3 $(SCRIPTS)/biltrans-trim-uncovered.py > $@<br />
</pre><br />
<br />
Generate all the possible disambiguation paths:<br />
<br />
<pre><br />
cat $< | python $(SCRIPTS)/biltrans-to-multitrans-line-recursive.py > $@<br />
</pre><br />
<br />
Translate all possible disambiguation paths:<br />
<br />
<pre><br />
cat $< | apertium -f none -d $(DATA) $(DIR)-multi > $@<br />
</pre><br />
<br />
Score all the possible disambiguation paths with IRSTLM.<br />
<br />
<pre><br />
<br />
</pre><br />
<br />
==Rule-extraction==<br />
<br />
First extract the default translations:<br />
<br />
<br />
Then the ngram partial counts:<br />
<br />
<br />
===Finding the best threshold===<br />
<br />
<br />
<br />
<br />
[[Category:Lexical selection]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43795
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:19:40Z
<p>Fpetkovski: /* Getting started */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
If you don't want to go through the whole process step by step, you can use the Makefile script provided in the [[#Makefile|last section]] of this page.<br />
<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl-en-es.ngrams.en-es $crisphold > data-en-es/europarl-en-es.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43794
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:15:44Z
<p>Fpetkovski: /* Process script */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under <code>~/smt/mosesdecoder/scripts/training/</code> (e.g. <code>clean-corpus-n.perl</code>).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentences that we can hope Apertium would be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl-en-es.biltrans-candidates.en-es 2> data-en-es/europarl-en-es.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl-en-es.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl-en-es.candidates.en-es > data-en-es/europarl-en-es.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl-en-es.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we extract the n-gram patterns that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl-en-es.lex.en-es data-en-es/europarl-en-es.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl-en-es.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency. A small parsing sketch follows.<br />
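<br />
For readers who want to post-process this output, here is a minimal parser (the field layout is inferred from the example lines above; the real scripts may differ in details):<br />
<br />
<pre><br />
# Illustrative parser for one pattern line of the format shown above.<br />
def parse_pattern(line):<br />
    fields = line.split()<br />
    sign, sl_word = fields[0][0], fields[0][1:]  # '+' or '-', then the SL word<br />
    freq = int(fields[-1])                       # trailing frequency count<br />
    translation = fields[-2]                     # the TL lemma being selected<br />
    context = fields[1:-2]                       # the n-gram context itself<br />
    return sign, sl_word, context, translation, freq<br />
</pre><br />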
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words; a minimal sketch follows.<br />
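<br />
For example (a minimal sketch; it assumes, as in the sample output above, that conjunctions carry the <cnjcoo> tag and unknown words are prefixed with *; the filtered file name is just a suggestion):<br />
<br />
<pre><br />
# Minimal filtering sketch: drop pattern lines with conjunctions or unknowns.<br />
def keep_pattern(line):<br />
    return '<cnjcoo>' not in line and '*' not in line<br />
<br />
with open('data-en-es/europarl.ngrams.en-es') as src, \<br />
     open('data-en-es/europarl.ngrams.filtered.en-es', 'w') as dst:<br />
    dst.writelines(line for line in src if keep_pattern(line))<br />
</pre><br />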
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
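<br />
To give an idea of the result (an illustrative sketch only: the rule below follows the general shape of lexical-selection rules, but the exact output of ngrams-to-rules.py may differ), a <code>+</code> pattern such as <code>+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3</code> would become a rule along these lines:<br />
<br />
<pre><br />
# Hypothetical example of the kind of rule generated from a '+' pattern.<br />
rule = '''<rule><br />
  <match lemma="plain" tags="adj"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lenguaje" tags="n"/><br />
  </match><br />
  <match lemma="," tags="cm"/><br />
</rule>'''<br />
print(rule)<br />
</pre><br />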
<br />
=== Makefile ===<br />
For the whole process you can run the following Makefile:<br />
<br />
<pre><br />
CORPUS=europarl<br />
PAIR=en-es<br />
SL=en<br />
TL=es<br />
DATA=/home/philip/Apertium/apertium-en-es<br />
<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
SCRIPTS=$(LEX_TOOLS)/scripts<br />
MOSESDECODER=/home/philip/mosesdecoder/scripts/training<br />
TRAINING_LINES=200000<br />
BIN_DIR=/home/philip/giza-pp/bin<br />
LM=/home/philip/Apertium/gsoc2013/giza/dummy.lm<br />
<br />
crisphold=1<br />
<br />
all: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL)<br />
<br />
# TAG CORPUS<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL): $(CORPUS).$(PAIR).$(SL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(SL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(SL)-$(TL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL): $(CORPUS).$(PAIR).$(TL)<br />
if [ ! -d data-$(SL)-$(TL) ]; then mkdir data-$(SL)-$(TL); fi<br />
cat $(CORPUS).$(PAIR).$(TL) | head -n $(TRAINING_LINES) \<br />
| apertium-destxt \<br />
| apertium -f none -d $(DATA) $(TL)-$(SL)-tagger \<br />
| apertium-pretransfer > $@;<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f1 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL)<br />
paste data-$(SL)-$(TL)/$(CORPUS).tagged.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.$(TL) \<br />
| grep '<' \<br />
| cut -f2 \<br />
| sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > $@<br />
<br />
# CLEAN<br />
data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL): data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(SL) data-$(SL)-$(TL)/$(CORPUS).tagged.new.$(TL)<br />
perl $(MOSESDECODER)/clean-corpus-n.perl data-$(SL)-$(TL)/$(CORPUS).tagged.new $(SL) $(TL) data-$(SL)-$(TL)/$(CORPUS).tag-clean 1 40;<br />
<br />
# ALIGN<br />
model: data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(SL) data-$(SL)-$(TL)/$(CORPUS).tag-clean.$(TL)<br />
-perl $(MOSESDECODER)/train-model.perl -external-bin-dir $(BIN_DIR) -corpus data-$(SL)-$(TL)/$(CORPUS).tag-clean \<br />
-f $(TL) -e $(SL) -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:$(LM):0 2>&1<br />
<br />
# EXTRACT AND TRIM<br />
data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL): model<br />
zcat giza.$(SL)-$(TL)/$(SL)-$(TL).A3.final.gz | $(SCRIPTS)/giza-to-moses.awk > $@<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(TL)-$(SL).autobil.bin -p -t > tmp1<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -p -t > tmp2<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $(LEX_TOOLS)/multitrans $(DATA)/$(SL)-$(TL).autobil.bin -b -t > data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL)<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-sentences.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) \<br />
data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) > $@ 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/extract-freq-lexicon.py data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) > $@ 2>/dev/null<br />
<br />
# BILTRANS CANDIDATES<br />
data-$(SL)-$(TL)/$(CORPUS).biltrans-entries.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR)<br />
python3 $(SCRIPTS)/extract-biltrans-candidates.py data-$(SL)-$(TL)/$(CORPUS).phrasetable.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).clean-biltrans.$(PAIR) \<br />
> $@ 2>/dev/null<br />
<br />
# NGRAM PATTERNS<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL): data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL)<br />
python $(SCRIPTS)/ngram-count-patterns.py data-$(SL)-$(TL)/$(CORPUS).lex.$(SL)-$(TL) data-$(SL)-$(TL)/$(CORPUS).candidates.$(SL)-$(TL) $(crisphold) 2>/dev/null > $@<br />
<br />
# NGRAMS TO RULES<br />
data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL).lrx: data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL)<br />
python3 $(SCRIPTS)/ngrams-to-rules.py data-$(SL)-$(TL)/$(CORPUS).ngrams.$(SL)-$(TL) $(crisphold) > $@ 2>/dev/null<br />
<br />
<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43793
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:14:46Z
<p>Fpetkovski: /* Generate rules */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre><br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43792
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:14:22Z
<p>Fpetkovski: /* Generate patterns */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43791
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:13:51Z
<p>Fpetkovski: /* Extract frequency lexicon */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that the corpus files are ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Next we extract the sentences in which each target-language word aligned to a source-language word is a possible translation of that word in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could plausibly generate itself.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file, we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es > data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidates for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the n-grams from which the rules will be generated.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing those that contain conjunctions or unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43790
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:12:21Z
<p>Fpetkovski: /* Extract bilingual dictionary candidates */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses in them, and replace blanks within lemmas with a placeholder character (we will use <code>~</code>), so that e.g. <code>^in addition to<pr>$</code> becomes <code>^in~addition~to<pr>$</code>:<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the directory that contains your EuroParl corpus.)<br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, just create a file with a few words in it. We won't be using the LM anyway.<br />
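<br />
Following that advice, something as simple as the following is enough; the path only has to match the <code>-lm</code> argument above:<br />
<br />
<pre><br />
$ echo "dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />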
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
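<br />
Each line of the resulting phrasetable file holds the two tokenised sides of a sentence pair plus their word alignment, separated by <code>|||</code>; the alignment field is the usual Moses format, pairs of 0-based token indices. An illustrative (made-up) line:<br />
<br />
<pre><br />
two<num> language<n> ||| dos<num> lengua<n> ||| 0-0 1-1<br />
</pre><br />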
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.candidates.en-es<br />
</pre><br />
<br />
These are essentially the sentence pairs that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the biltrans file we can extract candidate entries for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> data-en-es/europarl.biltrans-candidates.en-es 2> data-en-es/europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>data-en-es/europarl.biltrans-candidates.en-es</code> contains the generated candidate entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py data-en-es/europarl.candidates.en-es > data-en-es/europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat data-en-es/europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
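<br />
Since the defaults are exactly the <code>@</code>-marked lines, a quick way to inspect them is to pull those lines out and sort by the count in the first column (this assumes the <code>@</code> is the last character on the line):<br />
<br />
<pre><br />
$ grep '@$' data-en-es/europarl.lex.en-es | sort -nr | head<br />
</pre><br />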
<br />
===Generate patterns===<br />
<br />
Now we extract the n-grams from which we are going to generate the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python3 ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py data-en-es/europarl.lex.en-es data-en-es/europarl.candidates.en-es $crisphold 2>/dev/null > data-en-es/europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and its frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
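<br />
There is no fixed recipe for this step. As a minimal sketch: conjunctions carry the <code>cnjcoo</code> tag and unknown words are prefixed with <code>*</code> in the patterns above, so two <code>grep -v</code> passes are enough to drop both kinds of pattern:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' data-en-es/europarl.ngrams.en-es | grep -v '\*' > data-en-es/europarl.ngrams.filtered.en-es<br />
</pre><br />
<br />
If you do filter, feed the filtered file to the next step in place of the unfiltered one.<br />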
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py data-en-es/europarl.ngrams.en-es $crisphold > data-en-es/europarl.ngrams.en-es.lrx<br />
</pre> <br />
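<br />
The output is a rule file for the [[Constraint-based lexical selection module]]. As a rough, schematic illustration (not verbatim script output), the <code>plain language → lenguaje</code> pattern above would come out as something like:<br />
<br />
<pre><br />
<rules><br />
  <rule><br />
    <match lemma="plain" tags="adj"/><br />
    <match lemma="language" tags="n"><br />
      <select lemma="lenguaje" tags="n"/><br />
    </match><br />
    <match lemma="," tags="cm"/><br />
  </rule><br />
</rules><br />
</pre><br />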
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python3 $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
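# (optional; as sketched in the "Filter rules" section above, e.g.<br />
#  grep -v '<cnjcoo>' $CORPUS.ngrams.$SL-$TL > $CORPUS.ngrams.filtered.$SL-$TL)<br />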
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
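<br />
To use it, adjust the variables at the top to your own paths, corpus and language pair, save it under any name (say <code>extract-rules.sh</code>; the filename is just an example) and run it from the directory containing the corpus:<br />
<br />
<pre><br />
$ bash extract-rules.sh > extract-rules.log 2>&1<br />
</pre><br />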
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43789
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:10:09Z
<p>Fpetkovski: /* Extract sentences */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl-en-es.phrasetable.en-es data-en-es/europarl-en-es.biltrans-tok.en-es \<br />
> data-en-es/europarl-en-es.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43788
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:02:54Z
<p>Fpetkovski: /* Align corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43787
Generating lexical-selection rules from a parallel corpus
2013-09-21T15:00:06Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl-en-es.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl-en-es.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43786
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:58:39Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/ (e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.en-es.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es/europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f1 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing data-en-es/europarl-en-es.tagged.new.es & .en to data-en-es/europarl-en-es.tag-clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 sentences for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl-en-es.tag-clean.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus data-en-es/europarl-en-es.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
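<br />
For example, to create a dummy LM at the path used in the command above:<br />
<br />
<pre><br />
$ mkdir -p /home/fran/corpora/europarl<br />
$ echo "this is a dummy language model" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />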
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data-en-es/europarl.phrasetable.en-es<br />
1400000 data-en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, in essence, the sentences that we can hope Apertium will be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py data-en-es/europarl.phrasetable.en-es data-en-es/europarl.clean-biltrans.en-es \<br />
> europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
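<br />
Since the default translations all carry the <code>@</code> marker, you can pull them out with a one-liner. A minimal sketch, assuming the marker is always the last character on the line, as in the sample above:<br />
<br />
<pre><br />
$ grep '@$' europarl.lex.en-es | head<br />
31381 union<n> unión<n> @<br />
4 slope<n> pendiente<n> @<br />
</pre><br />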
<br />
===Generate patterns===<br />
<br />
Now we extract the ngrams that the rules will be generated from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions or rules with unknown words, as in the sketch below.<br />
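<br />
A minimal sketch: drop every pattern that contains a coordinating conjunction tag or an unknown word (unknown words are prefixed with <code>*</code>, as in the sample above), and write the result to a new file. Adapt the expressions to whatever else you want to filter out:<br />
<br />
<pre><br />
$ grep -v -e '<cnjcoo>' -e '\*' europarl.ngrams.en-es > europarl.ngrams-filtered.en-es<br />
</pre><br />
<br />
If you filter, pass the filtered file to <code>ngrams-to-rules.py</code> in the next step instead of <code>europarl.ngrams.en-es</code>.<br />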
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43785
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:58:19Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data-en-es/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g'> data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43784
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:57:42Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned. <br />
<br />
Make a folder called data-en-es. We are going to keep all the generated files there.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.en-es.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > data-en-es/europarl-en-es.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > data-en-es.europarl-en-es.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g' > data/europarl-en-es.tagged.new.es<br />
$ paste data-en-es/europarl-en-es.tagged.en data-en-es/europarl-en-es.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/\$[^\^]*/\$ /g'> data-en-es/europarl-en-es.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl data-en-es/europarl-en-es.tagged.new es en data-en-es/europarl-en-es.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to data-en-es/europarl-en-es.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 data-en-es/europarl-en-es.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 data-en-es/europarl-en-es.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.en > data-en-es/europarl-en-es.tag-clean.en.new<br />
$ head -1400000 data-en-es/europarl-en-es.tag-clean.es > data-en-es/europarl-en-es.tag-clean.es.new<br />
$ mv data-en-es/europarl-en-es.tag-clean.en.new data-en-es/europarl-en-es.tag-clean.en<br />
$ mv data-en-es/europarl-en-es.tag-clean.es.new data-en-es/europarl-en-es.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>data-en-es/europarl-en-es.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>data-en-es/europarl-en-es.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 data-en-es/europarl-en-es.tag-clean.en<br />
1400000 data-en-es/europarl-en-es.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
$crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
$cripshold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > $Europarl3.ngrams.$SL-TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43783
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:52:00Z
<p>Fpetkovski: /* Extract sentences */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses on and replace blanks within lemmas with a new character (we will use `~`):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g'> europarl.tagged.new.en<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tagged.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
After the sentences are aligned, we need to trim unnecessary tags from the tokens, and generate a biltrans file.<br />
<br />
<pre><br />
zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > data-en-es/europarl.phrasetable.en-es<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp1<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -p -t > tmp2<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
cat data-en-es/europarl.phrasetable.en-es | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | ~/source/apertium-lex-tools/multitrans ~/source/apertium-en-es/en-es.autobil.bin -b -t > data-en-es/europarl.clean-biltrans.en-es<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-en-es/europarl.phrasetable.en-es<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l data/en-es/europarl.phrasetable.en-es<br />
1400000 data/en-es/europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are basically sentences that we can hope that Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams that we are going to generate the rules from.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate if this line chooses the most frequent transation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and then the frequency.<br />
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43782
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:44:47Z
<p>Fpetkovski: /* Align corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
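<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />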
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
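<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />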
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
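<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />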
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43781
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:43:45Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses and replace blanks within lemmas with a new character (we will use <code>~</code>):<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 | sed 's/ /~/g' | sed 's/$$[^\^]*/$$ /g' > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
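<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />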
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
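<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />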
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
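<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />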
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43780
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:40:51Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses:<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
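<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />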
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
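<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />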
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
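<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />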
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43779
Generating lexical-selection rules from a parallel corpus
2013-09-21T14:40:04Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will now be under ~/smt/mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files:<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to remove the lines with no analyses:<br />
<br />
<pre><br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.new.en<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.new.es<br />
</pre><br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl.tagged.new es en europarl.tag-clean 1 40<br />
clean-corpus.perl: processing europarl-v6.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
<br />
We're going to cut off the bottom 67,658 for testing (also because Giza++ segfaults somewhere around there).<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.tag-clean.en > testing/europarl.tag-clean.67658.en<br />
$ tail -67658 europarl.tag-clean.es > testing/europarl.tag-clean.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.tag-clean.en > europarl.tag-clean.en.new<br />
$ head -1400000 europarl.tag-clean.es > europarl.tag-clean.es.new<br />
$ mv europarl.tag-clean.en.new europarl.tag-clean.en<br />
$ mv europarl.tag-clean.es.new europarl.tag-clean.es<br />
</pre><br />
<br />
<br />
These files are:<br />
<br />
* <code>europarl.tag-clean.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tag-clean.es</code>: The tagged target language side of the corpus<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.tag-clean.en<br />
1400000 europarl.tag-clean.es<br />
2800000 total<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now that we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-clean \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file; you can copy one from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
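<br />
For example, a throwaway placeholder (the path must match the one passed with <code>-lm</code> above):<br />
<br />
<pre><br />
$ echo "a few placeholder words" > /home/fran/corpora/europarl/europarl.lm<br />
</pre><br />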
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, sentences that we can hope Apertium might be able to generate.<br />
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
python3 ~/Apertium/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
Here, <code>europarl.biltrans-candidates.en-es</code> contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
The highest-frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line chooses the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is shown next, followed by the translation and its frequency.<br />
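<br />
For example, a line such as <code>-language<n> two<num> language<n> lengua<n> 8</code> would end up as a rule along the following lines. This is only a sketch in the lrx rule format; the exact output of <code>ngrams-to-rules.py</code> may differ:<br />
<br />
<pre><br />
<rule><br />
  <match lemma="two" tags="num"/><br />
  <match lemma="language" tags="n"><br />
    <select lemma="lengua" tags="n"/><br />
  </match><br />
</rule><br />
</pre><br />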
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules with conjunctions, or removing rules with unknown words.<br />
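<br />
For instance, a minimal shell sketch of both filters, assuming conjunctions carry the <code>cnjcoo</code> tag and unknown words are marked with <code>*</code> as in the example output above:<br />
<br />
<pre><br />
$ grep -v '<cnjcoo>' europarl.ngrams.en-es | grep -v '\*' > europarl.ngrams.en-es.filtered<br />
</pre><br />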
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules:<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
N=`wc -l $CORPUS.clean.$SL | cut -d ' ' -f 1`<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $N > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
crisphold=1.5<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Generating_lexical-selection_rules_from_a_parallel_corpus&diff=43753
Generating lexical-selection rules from a parallel corpus
2013-09-19T14:35:57Z
<p>Fpetkovski: /* Prepare corpus */</p>
<hr />
<div>{{TOCD}}<br />
If you have a parallel corpus, one of the things you can do is generate some lexical selection rules from it, to improve translation of words with more than one possible translation.<br />
<br />
== You will need ==<br />
<br />
Here is a list of software that you will need installed:<br />
<br />
* Giza++ (or some other word aligner)<br />
* Moses (for making Giza++ less human hostile)<br />
* All the Moses scripts<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
<br />
Furthermore you'll need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For Giza++ and moses-decoder, etc. you can do<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
$ git clone https://github.com/moses-smt/mosesdecoder<br />
$ cd mosesdecoder/<br />
$ ./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Getting started ==<br />
<br />
We're going to do the example with [[Corpora|EuroParl]] and the English to Spanish pair in Apertium.<br />
<br />
Given that you've got all the stuff installed, the work will be as follows:<br />
<br />
=== Prepare corpus ===<br />
<br />
To generate the rules, we need three files,<br />
<br />
* The tagged and tokenised source corpus<br />
* The tagged and tokenised target corpus<br />
* The output of the lexical transfer module in the source→target direction, tokenised<br />
<br />
These three files should be sentence aligned.<br />
<br />
The first thing that we need to do is tag both sides of the corpus:<br />
<br />
<pre><br />
$ nohup cat europarl.clean.en | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es en-es-pretransfer > europarl.tagged.en &<br />
$ nohup cat europarl.clean.es | apertium-destxt |\<br />
apertium -f none -d /home/fran/source/apertium-en-es es-en-pretransfer > europarl.tagged.es &<br />
</pre><br />
<br />
Then we need to number the lines and remove those with no analyses (afterwards, move each <code>.new</code> file over the original, e.g. <code>mv europarl.tagged.en.new europarl.tagged.en</code>):<br />
<br />
<pre><br />
$ seq 1 `wc -l < europarl.tagged.en` > europarl.lines<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f1 > europarl.lines.new<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f2 > europarl.tagged.en.new<br />
$ paste europarl.lines europarl.tagged.en europarl.tagged.es | grep '<' | cut -f3 > europarl.tagged.es.new<br />
</pre><br />
<br />
<br />
Next, we need to clean the corpus and remove long sentences.<br />
(Make sure you are in the same directory as the one where you have your europarl corpus) <br />
<br />
<pre><br />
$ perl (path to your mosesdecoder)/scripts/training/clean-corpus-n.perl europarl-v7.es-en es en europarl.clean 1 40<br />
clean-corpus.perl: processing europarl-v7.es-en.es & .en to europarl.clean, cutoff 1-40<br />
..........(100000)...<br />
<br />
Input sentences: 1786594 Output sentences: 1467708<br />
</pre><br />
<br />
Then run the English side through the lexical transfer:<br />
<br />
<pre><br />
$ nohup cat europarl.tagged.en | lt-proc -b ~/source/apertium-en-es/en-es.autobil.bin > europarl.biltrans.en-es &<br />
</pre><br />
<br />
We're going to cut off the bottom 67,658 lines for testing (also because Giza++ segfaults somewhere around there).<br />
<br />
<pre><br />
$ mkdir testing<br />
$ tail -67658 europarl.lines > testing/europarl.67658.lines<br />
$ tail -67658 europarl.tagged.en > testing/europarl.tagged.67658.en<br />
$ tail -67658 europarl.tagged.es > testing/europarl.tagged.67658.es<br />
</pre><br />
<br />
<pre><br />
$ head -1400000 europarl.lines > europarl.lines.new<br />
$ head -1400000 europarl.tagged.en > europarl.tagged.en.new<br />
$ head -1400000 europarl.tagged.es > europarl.tagged.es.new<br />
$ head -1400000 europarl.biltrans.en-es > europarl.biltrans.en-es.new<br />
$ mv europarl.lines.new europarl.lines<br />
$ mv europarl.tagged.en.new europarl.tagged.en<br />
$ mv europarl.tagged.es.new europarl.tagged.es<br />
$ mv europarl.biltrans.en-es.new europarl.biltrans.en-es<br />
</pre><br />
<br />
These files are:<br />
<br />
* <code>europarl.lines</code>: The list of lines included in the corpus from the original cleaned corpus.<br />
* <code>europarl.tagged.en</code>: The tagged source language side of the corpus<br />
* <code>europarl.tagged.es</code>: The tagged target language side of the corpus<br />
* <code>europarl.biltrans.en-es</code>: The output of the lexical transfer SL→TL<br />
<br />
Check that they have the same length:<br />
<br />
<pre><br />
$ wc -l europarl.*<br />
1400000 europarl.biltrans.en-es<br />
1400000 europarl.lines<br />
1400000 europarl.tagged.en<br />
1400000 europarl.tagged.es<br />
5600000 total<br />
</pre><br />
<br />
The next step is to tokenise these into a format appropriate for Moses. We also do some tag trimming here so<br />
that we can use the correct tags when generating lexical rules and bidix entries. For this we can use <code>process-tagger-output</code> from the apertium-lex-tools directory.<br />
<br />
<br />
<pre><br />
$ nohup cat europarl.tagged.es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/es-en.autobil.bin -p -t > europarl.tag-tok.es &<br />
$ nohup cat europarl.tagged.en | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/en-es.autobil.bin -p -t > europarl.tag-tok.en &<br />
$ nohup cat europarl.biltrans.en-es | ~/source/apertium-lex-tools/process-tagger-output ~/source/apertium-en-es/en-es.autobil.bin -b -t > europarl.biltrans-tok.en-es &<br />
</pre><br />
<br />
=== Align corpus ===<br />
<br />
Now we've got the corpus files ready, we can align the corpus using the Moses scripts:<br />
<br />
<pre><br />
nohup perl (path to your mosesdecoder)/scripts/training/train-model.perl -external-bin-dir \<br />
~/smt/local/bin -corpus europarl.tag-tok \<br />
-f en -e es -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/fran/corpora/europarl/europarl.lm:0 >log 2>&1 &<br />
</pre><br />
<br />
Note: Remember to change all the paths in the above command!<br />
<br />
You'll need an LM file, but you can copy it from a previous Moses installation. If you don't have one, make an empty file and put a few words in it. We won't be using the LM anyway.<br />
<br />
This takes a while, from a few hours to a day. So leave it running and go and make a soufflé, or chop some wood or something.<br />
<br />
=== Extract sentences ===<br />
<br />
The first thing we need to do after Moses has finished training is convert the Giza++ alignments to a less human- (and machine-) hostile format:<br />
<br />
<pre><br />
$ zcat giza.en-es/en-es.A3.final.gz | ~/source/apertium-lex-tools/scripts/giza-to-moses.awk > europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to make sure again that our file has the right number of lines:<br />
<br />
<pre><br />
$ wc -l europarl.phrasetable.en-es<br />
1400000 europarl.phrasetable.en-es<br />
</pre><br />
<br />
Then we want to extract the sentences where the target language word aligned to a source language word is a possible translation in the bilingual dictionary:<br />
<br />
<pre><br />
$ ~/source/apertium-lex-tools/scripts/extract-sentences.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es \<br />
> europarl.candidates.en-es<br />
</pre><br />
<br />
These are, roughly, the sentences that Apertium could be expected to generate.<br />
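<br />
For illustration, here is a minimal Python sketch of the idea (not the actual <code>extract-sentences.py</code>; the data structures and names are invented for the example):<br />
<br />
<pre><br />
# Keep a sentence pair only if some ambiguous source word is aligned<br />
# to a target word that the bilingual dictionary offers as a translation.<br />
def keep_sentence(alignments, bidix):<br />
    # alignments: list of (source word, aligned target word) pairs<br />
    # bidix: source word -> set of possible translations<br />
    for source_word, target_word in alignments:<br />
        options = bidix.get(source_word, set())<br />
        if len(options) > 1 and target_word in options:<br />
            return True<br />
    return False<br />
<br />
bidix = {'union<n>': {'unión<n>', 'sindicato<n>'}}<br />
print(keep_sentence([('union<n>', 'sindicato<n>')], bidix))  # True<br />
</pre><br />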
<br />
=== Extract bilingual dictionary candidates ===<br />
<br />
Using the phrasetable and the bilingual file we can extract candidates for the bilingual dictionary.<br />
<br />
<pre><br />
$ python3 ~/source/apertium-lex-tools/scripts/extract-biltrans-candidates.py europarl.phrasetable.en-es europarl.biltrans-tok.en-es > europarl.biltrans-candidates.en-es 2> europarl.biltrans-pairs.en-es<br />
</pre><br />
<br />
where europarl.biltrans-candidates.en-es contains the generated entries for the bilingual dictionary.<br />
<br />
===Extract frequency lexicon===<br />
<br />
The next step is to extract the frequency lexicon. <br />
<br />
<pre><br />
$ python ~/source/apertium-lex-tools/scripts/extract-freq-lexicon.py europarl.candidates.en-es > europarl.lex.en-es<br />
</pre><br />
<br />
This file should look like:<br />
<br />
<pre><br />
$ cat europarl.lex.en-es | head <br />
31381 union<n> unión<n> @<br />
101 union<n> sindicato<n><br />
1 union<n> situación<n><br />
1 union<n> monetario<adj><br />
4 slope<n> pendiente<n> @<br />
1 slope<n> ladera<n><br />
</pre><br />
<br />
Where the highest frequency translation is marked with an <code>@</code>.<br />
<br />
Note: This frequency lexicon can be used as a substitute for "choosing the most general translation" in your bilingual dictionary.<br />
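<br />
For example, a small Python sketch (illustrative only; the filename and whitespace-separated field layout assume the format shown above) that collects the default translations:<br />
<br />
<pre><br />
# Build a map from each source word to its @-marked default translation.<br />
defaults = {}<br />
with open('europarl.lex.en-es') as f:<br />
    for line in f:<br />
        fields = line.split()<br />
        if len(fields) >= 4 and fields[-1] == '@':<br />
            count, source, target = fields[0], fields[1], fields[2]<br />
            defaults[source] = target<br />
<br />
print(defaults.get('union<n>'))  # expected: unión<n><br />
</pre><br />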
<br />
===Generate patterns===<br />
<br />
Now we generate the ngrams from which we will extract the rules.<br />
<br />
<pre><br />
$ crisphold=1.5 # ratio of how many times you see the alternative translation compared to the default<br />
$ python ~/source/apertium-lex-tools/scripts/ngram-count-patterns.py europarl.lex.en-es europarl.candidates.en-es $crisphold 2>/dev/null > europarl.ngrams.en-es<br />
</pre><br />
<br />
This script outputs lines in the following format: <br />
<br />
<pre><br />
-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2<br />
+language<n> plain<adj> language<n> ,<cm> lenguaje<n> 3<br />
-language<n> language<n> knowledge<n> lengua<n> 4<br />
-language<n> language<n> of<pr> communication<n> lengua<n> 3<br />
-language<n> Community<adj> language<n> .<sent> lengua<n> 5<br />
-language<n> language<n> in~addition~to<pr> their<det><pos> lengua<n> 2<br />
-language<n> every<det><ind> language<n> lengua<n> 2<br />
+language<n> and<cnjcoo> *understandable language<n> lenguaje<n> 2<br />
-language<n> two<num> language<n> lengua<n> 8<br />
-language<n> only<adj> official<adj> language<n> lengua<n> 2<br />
</pre><br />
<br />
The <code>+</code> and <code>-</code> indicate whether the line selects the most frequent translation (<code>-</code>) or a translation which is not the most frequent (<code>+</code>). The pattern selecting the translation is then shown, followed by the translation and its frequency.<br />
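<br />
As an illustration, a short Python sketch (assuming the whitespace-separated layout shown above) that decomposes one such line:<br />
<br />
<pre><br />
line = "-language<n> and<cnjcoo> language<n> ,<cm> lengua<n> 2"<br />
sign = line[0]                # '-' default, '+' non-default translation<br />
fields = line[1:].split()<br />
source_word = fields[0]       # the ambiguous source word<br />
pattern = fields[1:-2]        # the context n-gram<br />
translation = fields[-2]<br />
frequency = int(fields[-1])<br />
print(sign, source_word, pattern, translation, frequency)<br />
</pre><br />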
<br />
===Filter rules===<br />
<br />
Now you can filter the rules, for example by removing rules that contain conjunctions or unknown words, as in the sketch below.<br />
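<br />
A minimal Python sketch of such a filter (the output filename is invented; unknown words are the tokens marked with <code>*</code> in the patterns above):<br />
<br />
<pre><br />
# Drop candidate patterns containing a conjunction tag or an unknown word.<br />
with open('europarl.ngrams.en-es') as inp, \<br />
     open('europarl.ngrams.filtered.en-es', 'w') as out:<br />
    for line in inp:<br />
        if '<cnjcoo>' in line or '*' in line:<br />
            continue<br />
        out.write(line)<br />
</pre><br />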
<br />
=== Generate rules ===<br />
<br />
The final stage is to generate the rules,<br />
<br />
<pre><br />
python3 ~/source/apertium-lex-tools/scripts/ngrams-to-rules.py europarl.ngrams.en-es $crisphold > europarl.ngrams.en-es.lrx<br />
</pre> <br />
<br />
=== Process script ===<br />
For the whole process you can run the following script:<br />
<br />
<pre><br />
CORPUS_DIR="/home/philip/Apertium/corpora/raw/europarl-fr-es"<br />
CORPUS="Europarl3"<br />
PAIR="es-fr"<br />
SL="fr"<br />
TL="es"<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=50000<br />
DATA="/home/philip/Apertium/apertium-fr-es"<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" $CORPUS.$PAIR $SL $TL "$CORPUS.clean" 1 40;<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.clean.$SL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.clean.$TL" | tail -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger \<br />
| apertium-pretransfer > $CORPUS.tagged.$TL;<br />
<br />
cat "$CORPUS.tagged.$SL" | lt-proc -b "$DATA/$SL-$TL.autobil.bin" > $CORPUS.biltrans.$PAIR<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > $CORPUS.lines<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f1 > $CORPUS.lines.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f2 > $CORPUS.tagged.$SL.new<br />
paste $CORPUS.lines $CORPUS.tagged.$SL $CORPUS.tagged.$TL | grep '<' | cut -f3 > $CORPUS.tagged.$TL.new<br />
<br />
mv $CORPUS.lines.new $CORPUS.lines<br />
mv $CORPUS.tagged.$SL.new $CORPUS.tagged.$SL<br />
mv $CORPUS.tagged.$TL.new $CORPUS.tagged.$TL<br />
<br />
# TRIM TAGS<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > $CORPUS.tag-tok.$SL<br />
cat $CORPUS.tagged.$TL | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > $CORPUS.tag-tok.$TL<br />
cat $CORPUS.tagged.$SL | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > $CORPUS.biltrans-tok.$PAIR<br />
<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus $CORPUS.tag-tok \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > $CORPUS.phrasetable.$SL-$TL<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py $CORPUS.candidates.$SL-$TL > $CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
<br />
# BILTRANS CANDIDATES<br />
python3 $SCRIPTS/extract-biltrans-candidates.py $CORPUS.phrasetable.$SL-$TL $CORPUS.biltrans-tok.$PAIR \<br />
> $CORPUS.biltrans-entries.$SL-$TL 2>$CORPUS.biltrans-pairs.$SL-$TL<br />
<br />
# NGRAM PATTERNS<br />
crisphold=1.5<br />
python $SCRIPTS/ngram-count-patterns.py $CORPUS.lex.$SL-$TL $CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > $CORPUS.ngrams.$SL-$TL<br />
<br />
# FILTER PATTERNS<br />
<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py $CORPUS.ngrams.$SL-$TL $crisphold > $CORPUS.ngrams.$SL-$TL.lrx<br />
<br />
</pre><br />
<br />
[[Category:Lexical selection]]<br />
[[Category:Documentation in English]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=User:Fpetkovski/GSOC_2013_Application_-_Improving_the_lexical_selection_module&diff=43637
User:Fpetkovski/GSOC 2013 Application - Improving the lexical selection module
2013-09-08T10:09:29Z
<p>Fpetkovski: /* Work to do */</p>
<hr />
<div>The lexical selection module in Apertium is currently a prototype. There are many optimisations that could be made to make it faster and more efficient. There are a number of scripts which can be used for learning lexical-selection rules, but the scripts are not particularly well written. Part of the task will be to rewrite the scripts taking into account all possible corner cases.<br />
<br />
The project idea is located [[Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module|here]].<br />
<br />
== Personal Info ==<br />
<br />
First name: Filip <br /><br />
Last name: Petkovski <br /><br />
Email: filip.petkovsky@gmail.com <br /><br />
IRC: fpetkovski on #apertium <br /><br />
<br />
== Why are you interested in machine translation? ==<br />
<br />
Machine translation can be thought of as one of the greatest challenges in natural language processing. It is the single most useful application of NLP, and building a good MT system requires a blend of numerous techniques from both computer science and linguistics.<br />
<br />
== Why is it that you are interested in the Apertium project? ==<br />
<br />
Apertium is a great project. It is obvious that a ton of work has been put into both developing the platform and creating the language resources. Apertium is a nice blend<br />
of rule-based and corpus-based machine translation. It also allows me to easily work with my native language, as well as with other closely related languages from the Slavic language group.<br />
<br />
== Why should Google and Apertium sponsor it? ==<br />
<br />
Lexical selection is the task of deciding which word to use in a given context.<br />
A good lexical selection module can significantly increase translation quality, <br />
and give machine translation a more human-like feel.<br />
<br />
== Which of the published tasks are you interested in? What do you plan to do? ==<br />
I'm interested in [[Ideas_for_Google_Summer_of_Code/Improvements_in_lexical-selection_module|Improving the lexical selection module]].<br />
<br />
I intend to improve the existing scripts and programs, and merge them into a release-ready package. I also intend to extend the existing functionality of the module.<br />
<br />
== Work already done ==<br />
<br />
* Generate lexical selection rules from a parallel corpus for the sh-mk language pair (submitted to SVN)<br />
* Generate additional bidix entries from a parallel corpus for the sh-mk language pair (submitted to SVN)<br />
* Participant in last year's GSoC (corpus-based feature transfer)<br />
<br />
== Work to do ==<br />
<br />
'''Community bonding period:'''<br />
* <s>go through the training process for monolingual rule extraction</s><br />
* <s>go through the training process for MaxEnt rule extraction (monolingual/parallel)</s><br />
* <s>document the results</s><br />
<br />
'''Week 1:'''<br /><br />
* <s>Update the instructions on the wiki</s><br />
* <s>Remove unused and redundant scripts.</s> (prefixed with unused. )<br />
* <s>Do proper processing of tags in all scripts.</s> (fixed with FSTProcessor::biltransWithoutQueue)<br />
* <s>Fix tokenization</s>. (fixed in scripts/common.py with tokenize_biltrans_line)<br />
* <s>Make sure that capitalisation, any tag and any character work as expected (fixed in tokenization).</s><br />
* <s>Ensure that all scripts process escaped characters correctly, e.g. ^ \ / $ < ></s> (fixed with tokenization)<br />
'''Week 2:'''<br /><br />
* <s>Script/program for finding possibly missing bidix entries from an aligned parallel corpus. </s><br />
* <s>Make sure that <match lemma="*" tags="*"/> works the same as <match/> </s><br />
* <s> <match/> doesn't match an LU when the lemma is , </s><br />
* Fix bug10 in the testing dir.<br />
'''Week 3:'''<br /><br />
* <s> Merge the four different implementations of irstlm_ranker into a single implementation </s><br />
* <s> add option to the ranker which marks translations which fall outside of xx% of the probability mass for a given sentence <code>|@| |+| |-|</code> </s><br />
* <s> Move lex-learner to lex-tools </s><br />
* <s>Run through and document new training process with a language pair (mk-en, br-fr, or en-es) </s><br />
* <s> Demonstrate bidix extraction script with a language pair (e.g. es-pt) </s><br />
'''Week 4-6:'''<br /><br />
* Rewrite the LRXProcessor::processME and LRXProcessor::process methods so that they share more code and are more modular. Having a 650-line method is not the right thing. <br />
* Work on a way to trim non-significant features from the maximum-entropy models.<br />
** probability mass: discard features which fall outside of xx% of the probability mass, e.g. 80%, should be configurable<br />
** outcome pruning: discard features that select a translation which can never win: e.g. the sum of the weights of all the contexts where it appears never adds up to more than the sum of the weights of all the other translations<br />
* <s> Implement poor-man's alignment: instead of using giza++, use tagged corpora and look up to see if the equivalent word appears. </s><br />
'''Weeks 7-9:'''<br />
* ...<br />
'''Week 9-10:'''<br />
* Apply the model to different language pairs and generate lexical selection rules and bidix entries.<br />
** eu-es, es-fr, es-pt, mk-en, br-fr, en-es<br />
'''Week 11-12:''' <br />
* Wrap up / writing paper<br />
<br />
== Skills, qualifications and field of study ==<br />
<br />
I am a graduate student of Computer Science, holding a Bachelor's degree in Computing. I have an excellent knowledge of Java and C#, and I'm fairly comfortable with C/C++ and scripting languages.<br />
<br />
<br />
Machine learning is one of my strongest skills. I have worked on quite a few ML projects involving named-entity relation extraction, news article classification, image-based gender classification and real-time vehicle detection. I have experience with building and optimising a model, and with feature selection and feature extraction for classification.<br />
I did my bachelor thesis in the field of computer vision, and my master thesis is in the field of natural language processing.<br />
<br />
<br />
I have also taken part in last year's GSoC, and have in addition worked on the sh-mk and sh-en language pairs.<br />
<br />
== Non-GSoC activities ==<br />
<br />
My Master's thesis is due 28.6 but I intend to focus on it intensively before the coding period starts (27.5).<br />
<br />
I also might be moving to the United States for a few months as a part of a work and travel programme, so I might be offline for a couple of days around 10.6.<br />
<br />
[[Category:GSoC 2013 Student proposals|Fpetkovski]]</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43322
Learning rules from parallel and non-parallel corpora
2013-08-19T13:47:09Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that number<br />
with the default translation from the frequency lexicon. <br />
It then decides whether to create a rule with the given translation, or to leave the default<br />
translation.<br />
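<br />
As a sketch of this decision (with invented counts; the real logic lives in the scripts below):<br />
<br />
<pre><br />
# Create a rule for a context only if some translation occurs at least<br />
# 'crisphold' times as often as the default translation in that context.<br />
crisphold = 1.5<br />
default = 'lengua<n>'                                 # from the frequency lexicon<br />
context_counts = {'lengua<n>': 2, 'lenguaje<n>': 4}   # counts in one context<br />
<br />
for translation, count in context_counts.items():<br />
    if translation != default and count >= crisphold * context_counts.get(default, 0):<br />
        print('create rule selecting', translation)<br />
</pre><br />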
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Where ''''crisphold'''' is a variable which determines how many times more often an individual translation must occur than the default translation for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual ngram<br />
a weight with which it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes how many times a certain context should occur for it to be taken into account.<br />
<br />
=== Poor man's alignment ===<br />
<br />
When using a large corpus, aligning tokens with Giza can be very slow.<br />
For that reason, we can estimate pairwise and ngram counts directly by relaxing the co-occurrence criteria used by Giza.<br />
<br />
For each possible translation of an ambiguous word, we add one if the translation occurs anywhere in the target sentence of the parallel corpus.<br />
<br />
A script for learning rules with maximum likelihood is given below:<br />
<br />
<pre><br />
<br />
</pre><br />
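<br />
Until that script is filled in, the counting idea can be sketched in a few lines of Python (illustrative only; the function and data structures are invented):<br />
<br />
<pre><br />
from collections import Counter<br />
<br />
counts = Counter()<br />
<br />
def update_counts(candidates, target_sentence):<br />
    # candidates: (source word, possible translation) pairs for one sentence<br />
    # add one whenever the translation occurs anywhere in the target sentence<br />
    target_tokens = set(target_sentence.split())<br />
    for source_word, translation in candidates:<br />
        if translation in target_tokens:<br />
            counts[(source_word, translation)] += 1<br />
<br />
update_counts([('language<n>', 'lengua<n>'), ('language<n>', 'lenguaje<n>')],<br />
              'el<det> parlamento<n> lengua<n>')<br />
print(counts)<br />
</pre><br />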
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder also place your source side corpus file. The corpus file needs to be named as "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* THR is the threshold passed to ngrams-to-rules.py (the ''crisphold'' value described earlier)<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43312
Learning rules from parallel and non-parallel corpora
2013-08-19T07:58:09Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that number<br />
with the default translation from the frequency lexicon. <br />
It then decides whether to create a rule with the given translation, or to leave the default<br />
translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Where ''''crisphold'''' is a variable which determines how many times more often an individual translation must occur than the default translation for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual ngram<br />
a weight with which it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes how many times a certain context should occur for it to be taken into account.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder also place your source side corpus file. The corpus file needs to be named as "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* LEX_TOOLS is the path to apertium-lex-tools<br />
* THR is the threshold passed to ngrams-to-rules.py (the ''crisphold'' value described earlier)<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43311
Learning rules from parallel and non-parallel corpora
2013-08-19T07:57:18Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as described below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now the clean-corpus and train-model scripts referred to below will be under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed, and blanks within tokens are replaced with a tilde (~),<br />
since Giza++ tokenizes a sentence by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza++ will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the direction of the corpus<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: amount of training lines<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder; the Moses training script requires this.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
After alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
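<br />
As a rough sketch, the decision amounts to a count-ratio test like the following (a toy illustration with hypothetical names, not the actual ngram-count-patterns.py internals):<br />
<br />
<pre><br />
# Toy sketch of the maximum-likelihood decision for one ambiguous word.<br />
def keep_or_rule(context_counts, default_translation, crisphold=1.5):<br />
    # context_counts: {translation: count} observed in one context<br />
    best = max(context_counts, key=context_counts.get)<br />
    default_count = context_counts.get(default_translation, 0)<br />
    # create a rule only if the best translation beats the default by the threshold<br />
    if best != default_translation and context_counts[best] >= crisphold * max(default_count, 1):<br />
        return best<br />
    return default_translation<br />
<br />
print(keep_or_rule({'banco': 3, 'orilla': 9}, 'banco'))  # -> 'orilla'<br />
</pre><br />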
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes the minimum number of times a context must occur for it to be taken into account.<br />
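<br />
Conceptually, the learned lambda weights are then applied as follows (a toy sketch with made-up n-grams and weights, not the actual lrx-proc runtime):<br />
<br />
<pre><br />
# Toy sketch of maximum-entropy scoring: each matching context n-gram<br />
# contributes its weight to the candidate translation it supports.<br />
lambdas = {<br />
    ('river', 'bank'): {'orilla': 1.3, 'banco': -0.2},<br />
    ('bank', 'account'): {'banco': 1.7, 'orilla': -0.9},<br />
}<br />
<br />
def choose(context_ngrams, candidates):<br />
    scores = {c: 0.0 for c in candidates}<br />
    for ng in context_ngrams:<br />
        for translation, weight in lambdas.get(ng, {}).items():<br />
            if translation in scores:<br />
                scores[translation] += weight<br />
    return max(scores, key=scores.get)<br />
<br />
print(choose([('river', 'bank')], ['banco', 'orilla']))  # -> 'orilla'<br />
</pre><br />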
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
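# translate every disambiguation path and score it with the target-side language model<br />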
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43310
Learning rules from parallel and non-parallel corpora
2013-08-19T07:56:56Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
The MIN variable denotes the minimum number of times a context must occur for it to be taken into account.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43309
Learning rules from parallel and non-parallel corpora
2013-08-19T07:55:02Z
<p>Fpetkovski: /* Maximum entropy rule extraction */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43308
Learning rules from parallel and non-parallel corpora
2013-08-19T07:54:12Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS="Europarl3": name of the corpus that you're using<br />
* PAIR: the translation direction of the corpus (e.g. es-pt)<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to moses-decoder<br />
* TRAINING_LINES: the number of corpus lines used for training<br />
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries are placed in a single folder, which the Moses training script requires.<br />
<br />
==== Alignment ====<br />
Giza++ is used for obtaining an alignment between the tokens of the parallel sentences in the two corpora. The alignment process can sometimes be slow depending on the size of the corpora.<br />
<br />
Once the alignment has been done, the tokens' tags can be trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Next, a bilingual transfer output is obtained from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
Finally, a frequency lexicon is created, marking the most common translation for each ambiguous token.<br />
<br />
<pre><br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
<br />
# SENTENCES<br />
python3 $SCRIPTS/extract-sentences.py data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR \<br />
> data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>/dev/null<br />
<br />
# FREQUENCY LEXICON<br />
python $SCRIPTS/extract-freq-lexicon.py data-$SL-$TL/$CORPUS.candidates.$SL-$TL > data-$SL-$TL/$CORPUS.lex.$SL-$TL 2>/dev/null<br />
<br />
</pre><br />
<br />
Make sure you set the BIN_DIR variable so that it contains the path to the binary folder<br />
generated by the Giza installation process.<br />
<br />
<br />
<br />
==== Maximum likelihood rule extraction ====<br />
The ML method counts how many times each translation occurs in a given context, and compares that count<br />
with the count of the default translation from the frequency lexicon.<br />
It then decides whether to create a rule for the given translation, or to keep the default translation.<br />
<br />
The rule generation process is done with the following script:<br />
<br />
<pre><br />
crisphold=1.5<br />
# NGRAM PATTERNS<br />
python $SCRIPTS/ngram-count-patterns.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL $crisphold 2>/dev/null > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL<br />
<br />
# NGRAMS TO RULES<br />
python3 $SCRIPTS/ngrams-to-rules.py data-$SL-$TL/$CORPUS.ngrams.$SL-$TL $crisphold > data-$SL-$TL/$CORPUS.ngrams.$SL-$TL.lrx 2>/dev/null<br />
</pre><br />
<br />
Here, '''crisphold''' determines how many times more often than the default translation an individual translation must occur in a context for a rule to be created.<br />
<br />
==== Maximum entropy rule extraction ====<br />
The ME method learns a discriminative model which assigns each individual context (n-gram)<br />
a weight indicating how strongly it contributes to a certain translation.<br />
<br />
The rule extraction process is done in the following way:<br />
<br />
<pre><br />
MIN=1<br />
YASMET=$LEX_TOOLS/yasmet<br />
python $SCRIPTS/ngram-count-patterns-maxent2.py data-$SL-$TL/$CORPUS.lex.$SL-$TL data-$SL-$TL/$CORPUS.candidates.$SL-$TL 2>ngrams > events<br />
<br />
echo -n "" > all-lambdas<br />
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed<br />
for i in `cat events.trimmed | cut -f1 | sort -u | sed 's/\([\*\^\$]\)/\\\\\1/g'`; do<br />
<br />
num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`<br />
echo $num > tmp.yasmet.$i;<br />
cat events.trimmed | grep "^$i" | cut -f3 >> tmp.yasmet.$i;<br />
echo "$i"<br />
cat tmp.yasmet.$i | $YASMET -red $MIN > tmp.yasmet.$i.$MIN; <br />
cat tmp.yasmet.$i.$MIN | $YASMET > tmp.lambdas.$i<br />
cat tmp.lambdas.$i | sed "s/^/$i /g" >> all-lambdas;<br />
done<br />
<br />
rm tmp.*<br />
<br />
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt<br />
<br />
python3 $SCRIPTS/lambdas-to-rules.py data-$SL-$TL/$CORPUS.lex.$SL-$TL rules-all.txt > ngrams-all.txt<br />
<br />
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög<br />
<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
As an illustration, in the Makefile example, the corpus file is named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the path to the binary bilingual dictionary for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, executing the Makefile will generate lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43307
Learning rules from parallel and non-parallel corpora
2013-08-19T07:41:11Z
<p>Fpetkovski: /* Training */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts, which you can install as shown below<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to below will then be under mosesdecoder/scripts/training/, e.g. ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl.<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
To prepare the training files, the source and target sides of the parallel corpus are first analysed and tagged.<br />
<br />
Next, lines with no analyses are removed, and blanks within tokens are replaced with a placeholder character,<br />
since Giza tokenizes sentences by splitting on whitespace.<br />
<br />
Finally, both files are cleaned using a Moses training script so that Giza will not<br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
# replace blanks inside tokens with '~' and re-separate tokens with single spaces<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS (drop sentence pairs shorter than 1 or longer than 40 tokens)<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
Make sure you set the variables as follows: <br/><br />
* CORPUS: the base name of the corpus files (here "Europarl3")<br />
* PAIR: the language pair of the corpus, as used in the corpus file names<br />
* SL: the source language<br />
* TL: the target language<br />
* DATA: path to the language resources for the language pair<br />
* LEX_TOOLS: path to apertium-lex-tools<br />
* MOSESDECODER: path to the Moses training scripts<br />
* TRAINING_LINES: the number of corpus lines used for training (see the example run below)<br />
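<br />
For example, if you save the script above as prepare-training.sh (a file name chosen here purely for illustration) and run it with the variable values shown, you should end up with roughly the following files:<br />
<pre><br />
$ bash prepare-training.sh<br />
$ ls data-pt-es/<br />
Europarl3.lines         Europarl3.tagged.es     Europarl3.tagged.pt<br />
Europarl3.tag-clean.es  Europarl3.tag-clean.pt<br />
</pre><br />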
<br />
=== Learning rules with Giza ===<br />
<br />
==== Installing Giza ====<br />
<br />
You can download and install Giza in the following way:<br />
<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ # if the googlecode URL is unavailable, giza-pp is also mirrored at https://github.com/moses-smt/giza-pp<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
Note that all of the binaries end up in a single folder, which is what the Moses training script expects (it is passed as -external-bin-dir below).<br />
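<br />
You can verify the result; these are exactly the four binaries copied above:<br />
<pre><br />
$ ls ~/smt/local/bin<br />
GIZA++  mkcls  snt2cooc.out  snt2plain.out<br />
</pre><br />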
<br />
==== Alignment ====<br />
Giza++ is used to obtain a word alignment between the tokens of the parallel sentences in the two corpora. Depending on the size of the corpora, the alignment process can be slow.<br />
<br />
Once alignment is done, the tokens' tags are trimmed down so that they match the set of tags found in the bilingual dictionary of the language pair.<br />
<br />
Finally, bilingual transfer output is generated from the source language side so that<br />
ambiguous sentences and missing bidix candidates can be extracted.<br />
<br />
<pre><br />
BIN_DIR="$HOME/smt/local/bin" # the folder where the Giza binaries were installed above<br />
# ALIGN<br />
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1 # adjust this hard-coded -lm path to a language model file on your machine<br />
<br />
# EXTRACT<br />
zcat giza.$SL-$TL/$SL-$TL.A3.final.gz | $SCRIPTS/giza-to-moses.awk > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
<br />
# TRIM TAGS (restrict each token's tags to those known to the bilingual dictionary)<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 1 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$TL-$SL.autobil.bin -p -t > tmp1<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -p -t > tmp2<br />
<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 3 > tmp3<br />
<br />
# ambiguous biltrans output (-b) for the source side, used to spot ambiguous sentences and missing bidix entries<br />
cat data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL | sed 's/ ||| /\t/g' | cut -f 2 \<br />
| sed 's/~/ /g' | $LEX_TOOLS/process-tagger-output $DATA/$SL-$TL.autobil.bin -b -t > data-$SL-$TL/$CORPUS.clean-biltrans.$PAIR<br />
<br />
paste tmp1 tmp2 tmp3 | sed 's/\t/ ||| /g' > data-$SL-$TL/$CORPUS.phrasetable.$SL-$TL<br />
rm tmp1 tmp2 tmp3<br />
</pre><br />
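<br />
Each line of the resulting phrasetable file holds one sentence pair and its word alignment as three fields separated by " ||| " (a schematic line, not real output):<br />
<pre><br />
^token<tags>$ ^token<tags>$ ... ||| ^token<tags>$ ... ||| 0-0 1-1 2-3<br />
</pre><br />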
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual); a sketch of the commands is given after this list<br />
* The language pair must support the pretransfer and multi modes; see apertium-sh-mk/modes.xml<br />
as a reference on how to add these modes if they do not exist.<br />
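<br />
A rough sketch of how such a binary model can be built with IRSTLM (flag names differ between IRSTLM versions, so check them against the manual; the corpus and model file names here follow the Makefile below):<br />
<pre><br />
add-start-end.sh < setimes.mk.txt > setimes.mk.se<br />
build-lm.sh -i setimes.mk.se -n 5 -o setimes.mk.ilm.gz<br />
compile-lm setimes.mk.ilm.gz setimes.mk.5.blm<br />
</pre><br />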
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
# tag the corpus; the sed expression makes sure every line ends in a full stop<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
# ambiguous version of the corpus (-b), with redundant tags trimmed (-t)<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -b -t > $@<br />
<br />
# all possible disambiguation paths (-m), with redundant tags trimmed<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m -t > $@<br />
<br />
# translate every disambiguation path and score it with the target-side language model<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
# extract a fractional frequency lexicon from the scored translations<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
# turn the frequency lexicon into rules for the most frequent translations, and compile them<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
# count context n-grams around ambiguous words, prune them, and turn them into rules<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source side corpus file. The corpus file needs to be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, that is setimes.sh-mk.txt. Note that if you copy the Makefile from this page, each recipe line must be indented with a real tab character.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the file name of the binary bilingual dictionary for the language pair, resolved relative to DATA<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* LEX_TOOLS is the path to apertium-lex-tools itself<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
* THR is the score threshold passed to ngrams-to-rules.py when generating the final rules (see the example run below)<br />
<br />
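To run the whole training pipeline, invoke make in that folder. With the example variable values, a successful run should leave roughly the following in data/, including the two rule files targeted by all:<br />
<pre><br />
$ make<br />
$ ls data/<br />
setimes.sh-mk.ambig          setimes.sh-mk.freq.lrx.bin    setimes.sh-mk.patterns.lrx<br />
setimes.sh-mk.annotated      setimes.sh-mk.multi-trimmed   setimes.sh-mk.ranked<br />
setimes.sh-mk.freq           setimes.sh-mk.ngrams          setimes.sh-mk.tagger<br />
setimes.sh-mk.freq.lrx       setimes.sh-mk.patterns<br />
</pre><br />
<br />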
Finally, executing the Makefile in this way generates the lexical selection rules for the specified language pair.</div>
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43306
Learning rules from parallel and non-parallel corpora
2013-08-19T07:39:22Z
<p>Fpetkovski: /* Training */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43305
Learning rules from parallel and non-parallel corpora
2013-08-19T07:38:00Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43304
Learning rules from parallel and non-parallel corpora
2013-08-19T07:27:38Z
<p>Fpetkovski: /* Learning rules with Giza */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43303
Learning rules from parallel and non-parallel corpora
2013-08-19T07:27:03Z
<p>Fpetkovski: /* Preparing the training files */</p>
<hr />
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43302
Learning rules from parallel and non-parallel corpora
2013-08-19T07:22:41Z
<p>Fpetkovski: /* Estimating rules using parallel corpora */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes several (3) methods for estimating<br />
lexical selection rules using a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then continue to describe each of the <br />
individual methods separately.<br />
=== Prequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts which you can install using the following script<br />
<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For moses-decoder you can do<br />
<pre><br />
git clone https://github.com/moses-smt/mosesdecoder<br />
cd mosesdecoder/<br />
./bjam <br />
</pre><br />
<br />
Now e.g. the clean-corpus and train-model scripts referred to below will be in ~/smt/mosesdecoder/scripts/training/clean-corpus-n.perl<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
=== Preparing the training files ===<br />
<br />
The parallel corpus is processed in such a way that <br />
the training files (the source and target side corpus) are first analysed and tagged.<br />
<br />
Next, lines with no analysis are removed and blank within tokens are replaced with a new character<br />
since Giza tokenizes a sentence by splitting on white space.<br />
<br />
Finally, both files are cleaned using a moses training script so that Giza will not <br />
crash during the training process.<br />
<br />
All of this can be achieved using the following script:<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
# TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
</pre><br />
<br />
=== Installing Giza ===<br />
<pre><br />
<br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual). <br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml <br />
as a reference on how to add these modes if they do not exist.<br />
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/[^\.]$$/./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(DIR).autobil.bin -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, the corpus file is therefore named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the binary bilingual dictionary (relative to DATA) for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, running the Makefile generates the lexical selection rules for the specified language pair.<br />
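<br />
The resulting .patterns.lrx file can be compiled for use with the pair in the same way as the frequency rules above:<br />
<pre><br />
lrx-comp data/setimes.sh-mk.patterns.lrx data/setimes.sh-mk.patterns.lrx.bin<br />
</pre></div>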
Fpetkovski
https://wiki.apertium.org/w/index.php?title=Learning_rules_from_parallel_and_non-parallel_corpora&diff=43301
Learning rules from parallel and non-parallel corpora
2013-08-19T07:09:31Z
<p>Fpetkovski: /* Preparing the training files */</p>
<hr />
<div>== Estimating rules using parallel corpora ==<br />
It is always recommended to use a parallel corpus for any type of machine translation training<br />
when such a resource is available. This section describes three methods for estimating<br />
lexical selection rules from a parallel corpus. We start by describing the part of the training process that is shared by all three methods, and then describe each of the<br />
individual methods separately.<br />
=== Prerequisites ===<br />
The training methods use several software packages that need to be installed.<br />
First you will need to download and install:<br />
* [[lttoolbox]]<br />
* Apertium<br />
* [[apertium-lex-tools]]<br />
* Moses and its training scripts<br />
Furthermore you will also need:<br />
<br />
* an Apertium language pair<br />
* a parallel corpus (see [[Corpora]])<br />
=== Preparing the training files ===<br />
<br />
<pre><br />
CORPUS="Europarl3"<br />
PAIR="es-pt"<br />
SL="pt"<br />
TL="es"<br />
DATA="/home/philip/Apertium/apertium-es-pt"<br />
<br />
LEX_TOOLS="/home/philip/Apertium/apertium-lex-tools"<br />
SCRIPTS="$LEX_TOOLS/scripts"<br />
MOSESDECODER="/home/philip/Apertium/mosesdecoder/scripts/training"<br />
TRAINING_LINES=100000<br />
BIN_DIR="/home/philip/Apertium/smt/bin"<br />
crisphold=1<br />
<br />
if [ ! -d data-$SL-$TL ]; then <br />
mkdir data-$SL-$TL;<br />
fi<br />
<br />
#TAG CORPUS<br />
cat "$CORPUS.$PAIR.$SL" | head -n $TRAINING_LINES | apertium -d "$DATA" $SL-$TL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$SL;<br />
<br />
cat "$CORPUS.$PAIR.$TL" | head -n $TRAINING_LINES | apertium -d "$DATA" $TL-$SL-tagger-dcase \<br />
| apertium-pretransfer > data-$SL-$TL/$CORPUS.tagged.$TL;<br />
<br />
N=`wc -l $CORPUS.$PAIR.$SL | cut -d ' ' -f 1`<br />
<br />
<br />
# REMOVE LINES WITH NO ANALYSES<br />
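# (paste the line numbers against both tagged files; grep '<' keeps only rows<br />
# where at least one analysis survived, and cut recovers each column)<br />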
seq 1 $TRAINING_LINES > data-$SL-$TL/$CORPUS.lines<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f1 > data-$SL-$TL/$CORPUS.lines.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f2 > data-$SL-$TL/$CORPUS.tagged.$SL.new<br />
paste data-$SL-$TL/$CORPUS.lines data-$SL-$TL/$CORPUS.tagged.$SL data-$SL-$TL/$CORPUS.tagged.$TL | grep '<' \<br />
| cut -f3 > data-$SL-$TL/$CORPUS.tagged.$TL.new<br />
mv data-$SL-$TL/$CORPUS.lines.new data-$SL-$TL/$CORPUS.lines<br />
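# normalise the surviving lines: turn every space into '~' so spaces inside<br />
# multiword units are protected, then rewrite each '$'-to-'^' stretch as a<br />
# single space, discarding unanalysed material between lexical units<br />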
cat data-$SL-$TL/$CORPUS.tagged.$SL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$SL<br />
cat data-$SL-$TL/$CORPUS.tagged.$TL.new \<br />
| sed 's/ /~/g' | sed 's/\$[^\^]*/$ /g' > data-$SL-$TL/$CORPUS.tagged.$TL<br />
rm data-$SL-$TL/*.new<br />
<br />
<br />
# CLEAN CORPUS<br />
perl "$MOSESDECODER/clean-corpus-n.perl" data-$SL-$TL/$CORPUS.tagged $SL $TL "data-$SL-$TL/$CORPUS.tag-clean" 1 40;<br />
<br />
# ALIGN<br />
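# note: -lm takes factor:order:file:type; for rule extraction only the word<br />
# alignments produced by this step are used, so any existing target-side ARPA<br />
# LM should do here<br />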
perl $MOSESDECODER/train-model.perl -external-bin-dir "$BIN_DIR" -corpus data-$SL-$TL/$CORPUS.tag-clean \<br />
-f $TL -e $SL -alignment grow-diag-final-and -reordering msd-bidirectional-fe \<br />
-lm 0:5:/home/philip/Apertium/gsoc2013/giza/europarl.lm:0 2>&1<br />
</pre><br />
<br />
===Installing prerequisites===<br />
See [[Minimal installation from SVN]] for apertium/lttoolbox. <br />
<br />
See [[Constraint-based lexical selection module]] for apertium-lex-tools.<br />
<br />
For GIZA++, the Moses decoder, etc., you can do:<br />
<pre><br />
$ mkdir ~/smt<br />
$ cd ~/smt<br />
$ mkdir local # our "install prefix"<br />
$ wget https://giza-pp.googlecode.com/files/giza-pp-v1.0.7.tar.gz<br />
$ tar xzvf giza-pp-v1.0.7.tar.gz<br />
$ cd giza-pp<br />
$ make<br />
$ mkdir ../local/bin<br />
$ cp GIZA++-v2/snt2cooc.out ../local/bin/<br />
$ cp GIZA++-v2/snt2plain.out ../local/bin/<br />
$ cp GIZA++-v2/GIZA++ ../local/bin/<br />
$ cp mkcls-v2/mkcls ../local/bin/<br />
</pre><br />
<br />
The clean-corpus and train-model scripts referred to above will then be found under ~/smt/mosesdecoder/scripts/training/ (e.g. clean-corpus-n.perl).<br />
See http://www.statmt.org/moses/?n=Development.GetStarted if you want to install the binaries to some other directory.<br />
<br />
== Estimating rules using non-parallel corpora ==<br />
Prerequisites:<br />
* Install [[apertium-lex-tools]]<br />
* Install IRSTLM (http://sourceforge.net/projects/irstlm/)<br />
* Estimate a '''binary''' target side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual).<br />
* The language pair must support the pretransfer and multi modes. See apertium-sh-mk/modes.xml for a reference on how to add these modes if they do not exist; a quick check is sketched after this list.<br />
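<br />
A quick way to check that both modes exist once the pair is compiled (a sketch, assuming the usual layout where compiled modes live under modes/):<br />
<pre><br />
ls /home/philip/Apertium/apertium-sh-mk/modes | grep -E 'pretransfer|multi'<br />
</pre><br />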
Place the following Makefile in the folder where you want to run your training process:<br />
<br />
<pre><br />
CORPUS=setimes<br />
DIR=sh-mk<br />
DATA=/home/philip/Apertium/apertium-sh-mk/<br />
AUTOBIL=sh-mk.autobil.bin<br />
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts<br />
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm<br />
LEX_TOOLS=/home/philip/Apertium/apertium-lex-tools<br />
THR=0<br />
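<br />
# Targets: tagger -> ambig & multi-trimmed -> ranked (scored by the target-side<br />
# LM) -> annotated -> freq -> freq.lrx(.bin) / ngrams -> patterns -> patterns.lrx;<br />
# THR is the score threshold handed to ngrams-to-rules.py<br />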
<br />
#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx<br />
all: data/$(CORPUS).$(DIR).freq.lrx.bin data/$(CORPUS).$(DIR).patterns.lrx<br />
<br />
data/$(CORPUS).$(DIR).tagger: $(CORPUS).$(DIR).txt<br />
if [ ! -d data ]; then mkdir data; fi<br />
cat $(CORPUS).$(DIR).txt | sed 's/\([^\.]\)$$/\1./g' | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-tagger | apertium-pretransfer > $@<br />
<br />
data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -b -t > $@<br />
<br />
data/$(CORPUS).$(DIR).multi-trimmed: data/$(CORPUS).$(DIR).tagger<br />
cat data/$(CORPUS).$(DIR).tagger | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m -t > $@<br />
<br />
data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).tagger<br />
cat $< | $(LEX_TOOLS)/multitrans $(DATA)$(AUTOBIL) -m | apertium -f none -d $(DATA) $(DIR)-multi | irstlm-ranker-frac $(MODEL) > $@<br />
<br />
data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked<br />
paste data/$(CORPUS).$(DIR).multi-trimmed data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@<br />
<br />
data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq<br />
python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@<br />
<br />
data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx<br />
lrx-comp $< $@<br />
<br />
data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated<br />
python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@<br />
<br />
data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams<br />
python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@ <br />
<br />
data/$(CORPUS).$(DIR).patterns.lrx: data/$(CORPUS).$(DIR).patterns<br />
python3 $(SCRIPTS)/ngrams-to-rules.py $< $(THR) > $@<br />
</pre><br />
<br />
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt. <br/><br />
In the Makefile example above, the corpus file is therefore named setimes.sh-mk.txt.<br />
<br />
Set the Makefile variables as follows: <br/><br />
* CORPUS denotes the base name of your corpus file<br />
* DIR stands for the language pair<br />
* DATA is the path to the language resources for the language pair<br />
* AUTOBIL is the binary bilingual dictionary (relative to DATA) for the language pair<br />
* SCRIPTS denotes the path to the lex-tools scripts<br />
* MODEL is the path to the target side (binary) language model used for scoring the possible translations of ambiguous words<br />
<br />
Finally, running the Makefile generates the lexical selection rules for the specified language pair.<br />
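<br />
Once compiled, the rules can be tried directly on bilingual-transfer style input with lrx-proc (a sketch: the input below is a made-up ambiguous entry, and lrx-proc is assumed to be on your PATH from apertium-lex-tools):<br />
<pre><br />
lrx-comp data/setimes.sh-mk.patterns.lrx data/setimes.sh-mk.patterns.lrx.bin<br />
echo '^dummy<n>/one<n>/two<n>$' | lrx-proc data/setimes.sh-mk.patterns.lrx.bin<br />
</pre></div>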
Fpetkovski