Generating lexical-selection rules from monolingual corpora
This page describes how to generate lexical selection rules without relying on a parallel corpus.
Prerequisites
- apertium-lex-tools
- IRSTLM
- A language pair (e.g. apertium-br-fr)
- The language pair should have the following two modes:
  - -multi, which is all the modules after lexical transfer (see apertium-mk-en/modes.xml)
  - -pretransfer, which is all the modules up to lexical transfer (see apertium-mk-en/modes.xml)
Annotation
Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.
We're going to do the example with EuroParl and the English to Spanish pair in Apertium.
Assuming everything listed above is installed, the steps are as follows:
Take your corpus and make a tagged version of it:
cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged
Make an ambiguous version of your corpus and trim redundant tags:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig
Next, generate all the possible disambiguation paths while trimming redundant tags:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed
Translate and score all possible disambiguation paths:
cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n | apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated
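Each of the four commands above writes one intermediate file, and an empty file from a failed pipeline step is easy to miss. The following is a minimal Python sketch of a sanity check, not part of apertium-lex-tools; the function name missing_or_empty is hypothetical, and the file names are simply the ones used in the commands on this page.

```python
from pathlib import Path

# Intermediate files produced by the annotation steps on this page.
EXPECTED = [
    "europarl.en-es.es.tagged",
    "europarl.en-es.es.ambig",
    "europarl.en-es.es.multi-trimmed",
    "europarl.en-es.es.annotated",
]


def missing_or_empty(directory, names=EXPECTED):
    """Return the names that are absent or zero bytes long in `directory`.

    Hypothetical helper for illustration only.
    """
    base = Path(directory)
    problems = []
    for name in names:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


if __name__ == "__main__":
    # Check the current working directory and report anything suspicious.
    for name in missing_or_empty("."):
        print("missing or empty:", name)
```

If the script prints nothing, all four files exist and contain data; it says nothing about whether their contents are correct.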
Now we have a pseudo-parallel corpus where each possible translation is scored. We start by extracting a frequency lexicon, converting it to rules and compiling them:
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq
python3 ~/source/apertium/apertium-lex-tools/scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx
lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin
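To give an intuition for what a fractional frequency lexicon holds, here is a minimal Python sketch. It is not the code of biltrans-extract-frac-freq.py and does not read the file formats used above. The input layout (a list of sentences, each a list of scored paths), the function name fractional_counts and the example words are all assumptions made for illustration. The idea shown is that the scores of the paths for one sentence are normalised to sum to 1, and each translation choice is credited with the share of the path it appears in.

```python
from collections import defaultdict


def fractional_counts(sentences):
    """Sum normalised path scores per (source word, translation) pair.

    `sentences` is a list with one item per sentence. Each item is a list
    of paths, and a path is (score, [(source_word, translation), ...]).
    This layout is hypothetical and only serves the illustration.
    """
    counts = defaultdict(float)
    for paths in sentences:
        total = sum(score for score, _ in paths)
        if total <= 0:
            continue  # nothing to distribute for this sentence
        for score, choices in paths:
            share = score / total  # fraction of the sentence's total score
            for source_word, translation in choices:
                counts[(source_word, translation)] += share
    return dict(counts)


# Toy data: two sentences, each with two competing paths.
example = [
    [
        (3.0, [("station", "estación")]),
        (1.0, [("station", "emisora")]),
    ],
    [
        (1.0, [("station", "estación")]),
        (1.0, [("station", "emisora")]),
    ],
]

lexicon = fractional_counts(example)
for (source_word, translation), freq in sorted(lexicon.items()):
    print(source_word, translation, freq)
```

In this toy input the first sentence favours one translation three to one and the second sentence is undecided, so the counts are fractional rather than whole numbers.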
From here on, we have two paths we can choose. We can extract rules using a maximum entropy classifier, or we can extract rules based only on the scores provided by irstlm-ranker.
Direct rule extraction
When using this method, we continue directly by extracting ngrams from the pseudo-parallel corpus:
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-ngrams.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > ngrams
Next, we prune the generated ngrams:
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pruning-frac.py europarl.en-es.freq ngrams > patterns
Finally, we generate and compile lexical selection rules, thresholding them by their irstlm score:
crisphold=1; python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules.py patterns $crisphold > patterns.lrx
lrx-comp patterns.lrx patterns.lrx.bin
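The thresholding step can be pictured with a short Python sketch. The exact comparison that ngrams-to-rules.py performs with the crisphold value is not shown on this page, so the sketch simply assumes that a pattern is kept when its score is greater than or equal to the threshold. The tuple layout, the function name threshold_patterns and the example scores are hypothetical.

```python
def threshold_patterns(patterns, threshold):
    """Return the patterns whose score is at least `threshold`.

    Each pattern is an (ngram, translation, score) tuple. This layout is
    hypothetical and is not the file format used by the real scripts.
    """
    return [p for p in patterns if p[2] >= threshold]


# Toy patterns with made-up scores.
toy_patterns = [
    ("radio station", "emisora", 2.5),
    ("the station", "estación", 0.4),
    ("train station", "estación", 1.0),
]

kept = threshold_patterns(toy_patterns, threshold=1)
for ngram, translation, score in kept:
    print(ngram, translation, score)
```

Under this assumption, raising the threshold yields fewer rules, each backed by a higher score.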
Maximum entropy rule extraction
When extracting rules using a maximum entropy criterion, we first extract features which we are going to feed to a classifier:
python3 ~/source/apertium/apertium-lex-tools/scripts/biltrans-count-patterns-frac-maxent.py europarl.en-es.freq europarl.en-es.es.ambig europarl.en-es.es.annotated > events 2>ngrams
We then train classifiers which, as a side effect, score how much each ngram contributes to a certain translation ($YASMET is assumed to hold the path to your yasmet binary):
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
cat events.trimmed | python ~/source/apertium/apertium-lex-tools/scripts/merge-all-lambdas.py $YASMET > all-lambdas
python3 ~/source/apertium/apertium-lex-tools/scripts/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all
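To make the role of the lambdas concrete, here is a minimal Python sketch of how a maximum entropy model turns feature weights into translation probabilities: the weights of the active features are summed per translation, exponentiated and normalised. It does not read the all-lambdas or rules-all files; the dictionary layout, the function name maxent_probabilities and the toy weights are assumptions made for illustration.

```python
import math


def maxent_probabilities(lambdas, active_features):
    """Return p(translation | context) under a maximum entropy model.

    `lambdas` maps (feature, translation) to a weight, and
    `active_features` is the set of ngram features present in the
    context. Both are toy, hypothetical structures.
    """
    translations = sorted({t for _, t in lambdas})
    scores = {
        t: sum(lambdas.get((f, t), 0.0) for f in active_features)
        for t in translations
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Made-up weights: a positive lambda pushes towards a translation,
# a negative one pushes away from it.
toy_lambdas = {
    ("radio station", "emisora"): 1.2,
    ("radio station", "estación"): -0.3,
    ("train station", "estación"): 0.9,
}

probs = maxent_probabilities(toy_lambdas, {"radio station"})
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

An ngram with a large positive lambda for one translation is therefore a good candidate for a rule selecting that translation, which is what the remaining steps extract.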
Finally, we extract ngrams:
python3 ~/source/apertium/apertium-lex-tools/scripts/lambdas-to-rules.py europarl.en-es.freq rules-all > ngrams-all
trim them:
python3 ~/source/apertium/apertium-lex-tools/scripts/ngram-pareto-trim.py ngrams-all > ngrams-trimmed
and generate and compile lexical selection rules:
python3 ~/source/apertium/apertium-lex-tools/scripts/ngrams-to-rules-me.py ngrams-trimmed > europarl.en-es.lrx
lrx-comp europarl.en-es.lrx europarl.en-es.lrx.bin