Difference between revisions of "Generating lexical-selection rules from monolingual corpora"

Revision as of 05:47, 23 September 2013

Prerequisites

apertium-lex-tools
IRSTLM
A language pair (e.g. apertium-br-fr)
- The language pair should have the following two modes, prefixed with the pair name (e.g. en-es-multi and en-es-pretransfer):
  - -multi: all the modules after lexical transfer (see apertium-mk-en/modes.xml)
  - -pretransfer: all the modules up to lexical transfer (see apertium-mk-en/modes.xml)
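
Before starting, it can help to confirm that both modes are actually defined. The following is a minimal sketch, assuming modes.xml uses `<mode name="...">` elements and that the modes are named by prefixing the pair name as in the commands on this page; the function name `missing_modes` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def missing_modes(modes_xml_text, pair="en-es"):
    """Return the required mode names that are absent from a modes.xml document."""
    root = ET.fromstring(modes_xml_text)
    # Collect the name attribute of every <mode> element.
    present = {mode.get("name") for mode in root.iter("mode")}
    required = [pair + "-pretransfer", pair + "-multi"]
    return [name for name in required if name not in present]


# Illustrative input only, not copied from a real language pair.
SAMPLE = """<modes>
  <mode name="en-es" install="yes"><pipeline/></mode>
  <mode name="en-es-pretransfer" install="no"><pipeline/></mode>
</modes>"""

print(missing_modes(SAMPLE))  # ['en-es-multi']
```

For a real pair, read the text from the pair's own modes.xml and pass it to the function; an empty list means both modes are present.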

Annotation

Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.

We're going to do the example with EuroParl and the English to Spanish pair in Apertium.

Assuming everything listed above is installed, the steps are as follows:

Take your corpus and make a tagged version of it:

cat europarl.es-en.es | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer > europarl.en-es.es.tagged

Make an ambiguous version of your corpus and trim redundant tags:

cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n > europarl.en-es.es.ambig
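
To get a feel for what the ambiguous corpus contains, the sketch below lists the words that have more than one translation. It assumes the usual Apertium stream format, in which each lexical unit looks like `^source/translation1/translation2$`; the sample line is invented for illustration, and escaped slashes inside lemmas are not handled.

```python
import re

# One lexical unit: everything between ^ and $ (non-greedy).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def ambiguous_units(line):
    """Return (source, [translations]) for units with more than one translation."""
    result = []
    for unit in LEXICAL_UNIT.findall(line):
        # Assumption: "/" only separates the source side from its translations.
        source, *targets = unit.split("/")
        if len(targets) > 1:
            result.append((source, targets))
    return result


# Invented example line, not taken from EuroParl.
line = "^the<det>/el<det>$ ^station<n><sg>/estación<n><f><sg>/emisora<n><f><sg>$"
for source, targets in ambiguous_units(line):
    print(source, "->", targets)
```

Only `station` is reported here, because `the` has a single translation and therefore needs no lexical selection.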

Next, generate all the possible disambiguation paths while trimming redundant tags:

cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n > europarl.en-es.es.multi-trimmed

Translate and score all possible disambiguation paths:

cat europarl.en-es.es.tagged | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n | \
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | \
~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm europarl.en-es.es.multi-trimmed -f > europarl.en-es.es.annotated
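
The idea of scoring every path can be illustrated with a small sketch that turns language-model scores for the candidate translations of one sentence into fractional weights. This is an assumption about how such scores might be normalised, not a description of what irstlm-ranker does internally; it assumes the scores are log10 probabilities, and the function name `fractional_counts` is hypothetical.

```python
def fractional_counts(log_probs):
    """Turn per-path log10 LM scores for one sentence into weights that sum to 1."""
    # Subtract the best score first so the powers of ten stay in a safe range.
    top = max(log_probs)
    weights = [10 ** (lp - top) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate paths; the last one is ten times less likely than the others.
scores = [-12.0, -12.0, -13.0]
print(fractional_counts(scores))
```

Paths with equal scores receive equal weight, and a path one log10 unit worse receives a tenth of the weight of the best path.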

Now we have a pseudo-parallel corpus where each possible translation is scored. We start by extracting a frequency lexicon, then generate lexical-selection rules from it and compile them:

python3 ~/source/apertium/apertium-lex-tools-scripts/biltrans-extract-frac-freq.py europarl.en-es.es.ambig europarl.en-es.es.annotated > europarl.en-es.freq

python3 ~/source/apertium/apertium-lex-tools-scripts/extract-alig-lrx.py europarl.en-es.freq > europarl.en-es.freq.lrx

lrx-comp europarl.en-es.freq.lrx europarl.en-es.freq.lrx.bin
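
Conceptually, a frequency lexicon adds up the fractional weight each translation received across the corpus and treats the heaviest translation as the default. The sketch below illustrates that idea only; the input triples, function names and numbers are invented, and it does not reproduce the file formats read or written by the scripts above.

```python
def build_lexicon(observations):
    """Sum fractional weights per (source word, translation) pair."""
    freq = {}
    for source, translation, weight in observations:
        by_translation = freq.setdefault(source, {})
        by_translation[translation] = by_translation.get(translation, 0.0) + weight
    return freq


def default_translation(freq, source):
    """Pick the translation with the largest accumulated weight."""
    options = freq[source]
    return max(options, key=options.get)


# Invented observations: (source word, translation, fractional weight).
observations = [
    ("station<n>", "estación<n>", 0.7),
    ("station<n>", "emisora<n>", 0.3),
    ("station<n>", "emisora<n>", 0.2),
]
lexicon = build_lexicon(observations)
print(default_translation(lexicon, "station<n>"))  # estación<n>
```

Here `estación` accumulates 0.7 against roughly 0.5 for `emisora`, so it would be chosen as the default.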

From here on, there are two options: extract rules using a maximum entropy classifier, or extract rules based only on the scores provided by irstlm-ranker.

Rule-extraction

First extract the default translations:

Then the ngram partial counts:

Finding the best threshold
