Generating lexical-selection rules from monolingual corpora


This page describes how to generate lexical selection rules without relying on a parallel corpus.


You will need:

  • apertium-lex-tools
  • A language pair (e.g. apertium-br-fr)
    • The language pair should have the following two modes:
      • -multi, which runs all the modules after lexical transfer (see apertium-mk-en/modes.xml)
      • -pretransfer, which runs all the modules up to lexical transfer (see apertium-mk-en/modes.xml)


Important: If you don't want to go through the whole process step by step, you can use the Makefile script provided in the last section of this page.

The example uses the EuroParl corpus and the English to Spanish pair in Apertium (apertium-en-es).

Assuming everything listed above is installed, the steps are as follows:

Take your corpus and make a tagged version of it:

cat | apertium-destxt | apertium -f none -d ~/source/apertium/apertium-en-es en-es-pretransfer >

Make an ambiguous version of your corpus and trim redundant tags:

cat | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -b -f -t -n >

Next, generate all the possible disambiguation paths while trimming redundant tags:

cat | ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -t -n >
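To make "all the possible disambiguation paths" concrete, here is a minimal sketch of the idea in Python. It is not the multitrans implementation. It assumes each lexical unit arrives in the Apertium stream format `^source/target1/target2$`, and the function name `expand_paths` and the example words are made up for illustration. Each path picks exactly one target per lexical unit, so the number of paths is the product of the number of targets per unit.

```python
import itertools
import re

# Matches one lexical unit in Apertium stream format: ^source/target1/target2$
# (assumption: no escaped "/" or "$" inside the unit; a real parser must handle those)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def expand_paths(line):
    """Return every sentence obtained by picking one target per lexical unit."""
    units = []
    for match in LEXICAL_UNIT.finditer(line):
        source, *targets = match.group(1).split("/")
        # A unit with no translation keeps the source form as its only option.
        units.append([(source, t) for t in targets] or [(source, source)])
    paths = []
    # The Cartesian product enumerates one choice per unit, in order.
    for choice in itertools.product(*units):
        paths.append(" ".join(f"^{s}/{t}$" for s, t in choice))
    return paths


# Hypothetical input: "bank" has two candidate translations, "the" has one.
example = "^the<det><def>/el<det><def>$ ^bank<n><sg>/banco<n><m><sg>/orilla<n><f><sg>$"
for path in expand_paths(example):
    print(path)
```

With the example line above, two paths are printed, one per translation of "bank". Because the count grows multiplicatively with sentence length, long sentences with many ambiguous words can produce a very large number of paths.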

Translate and score all possible disambiguation paths:

cat | python ~/source/apertium/apertium-lex-tools/multitrans ~/source/apertium/apertium-en-es/en-es.autobil -m -f -n |
apertium -f none -d ~/source/apertium/apertium-en-es en-es-multi | ~/source/apertium/apertium-lex-tools/irstlm-ranker ~/source/corpora/lm/en.blm -f >

Now we have a pseudo-parallel corpus where each possible translation is scored. We start by extracting a frequency lexicon:

	python3 ~/source/apertium/apertium-lex-tools-scripts/ > europarl.en-es.freq
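As an illustration of how a frequency lexicon might be derived from scored candidates, the sketch below accumulates fractional counts. It is not the script referenced above. The input structure, the function name `fractional_counts`, and the example scores are all hypothetical, and the scores are assumed to be positive weights (a log-probability from a language model would need to be converted first). Each source sentence contributes a total weight of 1.0, split among its candidate translations in proportion to their scores.

```python
from collections import defaultdict


def fractional_counts(sentences):
    """Accumulate score-weighted counts of (source, target) choices.

    `sentences` is a hypothetical in-memory structure: one list per source
    sentence, holding (score, choices) pairs, where `choices` lists the
    (source word, chosen translation) pairs made along that path.
    """
    counts = defaultdict(float)
    for candidates in sentences:
        total = sum(score for score, _ in candidates)
        if total <= 0:
            continue  # nothing to normalise against
        for score, choices in candidates:
            weight = score / total  # the weights of one sentence sum to 1.0
            for pair in choices:
                counts[pair] += weight
    return dict(counts)


# Made-up scores for two sentences that each contain the ambiguous word "bank".
corpus = [
    [
        (0.75, [("bank<n>", "banco<n>")]),
        (0.25, [("bank<n>", "orilla<n>")]),
    ],
    [
        (0.5, [("bank<n>", "banco<n>")]),
        (0.5, [("bank<n>", "orilla<n>")]),
    ],
]
for (source, target), count in sorted(fractional_counts(corpus).items()):
    print(f"{count:.2f} {source} {target}")
```

In this toy corpus "banco" ends up with a count of 1.25 and "orilla" with 0.75, so "banco" would be the more frequent translation of "bank".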


First extract the default translations:

Then the ngram partial counts:

Finding the best threshold