Generating lexical-selection rules from monolingual corpora

From Apertium
Revision as of 12:57, 27 September 2012 by Francis Tyers

This page describes how to generate lexical selection rules without relying on a parallel corpus.

Prerequisites

  • apertium-lex-tools
  • IRSTLM
  • A language pair (e.g. apertium-br-fr)
    • The language pair should have the following two modes:
      • -multi, which runs all the modules after lexical transfer
      • -pretransfer, which runs all the modules up to lexical transfer

Annotation

Take your corpus and run it through the lexical transfer:

cat $(CORPUS).$(DIR).txt | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-pretransfer | lt-proc -b $(DATA)/$(AUTOBIL) > $@

Then keep only the lines that have more than one and fewer than 10,000 possible translations, that contain an ambiguous noun, verb or adjective, and that have at least 90% coverage by the morphological analyser.

cat $< | python3 $(SCRIPTS)/trim-fertile-lines.py | python3 $(SCRIPTS)/biltrans-line-only-pos-ambig.py | python3 $(SCRIPTS)/biltrans-trim-uncovered.py > $@
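
To make the first filter concrete, here is a minimal sketch of how the number of translations of a line could be counted. It is not the code of trim-fertile-lines.py, which may work differently. It assumes the usual lt-proc -b output, where each lexical unit looks like ^source/target1/target2$, and it ignores escaped characters. The names count_paths and keep_line are hypothetical.

```python
import re

# One lexical unit in lt-proc -b output: ^source/target1/target2$
# Simplification: escaped characters such as \/ or \$ are not handled.
LU = re.compile(r"\^(.*?)\$")


def count_paths(line):
    """Return the number of translation paths for one biltrans line."""
    paths = 1
    for unit in LU.findall(line):
        # The first field is the source side, the rest are target alternatives.
        targets = unit.split("/")[1:]
        # A unit with no target side still contributes exactly one path.
        paths *= max(len(targets), 1)
    return paths


def keep_line(line, low=1, high=10000):
    """Keep lines with more than `low` and fewer than `high` paths."""
    return low < count_paths(line) < high
```

A line with one unambiguous unit and one unit with two target alternatives has 1 × 2 = 2 paths, so it is kept. A fully unambiguous line has a single path and is dropped.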

Generate all the possible disambiguation paths:

cat $< | python $(SCRIPTS)/biltrans-to-multitrans-line-recursive.py > $@
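
The expansion step amounts to taking the Cartesian product of the target alternatives of every lexical unit. The sketch below illustrates the idea with itertools.product rather than the recursion the script name suggests. It does not reproduce the output format of biltrans-to-multitrans-line-recursive.py. The function name expand_paths is hypothetical, and the input is assumed to be lt-proc -b output without escaped characters.

```python
import itertools
import re

# One lexical unit in lt-proc -b output: ^source/target1/target2$
LU = re.compile(r"\^(.*?)\$")


def expand_paths(line):
    """Return an iterator over disambiguation paths.

    Each path is a tuple of (source, target) pairs, one pair per lexical unit.
    """
    choices = []
    for unit in LU.findall(line):
        source, *targets = unit.split("/")
        # Assumption: a unit without a target side is passed through unchanged.
        choices.append([(source, t) for t in targets] or [(source, source)])
    return itertools.product(*choices)
```

Because the number of paths is the product of the alternatives per unit, it grows quickly with sentence length, which is why very fertile lines are trimmed in the previous step.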

Translate all possible disambiguation paths:

cat $< | apertium -f none -d $(DATA) $(DIR)-multi > $@

Score all the possible disambiguation paths with IRSTLM.
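
The page does not give the scoring command, and none is supplied here. As an illustration of what can be done with the scores afterwards, the sketch below normalises the language model scores of all the paths of one sentence so that they sum to 1, which is one plausible way of obtaining the partial counts used in rule extraction. It assumes the scores are log10 probabilities. The function name fractional_counts is hypothetical.

```python
def fractional_counts(log10_scores):
    """Turn log10 LM scores for the paths of one sentence into weights summing to 1."""
    if not log10_scores:
        return []
    best = max(log10_scores)
    # Subtract the best score before exponentiating to avoid numerical underflow.
    weights = [10.0 ** (score - best) for score in log10_scores]
    total = sum(weights)
    return [weight / total for weight in weights]
```

With this normalisation, a path whose score is one log10 unit lower than the best path receives a tenth of the best path's weight.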


Rule-extraction

First extract the default translations:


Then the ngram partial counts:


Finding the best threshold