Generating lexical-selection rules


Preparation

Wikipedia

Wikipedia database dumps can be downloaded from download.wikimedia.org (newer dumps are published at dumps.wikimedia.org). For example, for the Icelandic Wikipedia:

$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2

You will need to turn the dump into a clean corpus, with roughly one sentence per line. There are scripts in apertium-lex-learner/wikipedia that do this for various Wikipedias; they will probably need some supervision. You will also need the morphological analyser from your language pair, passed as the second argument to the script.

$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin > is.crp.txt
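
The output should be plain text with roughly one sentence per line; the sentences below are illustrative:

Reykjavík er höfuðborg Íslands.
Hún er stærsta borg landsins.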

Then tag the corpus:

$ cat is.crp.txt | apertium -d ~/source/apertium/trunk/apertium-is-en is-en-tagger > is.tagged.txt
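
The tagged corpus is in the Apertium stream format, one ^lemma<tags>$ lexical unit per word. The lemmas and tags below are illustrative; the real tagset comes from apertium-is-en:

^Reykjavík<np><top>$ ^vera<vbser><pri><p3><sg>$ ^höfuðborg<n><f><sg><nom>$ ^Ísland<np><top><gen>$^.<sent>$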

Language model

It is assumed that you have a language model left over, probably from previous NLP experiments, that it is in IRSTLM binary format, and that it is called en.blm.
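
If you do not have one, the following is a minimal sketch of building one with IRSTLM, assuming the IRSTLM tools are on your PATH and that you have a tokenised English corpus (the file name en.crp.txt is illustrative):

$ add-start-end.sh < en.crp.txt > en.se.txt   # add sentence-boundary markers
$ build-lm.sh -i en.se.txt -o en.lm.gz -n 5   # estimate a 5-gram language model
$ compile-lm en.lm.gz en.blm                  # compile it to binary format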

Steps

Ambiguate source language corpus

For every sentence containing a word with more than one possible translation in the bilingual dictionary, generate all the translation variants, translate them through the rest of the pipeline, and rank the resulting translations with the target-language model:

$ cat is.tagged.txt | python generate_sl_ambig_corpus.py apertium-is-en.is-en.dix lr > is.ambig.txt
$ cat is.ambig.txt | sh translator_pipeline.sh > is.translated.txt
$ cat is.translated.txt | irstlm-ranker en.blm > is.ranked.txt
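
The script translator_pipeline.sh should run the remainder of the translation pipeline over the ambiguous input. The following is a sketch for apertium-is-en, assuming a standard chunking pipeline whose input has already been through bilingual lookup; the paths are illustrative, and the exact programs and file names should be checked against the pair's modes.xml:

#!/bin/sh
# Run the target-language half of the is-en pipeline on stdin.
DATA=$HOME/source/apertium/trunk/apertium-is-en

apertium-transfer -b $DATA/apertium-is-en.is-en.t1x $DATA/is-en.t1x.bin |\
apertium-interchunk $DATA/apertium-is-en.is-en.t2x $DATA/is-en.t2x.bin |\
apertium-postchunk $DATA/apertium-is-en.is-en.t3x $DATA/is-en.t3x.bin |\
lt-proc -g $DATA/is-en.autogen.bin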

Extract candidate phrases

From the ranked output, extract the translations that the language model prefers; the first argument (here 0.1) is a threshold on the ranker score:

$ cat is.ranked.txt | python extract_candidate_phrases.py 0.1 > en.candidates.txt

Generate candidate rules

From the ambiguous corpus and the candidate phrases, generate the lexical-selection rules that would select each candidate translation:

$ python generate_candidate_rules.py is.ambig.txt en.candidates.txt > is.rules.txt

Score candidate rules

Compile the candidate rules with cg-comp, along with an empty constraint-grammar file (empty.rlx) that is used to produce the baseline translations, i.e. the variants where every word gets its default translation (those marked :0 in the ambiguous corpus). Each candidate rule is then scored by the difference it makes with respect to this baseline:

$ cg-comp is.rules.txt is.rules.bin
$ cg-comp empty.rlx empty.rlx.bin
$ cat is.ambig.txt | grep -e '^\[[0-9]\+:0:' | sed 's/:0</</g' | cg-proc empty.rlx.bin > is.baseline.txt
$ mkdir ranking
$ python generate_rule_diffs.py is.baseline.txt is.rules.txt is.rules.bin translator_pipeline.sh ranking
$ python rank_candidate_rules.py is.baseline.txt is.rules.txt translator_pipeline.sh ranking
$ python aggregate_rule_ranks.py is.rules.txt ranking