Generating lexical-selection rules
Preparation
Wikipedia
Wikipedia dumps can be downloaded from download.wikimedia.org. For example, for the Icelandic Wikipedia:
$ wget http://download.wikimedia.org/iswiki/20100306/iswiki-20100306-pages-articles.xml.bz2
You will need to make a clean corpus from the Wikipedia dump, with roughly one sentence per line. There are scripts in apertium-lex-learner/wikipedia that do this for different Wikipedias; they will probably need some supervision. The script takes the dump as its first argument and the morphological analyser from your language pair as its second.
$ sh strip-wiki-markup.sh iswiki-20100306-pages-articles.xml.bz2 is-en.automorf.bin > is.crp.txt
Then tag the corpus:
$ cat is.crp.txt | apertium -d ~/source/apertium/trunk/apertium-is-en is-en-tagger > is.tagged.txt
Language model
It is assumed that you have a language model left over, probably from previous NLP experiments, and that this model is in IRSTLM binary format and is called en.blm. Instructions on how to make one of these may be added here in future.
Steps
Ambiguate source language corpus
In the first stage, we take the source-language corpus, expand it using the bilingual dictionary to generate all possible translation paths, translate each path, and score the result against the target-language model.
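The expansion into translation paths can be pictured as a Cartesian product over the translation options of each word. The following is only a minimal sketch of that idea, not the logic of generate_sl_ambig_corpus.py: the function name `expand_paths`, the dictionary layout and the toy tokens are all assumptions made for illustration.

```python
from itertools import product


def expand_paths(tokens, bidix):
    """Yield (path_id, words) for every combination of translation choices.

    `bidix` is a hypothetical mapping from a source token to its list of
    target-language options; tokens missing from it are passed through
    unchanged, i.e. treated as unambiguous.
    """
    options = [bidix.get(tok, [tok]) for tok in tokens]
    for path_id, choice in enumerate(product(*options)):
        yield path_id, list(choice)


# Toy data, invented for this sketch: two ambiguous tokens with 2 and 3
# options give 2 * 3 = 6 paths.
toy_bidix = {"w1": ["been-called", "promised"],
             "w3": ["compose", "negotiate", "write"]}

for path_id, words in expand_paths(["w1", "w2", "w3"], toy_bidix):
    print(path_id, " ".join(words))
```

The number of paths grows multiplicatively with the number of ambiguous words, so a real implementation would probably need to cap or prune long sentences.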
$ cat is.tagged.txt | python generate_sl_ambig_corpus.py apertium-is-en.is-en.dix lr > is.ambig.txt
$ cat is.ambig.txt | sh translator_pipeline.sh > is.translated.txt
$ cat is.translated.txt | irstlm-ranker en.blm > is.ranked.txt
This gives us a set of sentences in the target language with attached scores.
-3.15967 || [4:0:4:10 || ].[] Finland and Sweden own near 700 years of joint history.
-2.89170 || [4:1:4:10 || ].[] Finland and Sweden have near 700 years of joint history.
-3.80183 || [15:0:3:13 || ].[] Now when counted that about 270 Saimaa - hringanórar are on life.
-3.81782 || [15:1:3:13 || ].[] Now when reckoned that about 270 Saimaa - hringanórar are on life.
-3.01545 || [39:0:1:7 || ].[] The universal suffrage was come on 1918-1921.
-3.30002 || [39:1:1:7 || ].[] The common suffrage was come on 1918-1921.
-3.26693 || [39:2:1:7 || ].[] The general suffrage was come on 1918-1921.
-2.74306 || [20:0:5:10 || ].[] Swedish is according to laws the only official language on Álandseyjum.
-3.29975 || [20:1:5:10 || ].[] Swedish is according to laws the alone official language on Álandseyjum.
-3.30206 || [60:0:3:9 || ].[] Nominative case is only of four falls in Icelandic.
-3.56592 || [60:1:3:9 || ].[] Nominative case is alone of four falls in Icelandic.
-2.50803 || [147:0:6:9 || ].[] The state, the municipality and the town are called all Ósló.
-2.96652 || [147:1:6:9 || ].[] The state, the municipality and the town promise all Ósló.
-3.56343 || [154:0:1,9:11 || ].[] Been called is seldom used about authors the ones that novels compose.
-3.56343 || [154:1:1,9:11 || ].[] Been called is seldom used about authors the ones that novels negotiate.
-3.56343 || [154:2:1,9:11 || ].[] Been called is seldom used about authors the ones that novels write.
-3.57915 || [154:3:1,9:11 || ].[] Promised is seldom used about authors the ones that novels compose.
-3.57915 || [154:4:1,9:11 || ].[] Promised is seldom used about authors the ones that novels negotiate.
-3.57915 || [154:5:1,9:11 || ].[] Promised is seldom used about authors the ones that novels write.
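To make the ranked format easier to work with, here is a small sketch that parses such lines and keeps the best-scoring path per sentence. It rests on two assumptions read off the sample rather than taken from the tools: that the bracketed header holds sentence id, path number, position(s) of the ambiguous word(s) and sentence length, in that order, and that a higher (less negative) score is better. The names `parse_ranked_line` and `best_paths` are hypothetical.

```python
import re

# score || [sent_id:path_id:positions:length || ].[] translated text
LINE_RE = re.compile(
    r'^(-?\d+(?:\.\d+)?) \|\| \[(\d+):(\d+):([\d,]+):(\d+) \|\| \]\.\[\] (.*)$'
)


def parse_ranked_line(line):
    """Return a dict for one ranked line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    score, sent_id, path_id, positions, length, text = m.groups()
    return {
        "score": float(score),
        "sent_id": int(sent_id),
        "path_id": int(path_id),
        # Assumed meaning: indices of the ambiguous words in the sentence.
        "positions": [int(p) for p in positions.split(",")],
        # Assumed meaning: sentence length in tokens.
        "length": int(length),
        "text": text,
    }


def best_paths(lines):
    """Map each sentence id to its highest-scoring record (assumed best)."""
    best = {}
    for line in lines:
        rec = parse_ranked_line(line)
        if rec is None:
            continue
        cur = best.get(rec["sent_id"])
        if cur is None or rec["score"] > cur["score"]:
            best[rec["sent_id"]] = rec
    return best


sample = [
    "-3.15967 || [4:0:4:10 || ].[] Finland and Sweden own near 700 years of joint history.",
    "-2.89170 || [4:1:4:10 || ].[] Finland and Sweden have near 700 years of joint history.",
]
for sent_id, rec in best_paths(sample).items():
    print(sent_id, rec["path_id"], rec["text"])
```

On the two sample lines for sentence 4, this keeps path 1 ("have"), since -2.89170 is greater than -3.15967.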
Extract candidate phrases
$ cat is.ranked.txt | python extract_candidate_phrases.py 0.1 > en.candidates.txt
Generate candidate rules
$ python generate_candidate_rules.py is.ambig.txt en.candidates.txt > is.rules.txt
Score candidate rules
$ cg-comp is.rules.txt is.rules.bin
$ cg-comp empty.rlx empty.rlx.bin
$ cat is.ambig.txt | grep -e '^\[[0-9]\+:0:' | sed 's/:0</</g' | cg-proc empty.rlx.bin > is.baseline.txt
$ mkdir ranking
$ python generate_rule_diffs.py is.baseline.txt is.rules.txt is.rules.bin translator_pipeline.sh ranking
$ python rank_candidate_rules.py is.baseline.txt is.rules.txt translator_pipeline.sh ranking
$ python aggregate_rule_ranks.py is.rules.txt ranking
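The grep and sed filter that builds is.baseline.txt can be hard to read at a glance. As a rough Python equivalent of just that filter: it keeps only the lines whose header names path 0 and then deletes every `:0` that directly precedes a `<`. The function name `baseline_lines` and the sample lines are invented for this sketch; the real layout of is.ambig.txt is not shown here and may differ.

```python
import re

# Same pattern as the grep: a line starting with "[<digits>:0:".
PATH_ZERO = re.compile(r'^\[[0-9]+:0:')


def baseline_lines(lines):
    """Keep path-0 lines and strip ':0' before '<', as the sed command does."""
    for line in lines:
        if PATH_ZERO.match(line):
            yield line.replace(':0<', '<')


# Placeholder input, not real is.ambig.txt content.
demo = ["[4:0:4:10 || ].[] word:0<tag>",
        "[4:1:4:10 || ].[] word:1<tag>"]
for out in baseline_lines(demo):
    print(out)
```

Only the first placeholder line survives, with `word:0<tag>` rewritten to `word<tag>`.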