Running the MaxEnt rule learning
There are two methods for learning lexical selection rules with maximum entropy models: a monolingual one and a bilingual one.
Monolingual rule learning
First run the monolingual rule learning process to obtain the .ambig and .annotated files. You also need to have yasmet compiled in your apertium-lex-tools directory.
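Before running the extraction script it can help to confirm that these prerequisites are in place. The following is a minimal sketch: check_prereqs is a hypothetical helper, not part of apertium-lex-tools, and the file names in the usage comment simply follow the example paths used on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of apertium-lex-tools).
# Usage: check_prereqs YASMET_BINARY INPUT_FILE...
# Returns 0 if the binary is executable and every input file is non-empty.
check_prereqs() {
    yasmet_bin=$1
    shift
    status=0
    if [ ! -x "$yasmet_bin" ]; then
        echo "yasmet binary not found or not executable: $yasmet_bin" >&2
        status=1
    fi
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input file: $f" >&2
            status=1
        fi
    done
    return $status
}

# Example call, assuming YASMET and DATA are set as in the script on this page:
# check_prereqs "$YASMET" "$DATA/setimes.sh-mk.ambig" "$DATA/setimes.sh-mk.annotated" || exit 1
```

The helper only reports problems; it does not attempt to build yasmet or regenerate the input files.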
Next, run the following script to extract lexical selection rules:
    # Paths and settings; adjust these to your setup.
    YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
    SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
    DATA=/home/philip/Apertium/gsoc2013/monolingual/data
    CORPUS=setimes
    PAIR=sh-mk
    MIN=$1    # first command-line argument, passed to yasmet -red

    # Count patterns: events go to stdout, n-grams to stderr.
    python3 $SCRIPTS/biltrans-count-patterns-frac-maxent.py $DATA/setimes.sh-mk.freq $DATA/setimes.sh-mk.ambig $DATA/setimes.sh-mk.annotated > events 2>ngrams
    echo -n "" > all-lambdas
    # Drop event lines matching '$ 0 #' or '$ 0.0 #'.
    cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
    # Run yasmet once per distinct key in column 1.
    for i in `cat events.trimmed|cut -f1 |sort -u`; do
        num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
        echo $num > tmp.yasmet
        cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
        cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN
        cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
        cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas
    done

    rm tmp.*

    # Merge the n-grams with the learned lambdas and convert them to rules.
    python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt
    python3 $SCRIPTS/lambdas-to-rules.py $DATA/setimes.sh-mk.freq rules-all.txt > ngrams-all.txt
    python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
The variables at the top of the script should be set in the following way:
* YASMET is the file path to a yasmet binary
* SCRIPTS is the file path to the apertium-lex-tools scripts
* DATA is the file path to the data generated by the monolingual rule extraction method
* CORPUS is the base name of the corpus file
* PAIR is the language pair
* MIN is read from the first command-line argument of the script and passed to yasmet through its -red option
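To see what the trimming step of the script does before the per-key loop starts, it can be tried on a toy file. This is a sketch under stated assumptions: the event lines below are invented purely to exercise the two grep patterns and are not real output of the counting script. The three tab-separated columns are inferred from the cut -f1, -f2 and -f3 calls in the script.

```shell
#!/bin/sh
# Toy illustration: the event lines are invented to exercise the filter;
# they are not real output of the counting scripts.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tab-separated columns, as the cut -f1 / -f2 / -f3 calls imply:
# key, header value, event text.
{
    printf 'wordA\t2\t0 $ 1 # f1 #\n'
    printf 'wordA\t2\t1 $ 0 # f2 #\n'
    printf 'wordB\t3\t0 $ 0.0 # f3 #\n'
    printf 'wordB\t3\t2 $ 0.5 # f4 #\n'
} > events

# Same patterns as in the script: drop lines containing "$ 0 #" or "$ 0.0 #".
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' events > events.trimmed

# The loop then iterates over the distinct keys that survive the filter.
keys=$(cut -f1 events.trimmed | sort -u)
echo "$keys"
```

On this toy input two of the four lines survive (the ones containing f1 and f4), and the key list is wordA followed by wordB.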
Bilingual rule learning
First follow the steps in "Generating lexical-selection rules from a parallel corpus" to obtain the (corpus).candidates.(lang-pair) file. You do not have to run the whole process; stop once you have the candidates file.
You also need to have yasmet compiled in your apertium-lex-tools directory.
Next, run the following script to obtain the lexical selection rules:
    # Paths and settings; adjust these to your setup.
    YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
    SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
    TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
    DATA=/home/philip/Apertium/gsoc2013/monolingual/data
    CORPUS=setimes
    PAIR=sh-mk
    MIN=$1    # first command-line argument, passed to yasmet -red

    # Count patterns: events go to stdout, n-grams to stderr.
    python $SCRIPTS/ngram-count-patterns-maxent.py $DATA/setimes.sh-mk.freq $TRAIN/$CORPUS.candidates.hr-mk 2>ngrams > events
    echo -n "" > all-lambdas
    # Drop event lines matching '$ 0 #' or '$ 0.0 #'.
    cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
    # Run yasmet once per distinct key in column 1.
    for i in `cat events.trimmed|cut -f1 |sort -u`; do
        num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
        echo $num > tmp.yasmet
        cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
        cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN
        cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
        cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas
    done

    rm tmp.*

    # Merge the n-grams with the learned lambdas and convert them to rules.
    python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt
    python3 $SCRIPTS/lambdas-to-rules.py $DATA/setimes.sh-mk.freq rules-all.txt > ngrams-all.txt
    python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
The variables should be set the same way as with the monolingual method. The additional variable TRAIN is the file path to the directory containing the candidates file.
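One detail of the loop in both scripts is worth checking on your own data: lines are selected with grep "^$i", which matches any line that starts with the key, so a key that is a prefix of another key would also pick up the longer key's lines. Whether this ever happens depends on the keys the counting scripts emit, which is not shown here. The sketch below uses invented keys to show the difference between the prefix match and an exact match on column 1 with awk; it is an illustration, not a change to the scripts above.

```shell
#!/bin/sh
# Toy illustration with invented keys; not real script output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

{
    printf 'kuca\t2\tevent-1\n'
    printf 'kucan\t2\tevent-2\n'
} > events.trimmed

i=kuca

# Prefix match, as in the loop: also counts the line for "kucan".
prefix_hits=$(grep -c "^$i" events.trimmed)

# Exact match on column 1 only.
exact_hits=$(awk -F '\t' -v key="$i" '$1 == key' events.trimmed | wc -l | tr -d ' ')

echo "prefix: $prefix_hits, exact: $exact_hits"
```

With these two toy lines the prefix match counts 2 lines and the exact match counts 1.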