Running the MaxEnt rule learning

There are two maximum-entropy-based methods for learning lexical selection rules: a monolingual one and a bilingual one.

Monolingual rule learning

First, run the monolingual rule learning process to obtain the .ambig and .annotated files. You should also have a compiled yasmet binary in your apertium-lex-tools folder.

Next, run the following script to extract lexical selection rules:

YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

python3 $SCRIPTS/biltrans-count-patterns-frac-maxent.py $DATA/$CORPUS.$PAIR.freq $DATA/$CORPUS.$PAIR.ambig $DATA/$CORPUS.$PAIR.annotated > events 2> ngrams
echo -n "" > all-lambdas
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' events > events.trimmed
for i in `cut -f1 events.trimmed | sort -u`; do
	num=`grep "^$i" events.trimmed | cut -f2 | head -1`
	echo $num > tmp.yasmet
	grep "^$i" events.trimmed | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
	$YASMET -red $MIN < tmp.yasmet > tmp.yasmet.$MIN
	$YASMET < tmp.yasmet.$MIN > tmp.lambdas
	sed "s/^/$i /g" tmp.lambdas >> all-lambdas
done

rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py $DATA/$CORPUS.$PAIR.freq rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög

The variables at the top of the script should be set as follows:

* YASMET is the path to a compiled yasmet binary
* SCRIPTS is the path to the apertium-lex-tools scripts directory
* DATA is the path to the data generated by the monolingual rule extraction method
* CORPUS is the base name of the corpus file
* PAIR is the language pair
* MIN is a frequency threshold passed to yasmet via its -red option; the script takes it as its first command-line argument
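For reference, the per-word grouping that the shell loop performs (building one yasmet training input per ambiguous word) can be sketched in Python. The column layout is inferred from the cut commands in the script, and the sample data below is entirely made up:

```python
from collections import OrderedDict

def group_events(lines):
    """Group tab-separated event lines by their first column, mirroring the
    shell loop's grep/cut pipeline: column 1 is the ambiguous word, column 2
    the class count, column 3 the yasmet event itself."""
    groups = OrderedDict()
    for line in lines:
        word, num_classes, event = line.rstrip("\n").split("\t", 2)
        entry = groups.setdefault(word, {"classes": num_classes, "events": []})
        entry["events"].append(event.strip())
    return groups

# Hypothetical sample lines; real events files are produced by
# biltrans-count-patterns-frac-maxent.py.
sample = [
    "koska\t2\t 0.7 0:feat_a:1 1:feat_b:0 $ 0.7 #",
    "koska\t2\t 0.3 0:feat_a:0 1:feat_b:1 $ 0.3 #",
    "list\t3\t 1.0 0:feat_c:1 $ 1.0 #",
]
for word, entry in group_events(sample).items():
    # Each group becomes one yasmet input: its class count, then its events.
    print(word, entry["classes"], len(entry["events"]))
```

The shell script itself is usually saved to a file and invoked with the threshold as its only argument, e.g. `sh extract-rules.sh 10` (the script name here is just an example).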

Bilingual rule learning

You will first need to follow Generating lexical-selection rules from a parallel corpus in order to obtain the (corpus).candidates.(lang-pair) file. (You do not have to run the whole process; stop once you have obtained the candidates file.)

You also need to have yasmet compiled in your apertium-lex-tools directory.

Next, run the following script to obtain the lexical selection rules:

YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

python3 $SCRIPTS/ngram-count-patterns-maxent.py $DATA/$CORPUS.$PAIR.freq $TRAIN/$CORPUS.candidates.hr-mk 2> ngrams > events
echo -n "" > all-lambdas
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' events > events.trimmed
for i in `cut -f1 events.trimmed | sort -u`; do
	num=`grep "^$i" events.trimmed | cut -f2 | head -1`
	echo $num > tmp.yasmet
	grep "^$i" events.trimmed | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
	$YASMET -red $MIN < tmp.yasmet > tmp.yasmet.$MIN
	$YASMET < tmp.yasmet.$MIN > tmp.lambdas
	sed "s/^/$i /g" tmp.lambdas >> all-lambdas
done

rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py $DATA/$CORPUS.$PAIR.freq rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög

The variables should be set the same way as with the monolingual method.
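Both scripts discard events with a zero weight before training. A quick way to see what that grep filter does, using made-up lines:

```shell
# Two made-up event lines; the second ends in "$ 0 #" (zero weight),
# so the same filter the scripts apply to the events file drops it.
printf 'riječ\t2\t0.8 0:f:1 $ 0.8 #\nriječ\t2\t0.0 0:f:0 $ 0 #\n' > events.sample
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' events.sample
# prints only the first line
```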