Running the MaxEnt rule learning

From Apertium
Revision as of 13:31, 17 June 2013 by Fpetkovski (talk | contribs)

Two methods based on maximum entropy models are available for learning lexical selection rules: a monolingual one and a bilingual one.

Monolingual rule learning

First, run the monolingual rule learning procedure to obtain the .ambig and .annotated files. You also need yasmet compiled in your apertium-lex-tools folder.

Next, run the following script to extract lexical selection rules:


# python3 $SCRIPTS/ $DATA/ $DATA/ $DATA/ > events 2>ngrams
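# start with an empty lambda file
echo -n "" > all-lambdas
# drop events that match the zero-valued patterns below
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
# process each distinct value of the first tab-separated field separately
for i in `cat events.trimmed|cut -f1 |sort -u`; do
	# the yasmet input starts with field 2 of the first matching line
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet;
	# followed by field 3 of every matching line, with surrounding spaces trimmed
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g'  >> tmp.yasmet;
	# run yasmet with -red $MIN, then run it again on the result to get the lambdas
	cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN; 
	cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
	# prefix each lambda line with the key it belongs to
	cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas;
done

rm tmp.*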

python3 $SCRIPTS/ ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/ $DATA/ rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.log
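The per-key loop in the script selects lines with grep "^$i", which can also pick up lines whose first field merely starts with the same characters. The following is a minimal sketch of an exact-match alternative for building the yasmet input of one key. The function name build_yasmet_input and the sample data are hypothetical and not part of apertium-lex-tools; the only assumption taken from the script is that events.trimmed has three tab-separated fields, as implied by the cut calls.

```shell
# Hypothetical helper, not part of apertium-lex-tools.
# Prints the yasmet input for one key: the value of field 2 from the first
# matching line, followed by field 3 of every matching line.
build_yasmet_input() {
    awk -F '\t' -v key="$1" '
        ($1 "") == (key "") {            # compare as strings, whole field only
            if (!seen) { print $2; seen = 1 }
            line = $3
            sub(/^ /, "", line)          # drop one leading space
            sub(/ *$/, "", line)         # drop trailing spaces
            print line
        }
    ' "$2"
}

# Made-up sample lines, only to show the difference from a prefix match.
sample=$(mktemp)
printf 'ab\t2\t f1 f2 \nabc\t3\t f3 \nab\t2\t f4\n' > "$sample"
build_yasmet_input ab "$sample"
rm -f "$sample"
```

With this sample, the line keyed abc is skipped, whereas grep "^ab" would have included it. Keys containing backslashes would need extra care, because awk interprets escape sequences in -v assignments.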

The rule extraction script uses the following variables, which should be set at the top of the script:

* YASMET is the file path to a yasmet binary
* SCRIPTS is the file path to the apertium-lex-tools scripts
* DATA is the file path to the data generated by the monolingual rule extraction method
* CORPUS is the base name of the corpus file
* PAIR is the language pair
* MIN is the value passed to yasmet with the -red option; it also appears in the output file names
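As an illustration, the variable block at the top of the script might look like the sketch below. Every path, name, and value here is a placeholder chosen for the example, not a default taken from apertium-lex-tools, and must be replaced with the values for your own setup.

```shell
# Placeholder values only: replace each one with your own paths and names.
YASMET=/path/to/apertium-lex-tools/yasmet     # yasmet binary (hypothetical path)
SCRIPTS=/path/to/apertium-lex-tools/scripts   # lex-tools scripts (hypothetical path)
DATA=/path/to/monolingual/data                # output of the monolingual step (hypothetical path)
CORPUS=mycorpus                               # corpus base name (made up)
PAIR=xx-yy                                    # language pair (made up)
MIN=1                                         # value passed to yasmet -red (arbitrary)

# Warn early if the yasmet binary is missing or not executable.
if [ ! -x "$YASMET" ]; then
    echo "warning: $YASMET is not an executable file" >&2
fi
```

Checking the binary before the loop starts avoids producing an empty all-lambdas file without any visible error.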

Bilingual rule learning