Difference between revisions of "Running the MaxEnt rule learning"

Revision as of 13:31, 17 June 2013

There are two methods for learning lexical selection rules with maximum entropy models: a monolingual one and a bilingual one.

Monolingual rule learning

First, run the monolingual rule learning to obtain the .ambig and .annotated files. You also need a compiled yasmet binary in your apertium-lex-tools folder.
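Before moving on, it can help to confirm that these prerequisites are in place. The following is a minimal sketch, not part of apertium-lex-tools: `check_prerequisites` is a hypothetical helper, and it assumes the input files are named `CORPUS.PAIR.freq`, `CORPUS.PAIR.ambig` and `CORPUS.PAIR.annotated`, as in the script on this page.

```shell
# Hypothetical helper, not part of apertium-lex-tools.
# Usage: check_prerequisites YASMET DATA CORPUS PAIR
check_prerequisites() {
    yasmet=$1
    data=$2
    corpus=$3
    pair=$4
    status=0
    # yasmet must be a regular file and executable
    if [ ! -f "$yasmet" ] || [ ! -x "$yasmet" ]; then
        echo "yasmet binary not found or not executable: $yasmet"
        status=1
    fi
    # .ambig and .annotated come from the previous step;
    # .freq is also read by the script on this page
    for ext in freq ambig annotated; do
        if [ ! -f "$data/$corpus.$pair.$ext" ]; then
            echo "missing input file: $data/$corpus.$pair.$ext"
            status=1
        fi
    done
    return $status
}
```

The function prints one line per missing item and returns a non-zero status if anything is missing, so a call such as `check_prerequisites "$YASMET" "$DATA" "$CORPUS" "$PAIR" || exit 1` can stop a script early.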

Next, run the following script to extract lexical selection rules:

YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

# Generate the training events (stdout) and the n-grams (stderr) used by the steps below
python3 $SCRIPTS/biltrans-count-patterns-frac-maxent.py $DATA/$CORPUS.$PAIR.freq $DATA/$CORPUS.$PAIR.ambig $DATA/$CORPUS.$PAIR.annotated > events 2>ngrams
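echo -n "" > all-lambdas
# Drop events that contain a zero value ('$ 0.0 #' or '$ 0 #')
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' events > events.trimmed
# Train one model per distinct value of the first column
for i in `cut -f1 events.trimmed | sort -u`; do
	# The second column becomes the first line of the yasmet input
	num=`grep "^$i" events.trimmed | cut -f2 | head -1`
	echo $num > tmp.yasmet
	# The third column holds the events; strip surrounding spaces
	grep "^$i" events.trimmed | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
	# Reduce the events with threshold $MIN, then train on the reduced events
	$YASMET -red $MIN < tmp.yasmet > tmp.yasmet.$MIN
	$YASMET < tmp.yasmet.$MIN > tmp.lambdas
	# Prefix each output line with its key
	sed "s/^/$i /g" tmp.lambdas >> all-lambdas
done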

rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py $DATA/$CORPUS.$PAIR.freq rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög

Set the variables at the top of the script as follows:

* YASMET is the file path to a yasmet binary
* SCRIPTS is the file path to the apertium-lex-tools scripts
* DATA is the file path to the data generated by the monolingual rule extraction method
* CORPUS is the base name of the corpus file 
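* PAIR is the language pair
* MIN is taken from the first command-line argument of the script and is passed to yasmet's -red option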
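The loop in the script selects the rows for each key with `grep "^$i"`, which is a prefix match on a regular expression: a key that is a prefix of another key also picks up the longer key's rows. Whether this matters depends on the keys in your events file. As a hedged sketch, the per-key selection could instead compare the first column exactly. The helper name `yasmet_input_for` and the sample rows are made up for illustration, and the three tab-separated columns (key, class count, event) are an assumption based on the `cut` calls in the script.

```shell
# Hypothetical helper: print the yasmet input for one key.
# Assumes three tab-separated columns: key, class count, event.
yasmet_input_for() {
    awk -F'\t' -v key="$1" '
        # Appending "" forces a string comparison, so the key must match exactly
        ($1 "") == (key "") {
            if (!seen) { print $2; seen = 1 }   # class count, printed once
            event = $3
            gsub(/^ +| +$/, "", event)          # strip surrounding spaces
            print event
        }' "$2"
}

# Made-up sample rows; real events files will look different.
sample=$(mktemp)
printf 'ab\t2\t 0 $ 1.0 # x # y \n' > "$sample"
printf 'abc\t3\t 1 $ 2.0 # x # y # z\n' >> "$sample"
printf 'ab\t2\t 1 $ 0.0 # x # y\n' >> "$sample"

# Same trimming filter as in the script above
trimmed=$(mktemp)
grep -v -e '\$ 0\.0 #' -e '\$ 0 #' "$sample" > "$trimmed"

result=$(yasmet_input_for ab "$trimmed")
echo "$result"
```

With these sample rows, the output for key `ab` is the line `2` followed by `0 $ 1.0 # x # y`. The `abc` row is not included, and the zero-valued row is removed by the trimming filter.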

Bilingual rule learning