Running the MaxEnt rule learning

There are two maximum-entropy-based methods for learning lexical selection rules: a monolingual one and a bilingual one.

==Monolingual rule learning==

First, run the steps described in [[monolingual-rule-learning]] to obtain the .ambig and .annotated files. You should also have yasmet compiled in your apertium-lex-tools folder.
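
If you want to check the prerequisites before training, a minimal sanity check might look like the sketch below. The paths are the same example paths used in the script that follows; adjust them to your own setup.

<pre>
# Example paths only -- adjust to your own checkout and data directory.
LEXTOOLS=/home/philip/Apertium/apertium-lex-tools
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
test -x $LEXTOOLS/yasmet || echo "yasmet has not been built yet"
# These are the files the training script reads.
ls $DATA/setimes.sh-mk.freq $DATA/setimes.sh-mk.ambig $DATA/setimes.sh-mk.annotated
</pre>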

Next, run the following script to extract lexical selection rules:

<pre>
# The paths below are examples; adjust them to your own setup.
YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

# Extract yasmet training events (stdout) and the n-gram list (stderr).
python3 $SCRIPTS/biltrans-count-patterns-frac-maxent.py $DATA/$CORPUS.$PAIR.freq $DATA/$CORPUS.$PAIR.ambig $DATA/$CORPUS.$PAIR.annotated > events 2>ngrams

echo -n "" > all-lambdas
# Discard entries with zero weight.
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
# Train one maximum entropy model per ambiguous word (first column of events.trimmed).
for i in `cat events.trimmed | cut -f1 | sort -u`; do
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet;
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet;
	# Two yasmet passes: the first prunes the events (-red $MIN), the second trains the model.
	cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN;
	cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
	cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas;
done

rm tmp.*

# Convert the trained weights (lambdas) into lexical selection rules.
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py $DATA/$CORPUS.$PAIR.freq rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
</pre>

The variables at the top of the script should be set in the following way:

* YASMET is the path to the yasmet binary
* SCRIPTS is the path to the apertium-lex-tools scripts directory
* DATA is the path to the data generated by the monolingual rule learning step
* CORPUS is the base name of the corpus file (setimes in the example)
* PAIR is the language pair (sh-mk in the example)
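
As an illustration only: if the script above were saved as, say, extract-rules-mono.sh (the file name is hypothetical), it would be run with the value for MIN, which is passed on to yasmet's -red option, as its single argument:

<pre>
# Produces sh-mk.ngrams-lm-10.xml (with PAIR=sh-mk and MIN=10).
bash extract-rules-mono.sh 10
</pre>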

==Bilingual rule learning==

You will need to run [[Generating lexical-selection rules from a parallel corpus]] first in order to obtain the (corpus).candidates.(lang-pair) file. You don't have to run the whole process: stop once you have obtained the candidates file.

You also need to have yasmet compiled in your apertium-lex-tools directory.
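
Before running it, it may be worth confirming that the candidates file is where the script below expects it. The path and file name here are the example values taken from the script:

<pre>
# Example path; the hr-mk suffix is the language pair of the parallel corpus.
TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
ls $TRAIN/setimes.candidates.hr-mk
</pre>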

Next, run the following script to obtain the lexical selection rules:

<pre>
# The paths below are examples; adjust them to your own setup.
YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

# Extract yasmet training events (stdout) and the n-gram list (stderr) from the candidates file.
# Adjust the hr-mk suffix to match the language pair of your parallel corpus.
python3 $SCRIPTS/ngram-count-patterns-maxent.py $DATA/$CORPUS.$PAIR.freq $TRAIN/$CORPUS.candidates.hr-mk 2>ngrams > events

echo -n "" > all-lambdas
# Discard entries with zero weight.
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed
# Train one maximum entropy model per ambiguous word (first column of events.trimmed).
for i in `cat events.trimmed | cut -f1 | sort -u`; do
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet;
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet;
	# Two yasmet passes: the first prunes the events (-red $MIN), the second trains the model.
	cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN;
	cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
	cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas;
done

rm tmp.*

# Convert the trained weights (lambdas) into lexical selection rules.
python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt

python3 $SCRIPTS/lambdas-to-rules.py $DATA/$CORPUS.$PAIR.freq rules-all.txt > ngrams-all.txt

python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
</pre>

The variables should be set in the same way as for the monolingual method; the additional variable TRAIN is the directory that contains the candidates file.
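
The .xml file produced by either script is an ordinary lexical selection rule file. Assuming you want to try the learned rules out, one option (sketched with example file names) is to compile them with lrx-comp, the rule compiler that ships with apertium-lex-tools:

<pre>
# Compile the learned rules into the binary format read by lrx-proc
# (the file names are just examples).
lrx-comp sh-mk.ngrams-lm-10.xml sh-mk.autolex.bin
</pre>

[[Category:Lexical selection]]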