Running the MaxEnt rule learning
[[Category:Lexical selection]]
''Latest revision as of 21:25, 14 February 2014''
There are two methods based on maximum entropy models for learning lexical selection rules: a monolingual one and a bilingual one.
==Monolingual rule learning==
First, run the monolingual rule learning to obtain the .ambig and .annotated files. You also need to have yasmet compiled in your apertium-lex-tools directory.
Next, run the following script to extract lexical selection rules:
<pre>
YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

python3 $SCRIPTS/biltrans-count-patterns-frac-maxent.py $DATA/setimes.sh-mk.freq $DATA/setimes.sh-mk.ambig $DATA/setimes.sh-mk.annotated > events 2>ngrams

echo -n "" > all-lambdas

cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed

for i in `cat events.trimmed | cut -f1 | sort -u`; do
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet;
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet;
	cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN;
	cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
	cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas;
done
rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt
python3 $SCRIPTS/lambdas-to-rules.py $DATA/setimes.sh-mk.freq rules-all.txt > ngrams-all.txt
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
</pre>
The variables at the top of the script should be set in the following way:
* YASMET is the file path to a yasmet binary
* SCRIPTS is the file path to the apertium-lex-tools scripts
* DATA is the file path to the data generated by the monolingual rule extraction method
* CORPUS is the base name of the corpus file
* PAIR is the language pair
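The grep step in the script above trims the events file before yasmet training: it drops every line containing <code>$ 0.0 #</code> or <code>$ 0 #</code>, presumably events whose class weight is zero and therefore carry no information. A toy illustration of just that filter, using made-up event lines (the real event format comes from the extraction scripts):

<pre>
# Two made-up event lines: one with a zero weight, one with a non-zero weight.
printf 'word\t2\tf1 $ 0.0 # f2\nword\t2\tf1 $ 1.0 # f2\n' > events

# Same filter as in the script: drop zero-weight events.
cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed

# Only the non-zero-weight line survives.
wc -l < events.trimmed
</pre>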
==Bilingual rule learning==
You will need to run [[Generating lexical-selection rules from a parallel corpus]] first in order to obtain the (corpus).candidates.(lang-pair) file (you don't have to run the whole process; stop when you have obtained the candidates file).
You also need to have yasmet compiled in your apertium-lex-tools directory.
Next, run the following script to obtain the lexical selection rules:
<pre>
YASMET=/home/philip/Apertium/apertium-lex-tools/yasmet
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
TRAIN=/home/philip/Apertium/corpora/raw/setimes-hr-mk-nikola
DATA=/home/philip/Apertium/gsoc2013/monolingual/data
CORPUS=setimes
PAIR=sh-mk
MIN=$1

python $SCRIPTS/ngram-count-patterns-maxent.py $DATA/setimes.sh-mk.freq $TRAIN/$CORPUS.candidates.hr-mk 2>ngrams > events

echo -n "" > all-lambdas

cat events | grep -v -e '\$ 0\.0 #' -e '\$ 0 #' > events.trimmed

for i in `cat events.trimmed | cut -f1 | sort -u`; do
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet;
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet;
	cat tmp.yasmet | $YASMET -red $MIN > tmp.yasmet.$MIN;
	cat tmp.yasmet.$MIN | $YASMET > tmp.lambdas
	cat tmp.lambdas | sed "s/^/$i /g" >> all-lambdas;
done
rm tmp.*

python3 $SCRIPTS/merge-ngrams-lambdas.py ngrams all-lambdas > rules-all.txt
python3 $SCRIPTS/lambdas-to-rules.py $DATA/setimes.sh-mk.freq rules-all.txt > ngrams-all.txt
python3 $SCRIPTS/ngrams-to-rules-me.py ngrams-all.txt > $PAIR.ngrams-lm-$MIN.xml 2>/tmp/$PAIR.$MIN.lög
</pre>
The variables should be set the same way as with the monolingual method; additionally, TRAIN is the directory containing the candidates file produced by the parallel-corpus method.
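The for-loop in both scripts trains one maximum entropy classifier per ambiguous source word: it enumerates the unique words in the first column of events.trimmed, and for each word builds a yasmet input whose first line is the number of translation alternatives (column 2) and whose remaining lines are that word's events (column 3). A toy sketch of just this splitting logic, with made-up lines and the yasmet calls left out:

<pre>
# Made-up events.trimmed: word <TAB> number-of-alternatives <TAB> event line
printf 'dog\t2\t d1 \ndog\t2\t d2 \ncat\t3\t c1 \n' > events.trimmed
echo -n "" > summary

for i in `cat events.trimmed | cut -f1 | sort -u`; do
	# First line of each yasmet problem: the number of translation alternatives.
	num=`cat events.trimmed | grep "^$i" | cut -f2 | head -1`
	echo $num > tmp.yasmet
	# Remaining lines: that word's events, with leading/trailing blanks stripped.
	cat events.trimmed | grep "^$i" | cut -f3 | sed 's/^ //g' | sed 's/[ ]*$//g' >> tmp.yasmet
	echo "$i: $num alternatives, $(grep -c '' tmp.yasmet) input lines" >> summary
done
rm tmp.yasmet

# prints:
#   cat: 3 alternatives, 2 input lines
#   dog: 2 alternatives, 3 input lines
cat summary
</pre>

In the real scripts, each tmp.yasmet is then piped through <code>$YASMET -red $MIN</code> (dropping features seen fewer than MIN times) and through yasmet itself to produce the per-word lambdas.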