Running the monolingual rule learning
Revision as of 22:04, 14 June 2013 by Fpetkovski
Prerequisites:
- Install apertium-lex-tools
- Install IRSTLM (http://sourceforge.net/projects/irstlm/)
- Train a target-side language model (http://sourceforge.net/apps/mediawiki/irstlm/index.php?title=User_Manual)
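Before starting a long training run, it can help to confirm that the commands used by the Makefile on this page are actually on the PATH. The following is a minimal sketch, not part of apertium-lex-tools: the function name missing_tools and the tool list are assumptions read off the Makefile recipes (standard utilities such as cat, seq, paste and cut are left out).

```python
import shutil

# Executables named in the Makefile recipes on this page.
# Note: the Makefile calls both `python` and `python3`.
REQUIRED_TOOLS = [
    "make",
    "apertium",
    "apertium-destxt",
    "lt-proc",
    "apertium-lrx-comp",
    "irstlm-ranker-frac",
    "python",
    "python3",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    the real tools installed (hypothetical helper, for illustration).
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result only shows that the executables exist; it does not check their versions or that the language model has been trained.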
Place the following Makefile in the folder where you want to run your training process:
CORPUS=setimes
DIR=sh-mk
DATA=/home/philip/Apertium/apertium-sh-mk/
AUTOBIL=sh-mk.autobil.bin
SCRIPTS=/home/philip/Apertium/apertium-lex-tools/scripts
MODEL=/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm

#all: data/$(CORPUS).$(DIR).lrx data/$(CORPUS).$(DIR).freq.lrx
all: data/$(CORPUS).$(DIR).freq.lrx.bin

data/$(CORPUS).$(DIR).lines: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	seq `cat $< | wc -l` > $@

data/$(CORPUS).$(DIR).biltrans: $(CORPUS).$(DIR).txt
	if [ ! -d data ]; then mkdir data; fi
	cat $(CORPUS).$(DIR).txt | apertium-destxt | apertium -f none -d $(DATA) $(DIR)-pretransfer | lt-proc -b $(DATA)/$(AUTOBIL) > $@

data/$(CORPUS).$(DIR).ambig: data/$(CORPUS).$(DIR).biltrans data/$(CORPUS).$(DIR).lines
	cat -n data/$(CORPUS).$(DIR).biltrans | python3 $(SCRIPTS)/trim-fertile-lines.py | python3 $(SCRIPTS)/biltrans-line-only-pos-ambig.py | python3 $(SCRIPTS)/biltrans-trim-uncovered.py > $@

data/$(CORPUS).$(DIR).multi: data/$(CORPUS).$(DIR).ambig
	cat $< | python $(SCRIPTS)/biltrans-to-multitrans-line-recursive.py > $@

data/$(CORPUS).$(DIR).unranked: data/$(CORPUS).$(DIR).multi
	cat $< | apertium -f none -d $(DATA) $(DIR)-multi > $@

data/$(CORPUS).$(DIR).ranked: data/$(CORPUS).$(DIR).unranked
	cat $< | irstlm-ranker-frac $(MODEL) > $@

data/$(CORPUS).$(DIR).annotated: data/$(CORPUS).$(DIR).multi data/$(CORPUS).$(DIR).ranked
	paste data/$(CORPUS).$(DIR).multi data/$(CORPUS).$(DIR).ranked | cut -f1-4 > $@

data/$(CORPUS).$(DIR).freq: data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-extract-frac-freq.py data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).ngrams: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated
	python3 $(SCRIPTS)/biltrans-count-patterns-ngrams.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ambig data/$(CORPUS).$(DIR).annotated > $@

data/$(CORPUS).$(DIR).patterns: data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams
	python3 $(SCRIPTS)/ngram-pruning-frac.py data/$(CORPUS).$(DIR).freq data/$(CORPUS).$(DIR).ngrams > $@

data/$(CORPUS).$(DIR).freq.lrx: data/$(CORPUS).$(DIR).freq
	python3 $(SCRIPTS)/extract-alig-lrx.py $< > $@

data/$(CORPUS).$(DIR).freq.lrx.bin: data/$(CORPUS).$(DIR).freq.lrx
	apertium-lrx-comp $< $@
In the same folder, also place your source-side corpus file. The corpus file must be named "basename"."language-pair".txt, which corresponds to $(CORPUS).$(DIR).txt in terms of the Makefile variables.
For example, with the Makefile above, the corpus file is named setimes.sh-mk.txt.
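The naming convention can be checked mechanically. The sketch below splits a corpus file name into the values expected for CORPUS and DIR. The helper split_corpus_name is hypothetical, and it assumes that the language pair is the last dot-separated component before .txt, so that any extra dots belong to the base name.

```python
from pathlib import Path


def split_corpus_name(filename):
    """Split '<basename>.<language-pair>.txt' into (basename, language_pair).

    Assumption: the language pair contains no dots, so it is taken to be
    the last dot-separated component before the '.txt' extension.
    """
    name = Path(filename).name
    if not name.endswith(".txt"):
        raise ValueError("corpus file must end in .txt: " + name)
    stem = name[: -len(".txt")]
    basename, sep, pair = stem.rpartition(".")
    if not sep or not basename or not pair:
        raise ValueError("expected <basename>.<language-pair>.txt: " + name)
    return basename, pair


if __name__ == "__main__":
    print(split_corpus_name("setimes.sh-mk.txt"))  # -> ('setimes', 'sh-mk')
```

The two returned values are what CORPUS and DIR would be set to for that file.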
Set the Makefile variables as follows: 
- CORPUS denotes the base name of your corpus file
- DIR stands for the language pair
- DATA is the path to the language resources for the language pair
- AUTOBIL is the file name of the binary bilingual dictionary for the language pair, relative to DATA (the Makefile opens it as $(DATA)/$(AUTOBIL))
- SCRIPTS denotes the path to the lex-tools scripts
- MODEL is the path to the target-side (binary) language model used for scoring the possible translations of ambiguous words
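A wrong path in one of these variables tends to surface only partway through the run, so a quick check beforehand may save time. The following sketch is an illustration only: check_settings is a hypothetical helper, and the only behaviour it takes from the Makefile is that the dictionary is looked up as $(DATA)/$(AUTOBIL).

```python
import os


def check_settings(data, autobil, scripts, model):
    """Return a list of problems with the given settings (empty if none).

    The arguments mirror the Makefile variables DATA, AUTOBIL, SCRIPTS
    and MODEL; this helper is not part of apertium-lex-tools.
    """
    problems = []
    if not os.path.isdir(data):
        problems.append("DATA is not a directory: " + data)
    # The Makefile passes $(DATA)/$(AUTOBIL) to lt-proc.
    bidix = os.path.join(data, autobil)
    if not os.path.isfile(bidix):
        problems.append("bilingual dictionary not found: " + bidix)
    if not os.path.isdir(scripts):
        problems.append("SCRIPTS is not a directory: " + scripts)
    if not os.path.isfile(model):
        problems.append("MODEL file not found: " + model)
    return problems


if __name__ == "__main__":
    # Values copied from the example Makefile; adjust to your own setup.
    for problem in check_settings(
        "/home/philip/Apertium/apertium-sh-mk/",
        "sh-mk.autobil.bin",
        "/home/philip/Apertium/apertium-lex-tools/scripts",
        "/home/philip/Apertium/corpora/language-models/mk/setimes.mk.5.blm",
    ):
        print(problem)
```

The check covers existence only; it does not verify that the dictionary or the language model is in the expected binary format.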
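Finally, running make in that folder generates the lexical selection rules for the specified language pair. With the default target, the compiled rules are written to data/$(CORPUS).$(DIR).freq.lrx.bin.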
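To see which files a default run should leave behind, the dependency chain of the all target in the Makefile on this page can be listed explicitly. This sketch only mirrors the target names in that Makefile. The function default_targets is hypothetical, and the order shown is one order that respects the dependencies, not necessarily the exact order make uses. The .ngrams and .patterns targets are not prerequisites of all, so they are not included.

```python
# Suffixes of the files that the `all` target depends on, directly or
# indirectly, listed in an order that respects the Makefile dependencies.
DEFAULT_SUFFIXES = [
    "biltrans",
    "lines",
    "ambig",
    "multi",
    "unranked",
    "ranked",
    "annotated",
    "freq",
    "freq.lrx",
    "freq.lrx.bin",
]


def default_targets(corpus, direction):
    """Return the paths built by `make all` for the given CORPUS and DIR."""
    return ["data/{}.{}.{}".format(corpus, direction, suffix)
            for suffix in DEFAULT_SUFFIXES]


if __name__ == "__main__":
    for path in default_targets("setimes", "sh-mk"):
        print(path)
```

The last entry, data/setimes.sh-mk.freq.lrx.bin in the example, is the compiled rule file produced by apertium-lrx-comp.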

