Using weights for ambiguous rules
The Idea
The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e. to allow a set of rules to match the same general input pattern. This is more effective than the existing situation, in which the first rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage. To decide which rule applies, the transfer module uses a set of predefined or pre-trained (more specific) weighted patterns provided for each group of ambiguous rules. If a specific pattern matches multiple transfer rules, the rule with the largest weight for that pattern is applied.
If no weighted pattern matches, the first rule in the XML transfer file that matches the general pattern is still considered the default and is applied.
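The selection logic described above can be sketched in a few lines of Python. This is only an illustration: the rule ids, the pattern string and the dictionary layout are hypothetical and do not reflect the module's internal data structures.

```python
def choose_rule(ambiguous_rules, weights_by_pattern, matched_pattern=None):
    """Pick one rule from a group of ambiguous rules.

    ambiguous_rules: rule ids in the order they appear in the transfer file.
    weights_by_pattern: {pattern: {rule_id: weight}} (hypothetical layout).
    matched_pattern: the weighted pattern that matched the input, or None.
    """
    weights = weights_by_pattern.get(matched_pattern, {})
    candidates = [rule for rule in ambiguous_rules if rule in weights]
    if not candidates:
        # No weighted pattern matched: the first rule in file order is the default.
        return ambiguous_rules[0]
    # max() keeps the earliest rule on ties, so file order breaks ties.
    return max(candidates, key=lambda rule: weights[rule])


rules = ["rule-1", "rule-2", "rule-3"]
weights = {"alma<n><pl>": {"rule-1": 0.2, "rule-2": 0.7, "rule-3": 0.1}}

print(choose_rule(rules, weights, "alma<n><pl>"))  # rule-2
print(choose_rule(rules, weights))                 # rule-1
```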
The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.
How to use apertium-ambiguous for your language pair
For this tutorial, we will be using the language pair apertium-kaz-tur.
Download a Wikimedia dump
Download a Wikipedia dump from http://dumps.wikimedia.org:
$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
Next, extract the text using the WikiExtractor script:
$ git clone https://github.com/apertium/WikiExtractor.git
$ cd WikiExtractor
$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2
Move the wiki.txt file that has just been extracted into the language pair directory:
$ mv wiki.txt ../apertium-kaz-tur
Install segmenter
Install the Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh
To use pragmatic_segmenter, follow these steps:
- Install Ruby 2.3 by running
sudo apt-get install ruby-full
- Run
gem install pragmatic_segmenter
This piece of code uses the segmenter to segment a corpus file and writes the segmented sentences to a file. In kazSentenceTokenizer.rb, change the two-letter code of the source language to the language desired. Here, "kk" is the code for Kazakh.
require 'pragmatic_segmenter'

# ARGV[0]: corpus file to segment; ARGV[1]: output file (opened in append mode)
File.open(ARGV[0]).each do |line1|
  ps = PragmaticSegmenter::Segmenter.new(text: line1, language: 'kk', doc_type: 'txt')
  sentences = ps.segment
  # Write one sentence per line
  File.open(ARGV[1], "a") do |line2|
    sentences.each { |sentence| line2.puts sentence }
  end
end
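The script takes the corpus path as its first argument and the output path as its second. It opens the output file in append mode, so running it twice with the same output file duplicates the sentences. A small Python wrapper can remove a stale output file before calling Ruby. This is a sketch only: the function names and the file name sentences.txt are assumptions made for illustration.

```python
import os
import subprocess


def segmenter_command(corpus, output, script="kazSentenceTokenizer.rb"):
    """Build the argv for the Ruby segmenter script."""
    return ["ruby", script, corpus, output]


def run_segmenter(corpus, output, script="kazSentenceTokenizer.rb"):
    """Delete any previous output (the script appends), then run the segmenter."""
    if os.path.exists(output):
        os.remove(output)
    subprocess.run(segmenter_command(corpus, output, script), check=True)


# Show the command that would be executed for the extracted wiki text.
print(" ".join(segmenter_command("wiki.txt", "sentences.txt")))
```

Calling run_segmenter("wiki.txt", "sentences.txt") requires Ruby and the pragmatic_segmenter gem to be installed.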
Install and build kenlm
Download and install kenlm by following the steps under 'USAGE' at https://kheafield.com/code/kenlm/
Download a big Turkish corpus from the Wikimedia dumps:
$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2
To train the language model, follow these steps:
- Estimating: run
bin/lmplz -o 5 <text >text.arpa
- Binarizing (for faster querying): run
bin/build_binary text.arpa text.binary
- Copy text.binary into the scripts subdirectory
The Python scripts (exampleken1, kenlm.pyx, genalltra.py) used to score sentences can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts. These scripts are run automatically.
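To give a feel for what scoring sentences with the language model involves, here is a minimal sketch using the KenLM Python module. It is not the code of the project's scripts: the function names are made up for illustration, and the toy scorer at the end only stands in for a real model so the example runs without text.binary.

```python
def load_kenlm_scorer(model_path="text.binary"):
    """Return a function mapping a sentence to its KenLM log10 probability."""
    import kenlm  # the KenLM Python module must be installed

    model = kenlm.Model(model_path)
    return lambda sentence: model.score(sentence, bos=True, eos=True)


def rank_by_score(candidates, score):
    """Sort candidate translations from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Toy scorer (an assumption for demonstration): fewer tokens scores higher.
toy_score = lambda sentence: -len(sentence.split())
print(rank_by_score(["bu bir elma", "bu elma"], toy_score))
```

With a trained model, rank_by_score(candidates, load_kenlm_scorer("text.binary")) would order the candidate translations by language model probability.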
Install and build yasmet
The next step is to download and compile yasmet:
Download yasmet from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc
To compile and run it, use the commands below:
g++ -o yasmet yasmet.cc
./yasmet
Apertium language pair modules
You need apertium and the language pair installed to use the language modules inside the code. The steps below only show how the apertium modules are used inside the code.
In the file CLExec.cpp, change the language pair name in the paths of the apertium tools (biltrans, lextor, interchunk, postchunk, transfer) to the pair desired. The paths themselves can also be changed. Here, the pair is kaz-tur and the path is the home directory ($HOME).
To apply the apertium tool "biltrans" on the segmented sentences:
apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans input_file output_file
To apply the apertium tool "lextor" on the output of the biltrans:
lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin inFilePath > outFilePath
To apply the apertium tool "interchunk" to that file:
apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin input_file output_file
To apply the apertium tool "postchunk" to the "interchunk" output file:
apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin input_file output_file
To apply the apertium tool "transfer" to the "postchunk" output file:
apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin input_file | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > output_file
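All five commands follow the same naming scheme, so adapting them to another pair only means changing the pair name and the directory. The Python sketch below regenerates the command lines for a given pair. The function name and its parameters are hypothetical, and input_file, output_file, inFilePath and outFilePath are left as placeholders exactly as in the commands above.

```python
def pipeline_commands(pair_dir, src="kaz", trg="tur"):
    """Return the five stage commands for a language pair, in pipeline order."""
    direction = f"{src}-{trg}"
    rules = f"{pair_dir}/apertium-{direction}.{direction}"  # rule files (.tNx)
    binary = f"{pair_dir}/{direction}"                      # compiled files (.bin)
    return [
        f"apertium -d {pair_dir} {direction}-biltrans input_file output_file",
        f"lrx-proc -m {binary}.autolex.bin inFilePath > outFilePath",
        f"apertium-interchunk {rules}.t2x {binary}.t2x.bin input_file output_file",
        f"apertium-postchunk {rules}.t3x {binary}.t3x.bin input_file output_file",
        f"apertium-transfer -n {rules}.t4x {binary}.t4x.bin input_file"
        f" | lt-proc -g {binary}.autogen.bin"
        f" | lt-proc -p {binary}.autopgen.bin > output_file",
    ]


for command in pipeline_commands("$HOME/apertium-kaz-tur"):
    print(command)
```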
Configure, build and install
cd to apertium-ambiguous before you run the commands shown below:
./autogen.sh
./configure
make
Training and Testing apertium-ambiguous
The compiled program has four modes, selected by the parameters passed to it.
- Yasmet dataset mode (with output file). Processes the input wiki file, generates its yasmet data and writes the output (analysis) of that input file.
./machine-translation input_file_name output_file_name
- Yasmet dataset mode (without output file). Processes the input wiki file and generates its yasmet data, but does not write the output (analysis) of that input file.
./machine-translation input_file_name
- Yasmet training models mode. Generates the yasmet models from the yasmet datasets by running the command "./yasmet yasmet_data yasmet_data.model" on every yasmet file in the datasets folder.
./machine-translation
- Beam search mode. Runs beam search with beam = beam_number on the input file, writes its results to the file "beamResults" and writes the output analysis to the file "output_file_name".
./machine-translation input_file_name output_file_name beam_number
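Judging from the usage lines above, the mode appears to be determined by the number of arguments alone. The following Python sketch makes that mapping explicit. It is inferred from the usage lines, not taken from the program's source, and the function name is hypothetical.

```python
def mode_for(args):
    """Map the arguments given after ./machine-translation to a mode name."""
    modes = {
        0: "yasmet training models",
        1: "yasmet dataset (without output file)",
        2: "yasmet dataset (with output file)",
        3: "beam search",
    }
    if len(args) not in modes:
        raise ValueError(f"expected 0 to 3 arguments, got {len(args)}")
    return modes[len(args)]


print(mode_for([]))                                  # yasmet training models
print(mode_for(["wiki.txt"]))                        # yasmet dataset (without output file)
print(mode_for(["wiki.txt", "out.txt"]))             # yasmet dataset (with output file)
print(mode_for(["wiki.txt", "out.txt", "8"]))        # beam search
```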
Training should be done by running
./machine-translation input-file output-file
Testing can be done by running
./machine-translation input-file output-file number-of-beam
Here, input-file is the source language (Kazakh) file, output-file is the target language (Turkish) file, and number-of-beam is the beam size, e.g. 8 or any other number.
Note: You can find the final result inside results/beamResults.txt.
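For readers unfamiliar with beam search, the sketch below shows the general technique applied to a sequence of ambiguous rule choices. It rests on assumptions: each ambiguous position offers rules with positive weights, and a combination is scored by the sum of the log weights. How apertium-ambiguous actually scores combinations is not described here, and the rule ids and weights are invented.

```python
import heapq
import math


def beam_search(choices, beam):
    """Keep the `beam` best rule combinations while walking the positions.

    choices: one list per ambiguous position of (rule_id, weight) pairs,
             with weight > 0.
    Returns up to `beam` (score, rule_sequence) pairs, best first, where
    score is the sum of log weights (an assumed scoring scheme).
    """
    hypotheses = [(0.0, [])]
    for options in choices:
        expanded = [
            (score + math.log(weight), rules + [rule])
            for score, rules in hypotheses
            for rule, weight in options
        ]
        # Prune to the `beam` highest-scoring partial combinations.
        hypotheses = heapq.nlargest(beam, expanded, key=lambda h: h[0])
    return hypotheses


choices = [
    [("r1", 0.6), ("r2", 0.4)],
    [("r3", 0.9), ("r4", 0.1)],
]
for score, rules in beam_search(choices, beam=2):
    print(rules, round(score, 3))
```

A larger beam explores more combinations at the cost of more work per position. With beam = 1 the search reduces to always taking the highest-weighted rule at each position.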
Enjoy using our project :)