Using weights for ambiguous rules

From Apertium
Revision as of 16:47, 20 November 2018 by Purplemoon (talk | contribs)


The Idea

The idea is to allow existing Apertium transfer rules to be ambiguous, i.e. to allow a set of rules to match the same general input pattern. In the existing setup, the first matching rule in the XML transfer file takes exclusive precedence and blocks all its ambiguous peers during the transfer precompilation stage, which often leads to inaccurate translation. Instead, the transfer module uses a set of predefined or pre-trained, more specific weighted patterns provided for each group of ambiguous rules. If a specific pattern matches multiple transfer rules, the rule with the largest weight for that pattern is applied.

If no weighted pattern matches, the first rule in the XML transfer file that matches the general pattern is still treated as the default and is applied.
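
The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of apertium-ambiguous: the function name select_rule, the rule identifiers, and the dictionary of weights are all hypothetical, and treating a rule without a weight as the lowest possible weight is an assumption.

```python
from typing import Dict, List, Optional


def select_rule(matching_rules: List[str],
                pattern_weights: Optional[Dict[str, float]]) -> str:
    """Pick one rule among the ambiguous rules that match an input pattern.

    matching_rules: rule ids in the order they appear in the transfer file.
    pattern_weights: rule id -> weight for the matched weighted pattern,
                     or None/empty if no weighted pattern matched.
    """
    if not matching_rules:
        raise ValueError("no rule matches the pattern")
    if not pattern_weights:
        # Fallback: the first rule in the transfer file is the default.
        return matching_rules[0]
    # Assumption: a rule with no weight ranks below every weighted rule.
    # max() returns the first maximal item, so ties go to the earlier rule.
    return max(matching_rules,
               key=lambda rule: pattern_weights.get(rule, float("-inf")))
```

For example, select_rule(["r1", "r2", "r3"], {"r1": 0.2, "r2": 0.7, "r3": 0.1}) picks "r2", while select_rule(["r1", "r2"], None) falls back to "r1".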

The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.

How to use apertium-ambiguous for your language pair

For this tutorial, we will be using the language pair apertium-kaz-tur

Download a wikimedia dump

Download a Wikipedia dump from http://dumps.wikimedia.org:

$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2

To use any other language, simply replace the occurrences of 'kk' with the 2-letter code of your language.
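
If you script the download, the substitution can be done programmatically. The helper below is a small sketch; the function name wikipedia_dump_url is hypothetical, and it assumes a purely alphabetic language code such as 'kk' or 'tr'.

```python
def wikipedia_dump_url(lang_code: str) -> str:
    """Build the URL of the latest pages-articles dump for a Wikipedia language code."""
    code = lang_code.strip().lower()
    if not code.isalpha():
        # Assumption: only plain alphabetic codes (kk, tr, ...) are handled here.
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return (f"https://dumps.wikimedia.org/{code}wiki/latest/"
            f"{code}wiki-latest-pages-articles.xml.bz2")


print(wikipedia_dump_url("kk"))
```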

Next, extract the text using the WikiExtractor script:

$ git clone https://github.com/apertium/WikiExtractor.git

$ cd WikiExtractor

$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2  

The extracted text is written to wiki.txt in the current working directory. The following steps of the project use this file.

Install segmenter

Install Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh

To use pragmatic_segmenter, do the following:

  1. Install Ruby 2.3 by running sudo apt-get install ruby-full
  2. Run gem install pragmatic_segmenter

The following script, sentenceTokenizer.rb, uses the segmenter to segment a corpus file and writes the segmented sentences to an output file. It is located at https://github.com/sevilaybayatli/apertium-ambiguous/blob/master/scripts

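require 'pragmatic_segmenter'

# Usage: ruby sentenceTokenizer.rb LANG_CODE INPUT_FILE OUTPUT_FILE
File.open(ARGV[1]).each do |line1|
    # Strip brackets, quotes, slashes and similar symbols before segmenting.
    line1.delete!('\\\(\)\[\]\{\}\<\>\|\$\/\'\"')
    ps = PragmaticSegmenter::Segmenter.new(text: line1, language: ARGV[0], doc_type: 'txt')
    sentences = ps.segment

    # Append one sentence per line to the output file.
    File.open(ARGV[2], "a") do |line2|
        sentences.each { |sentence| line2.puts sentence }
    end
end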

Break the corpus into sentences using the Ruby program sentenceTokenizer.rb, which is built on the pragmatic segmenter:

ruby2.3 sentenceTokenizer.rb $langCode $inputFile sentences.txt

For example:
ruby2.3 sentenceTokenizer.rb kk wiki.txt sentences.txt

Here kk is the language code for Kazakh, wiki.txt is the Kazakh corpus, and sentences.txt is the output file that receives the segmented sentences.
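
If you need the same character cleanup in Python (for instance, to preprocess text for the later scoring scripts), the delete! call in sentenceTokenizer.rb can be mirrored as below. This is a sketch based on reading that call as "remove the backslash, brackets, braces, angle brackets, pipe, dollar sign, slash and both quote characters"; the names STRIP_CHARS and clean_line are hypothetical, and the segmentation itself is still done by the Ruby segmenter.

```python
# Characters removed by the cleanup step: \ ( ) [ ] { } < > | $ / ' "
STRIP_CHARS = "\\()[]{}<>|$/'\""

# Translation table that deletes every character in STRIP_CHARS.
_DELETE_TABLE = str.maketrans("", "", STRIP_CHARS)


def clean_line(line: str) -> str:
    """Return the line with the listed symbols removed; other text is untouched."""
    return line.translate(_DELETE_TABLE)
```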

Apertium language pairs modules

Apertium and the language pair must be installed, because the code calls the language pair's modules. The steps below only show how the Apertium modules are used inside the code.

In the file CLExec.cpp, change the language pair name in the paths of the Apertium tools (biltrans, lextor, interchunk, postchunk, transfer) to the desired pair. The paths themselves can also be changed. Here, the pair is kaz-tur and the pair directory is located in the home directory.

To apply the apertium tool "biltrans" on the segmented sentences:

apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans  input_file output_file

To apply the apertium tool "lextor" on the output of the biltrans:

lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin inFilePath > outFilePath

To apply the apertium tool "interchunk" to that file:

apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin input_file output_file

To apply the apertium tool "postchunk" to the "interchunk" output file:

apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin input_file output_file

To apply the apertium tool "transfer" to the "postchunk" output file:

apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin input_file | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > output_file
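
To adapt these commands to another pair, only the pair directory and the pair name change. The sketch below builds the argument lists for each stage from those two values. It only constructs the commands shown above (without the input and output file arguments) and does not run them; the function name pipeline_commands and the dictionary keys are hypothetical, and the file naming is assumed to follow the kaz-tur layout.

```python
from typing import Dict, List


def pipeline_commands(pair_dir: str, src: str, trg: str) -> Dict[str, List[str]]:
    """Build the command for each stage, mirroring the kaz-tur examples."""
    pair = f"{src}-{trg}"
    rules = f"{pair_dir}/apertium-{pair}.{pair}"  # prefix of the rule files
    binary = f"{pair_dir}/{pair}"                 # prefix of the compiled files
    return {
        "biltrans": ["apertium", "-d", pair_dir, f"{pair}-biltrans"],
        "lextor": ["lrx-proc", "-m", f"{binary}.autolex.bin"],
        "interchunk": ["apertium-interchunk", f"{rules}.t2x", f"{binary}.t2x.bin"],
        "postchunk": ["apertium-postchunk", f"{rules}.t3x", f"{binary}.t3x.bin"],
        "transfer": ["apertium-transfer", "-n", f"{rules}.t4x", f"{binary}.t4x.bin"],
        "generate": ["lt-proc", "-g", f"{binary}.autogen.bin"],
        "postgenerate": ["lt-proc", "-p", f"{binary}.autopgen.bin"],
    }


for stage, command in pipeline_commands("/home/user/apertium-kaz-tur", "kaz", "tur").items():
    print(stage, " ".join(command))
```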

Install and build kenlm

Download and install kenlm by following the steps under 'USAGE' at https://kheafield.com/code/kenlm/

Download a large Turkish corpus from the Wikimedia dumps:

$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2

To train the language model, follow these steps:

  1. Estimating: run bin/lmplz -o 5 <text >text.arpa
  2. Building the binary model used for querying: run bin/build_binary text.arpa text.binary
  3. Move the binary model to the scripts directory: mv text.binary ../scripts

The Python script score-sentences.py scores the target-language sentences with the language model. It can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts.

To run the model weighting program on the transfer file, as done in the bash script:

python3 $modelWeight $LM < transfer.txt > weights.txt;
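
To give a feel for what scoring with a language model involves, here is a rough sketch. It is not the code of score-sentences.py: the function names are hypothetical, and the normalization of log10 scores into weights that sum to 1 is an assumption made for illustration. The only kenlm calls used are kenlm.Model(path) and model.score(sentence, bos=True, eos=True) from the kenlm Python module, which return a log10 probability.

```python
from typing import Callable, List


def candidate_weights(candidates: List[str],
                      score: Callable[[str], float]) -> List[float]:
    """Turn log10 language-model scores of candidate translations into weights."""
    if not candidates:
        return []
    scores = [score(sentence) for sentence in candidates]
    best = max(scores)
    # Subtract the best score before exponentiating to avoid underflow.
    unnormalized = [10.0 ** (s - best) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Wrap a KenLM binary model (e.g. text.binary) as a scoring function."""
    import kenlm  # imported lazily so candidate_weights works without kenlm
    model = kenlm.Model(model_path)
    return lambda sentence: model.score(sentence, bos=True, eos=True)
```

With kenlm installed, candidate_weights(translations, kenlm_scorer("text.binary")) would weight the alternative translations of one source sentence against each other.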

Install and build yasmet

The next step is downloading and compiling yasmet by doing the following:

Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc

To compile and run it, follow the steps below:

  1. g++ -o yasmet yasmet.cc
  2. ./yasmet

Run yasmet-formatter to prepare the yasmet datasets. This also generates the analysis output file, together with the best translations according to the model weights (scored with the language model) in the file modelWeight.txt, and random translations (produced by choosing the applied rule at random from the transfer file) in the file randomWeight.txt.

./yasmet-formatter $localeId sentences.txt lextor.txt transfer.txt weights.txt $outputFile $datasets;


Configure, build and install

cd to apertium-ambiguous before you run the commands shown below:

./autogen.sh
./configure
make

Training and Testing apertium-ambiguous

The compiled program has four modes, selected by the parameters passed to it.

  • Yasmet training mode (training-yasmet). Generates the yasmet models from the yasmet datasets by running the command "./yasmet yasmet_data yasmet_data.model" on every yasmet file in the datasets folder.

./machine-translation

  • Beam search mode. Runs beam search with beam = beam_number on the input file, writes the results to the file "beamResults" and writes the output analysis to the file "output_file_name".

./beam-search localeId sentencesFile lextorFile transferOutFile modelsFolder k1 k2 k3 ...
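
For readers unfamiliar with beam search, the sketch below shows the general technique applied to ambiguous rules. It is a generic illustration, not the implementation in apertium-ambiguous: the function name, the (rule id, weight) representation and the scoring by summed log-weights are all assumptions. Each ambiguous position in a sentence offers several weighted rules, and only the `beam` best partial combinations are kept after each position.

```python
import math
from typing import List, Tuple

Option = Tuple[str, float]  # (rule id, weight in (0, 1])


def beam_search(slots: List[List[Option]],
                beam: int) -> List[Tuple[List[str], float]]:
    """Return up to `beam` best rule combinations with their log-scores."""
    if beam < 1:
        raise ValueError("beam must be at least 1")
    hypotheses = [([], 0.0)]  # (rules chosen so far, sum of log-weights)
    for options in slots:
        # Extend every kept hypothesis with every rule available at this slot.
        expanded = [
            (rules + [rule], score + math.log(weight))
            for rules, score in hypotheses
            for rule, weight in options
        ]
        # Keep only the `beam` highest-scoring partial combinations.
        expanded.sort(key=lambda h: h[1], reverse=True)
        hypotheses = expanded[:beam]
    return hypotheses
```

A larger beam explores more combinations at a higher cost; a beam of 1 reduces to always taking the highest-weighted rule at each position.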


Training should be done by running

  • ./machine-translation input-file output-file

Testing can be done by running

  • ./machine-translation input-file output-file number-of-beam

Here input-file is the source-language text (Kazakh), output-file is the target-language text (Turkish), and number-of-beam is the beam size, e.g. 8 or any other number.

Note: You can find the final result inside results/beamResults.txt.

Enjoy using our project :)