Difference between revisions of "Using weights for ambiguous rules"

==Implementation==

We built a transfer module from the old transfer module and the rest of the Apertium tools: lexical transfer, lexical selection, and the morphological generator. The module is written in C++ and translates text from Kazakh to Turkish. It tries to give the best Turkish translation of the Kazakh input by scoring candidate translations with a language model, training a learner (yasmet) on those scores, and choosing among ambiguous rules with beam search.

=====Step 1=====

We have very large corpora (wiki dumps) of 640 MB and 320 MB of wiki text. Since our application takes a sentence as input, we must split the corpus into sentences. First, we preprocess the corpus (handling sentences that begin with a capital letter and removing Latin-alphabet characters). We then apply a rule-based sentence boundary detection tool called pragmatic_segmenter (https://github.com/diasks2/pragmatic_segmenter/tree/kazakh).

To use pragmatic_segmenter, do the following:

# Install Ruby
# Run <code>gem install pragmatic_segmenter</code>
# Call the segmenter from the code as <code>ruby2.3 kazSentenceTokenizer.rb</code>

=====Step 2=====

First, we take a sentence and pass it through the Apertium tools biltrans and lextor to get a string of tokens (words), each with its translations and part-of-speech tags.

This string is the real input to our program. We first split it into source and target tokens along with their tags, then match these tags with the categories from the transfer file, since these matches let us match the tokens to the rules. Next, we apply the rules to the matched tokens. If different rules apply to one token, that token is ambiguous and we must decide which rule to use. If many tokens are ambiguous, the whole sentence is far more ambiguous, because the number of possible combinations equals the product of the numbers of ambiguous rules for each token.

The output of this step is all the possible combinations of translations of the sentence along with their analysis (the output of the rules).
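The growth in the number of combinations can be illustrated with a minimal Python sketch. The rule names and the grouping of alternatives are hypothetical, and the actual module is written in C++, so this only shows the counting argument, not the module's data structures.

```python
from itertools import product
from math import prod

def all_combinations(groups):
    """Return every way of picking one alternative from each ambiguous group."""
    return list(product(*groups))

# Hypothetical example: three alternative rule sets cover the first words,
# and two alternative rules cover the remaining words.
groups = [
    [("r1",), ("r2", "r5"), ("r3", "r4", "r5")],
    [("r6",), ("r7",)],
]

combinations = all_combinations(groups)
# The number of combinations is the product of the group sizes.
expected_count = prod(len(group) for group in groups)
```

Here `len(combinations)` equals `expected_count`, so every additional ambiguous group multiplies the number of candidate translations.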
=====Step 3=====

After getting all the possible translations for every combination, we score them with a language model and normalize the scores so that they sum to 1. In this project we used the KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/). The language model is applied to the target language, Turkish.

Follow these steps to work with the language model:

# Download and install the kenlm language model from https://kheafield.com/code/kenlm/
# Download a big Turkish corpus from wikidumps: https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2
# Train a 5-gram language model on the Turkish corpus with kenlm, using the following commands:
## Estimating: <code>bin/lmplz -o 5 <text >text.arpa</code>
## Building the binary file used for querying: <code>bin/build_binary text.arpa text.binary</code>
# Make sure you have either python2 or python3, and add this line to the code: <code>python2 $HOME/Normalisek/exampleken1.py <</code>

=====Step 4=====

The data must be put into the format required by the unsupervised machine learner (YASMET). Each dataset belongs to one pattern to which ambiguous rules apply. Its features are the different words matched by that pattern, along with the rule numbers and their precalculated weights. The challenge in producing this format was the need to modify the old code and introduce new data structures into it, which was not an easy task. With that done, the remaining step is to turn the data into a table and feed it to the learner.

=====Step 5=====

Download and compile yasmet as follows:

# Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc
# Compile it with <code>g++ -o yasmet yasmet.cc</code>
# Run it with <code>./yasmet</code>

Learner:

Suppose we have a sentence with the words w1 w2 w3 w4 w5, where the rules r1, r2, r3, r4, r5 apply to w1 w2 w3 as follows:

* r1 applied on => w1 w2 w3
* r2 applied on => w1 w2
* r3 applied on => w1
* r4 applied on => w2
* r5 applied on => w3

and the rules r6, r7 apply to w4 w5 as follows:

* r6 applied on => w4 w5
* r7 applied on => w4 w5

We now have 3*2 = 6 possible translations of that sentence with the ambiguous rules applied, with these normalized scores:

* r1 - r6 => 0.2
* r1 - r7 => 0.1
* r2 - r5 - r6 => 0.1
* r2 - r5 - r7 => 0.3
* r3 - r4 - r5 - r6 => 0.1
* r3 - r4 - r5 - r7 => 0.2

Next we prepare the yasmet files. We first calculate the score of each rule (or rule sequence) applied to the same words by summing the normalized scores of the translations it appears in:

* r1 => 0.2+0.1 = 0.3
* r2-r5 => 0.1+0.3 = 0.4
* r3-r4-r5 => 0.1+0.2 = 0.3
* r6 => 0.2+0.1+0.1 = 0.4
* r7 => 0.1+0.3+0.2 = 0.6

So the yasmet format for the file (r1+r2-r5+r3-r4-r5) is:

<pre>
3
0 $ 0.3 # w1_0:0 w2_1:0 w3_2:0 # w1_0:1 w2_1:1 w3_2:1 # w1_0:2 w2_1:2 w3_2:2 #
1 $ 0.4 # w1_0:0 w2_1:0 w3_2:0 # w1_0:1 w2_1:1 w3_2:1 # w1_0:2 w2_1:2 w3_2:2 #
2 $ 0.3 # w1_0:0 w2_1:0 w3_2:0 # w1_0:1 w2_1:1 w3_2:1 # w1_0:2 w2_1:2 w3_2:2 #
</pre>

And the yasmet format for the file (r6+r7) is:

<pre>
2
0 $ 0.4 # w4_0:0 w5_1:0 # w4_0:1 w5_1:1 #
1 $ 0.6 # w4_0:0 w5_1:0 # w4_0:1 w5_1:1 #
</pre>

We do this for all the sentences, accumulating the yasmet data for each file. At the end we train a model for each file. The model's scores for the rules are then used in the beam search algorithm.

We train the model "r6+r7.model" from the yasmet file "r6+r7" with the command:

<pre>
./yasmet < r6+r7 > r6+r7.model
</pre>

The model "r6+r7.model" would be:

<pre>
@@@CORRECTIVE-FEATURE@@@ 1
w4_0:0 score1
w5_1:0 score2
w4_0:1 score3
w5_1:1 score4
</pre>
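As a rough illustration of the accumulation and formatting just described, here is a minimal Python sketch. It reuses the example's rule names, words and scores; the function names `accumulate` and `yasmet_events` are hypothetical, and the real formatter is the C++ program in the project, which may organise this differently.

```python
from collections import defaultdict

def accumulate(combo_scores):
    """Sum the normalized score of every combination an alternative occurs in."""
    totals = defaultdict(float)
    for alternatives, score in combo_scores:
        for alternative in alternatives:
            totals[alternative] += score
    return dict(totals)

def yasmet_events(words, class_scores):
    """Return one yasmet event line per class for the given source words."""
    n_classes = len(class_scores)
    # Feature "word_i:c" pairs the word at position i with class c.
    features = " # ".join(
        " ".join(f"{word}_{i}:{c}" for i, word in enumerate(words))
        for c in range(n_classes)
    )
    return [f"{c} $ {score:g} # {features} #" for c, score in enumerate(class_scores)]

# Normalized scores of the six candidate translations from the example.
combo_scores = [
    (("r1", "r6"), 0.2), (("r1", "r7"), 0.1),
    (("r2-r5", "r6"), 0.1), (("r2-r5", "r7"), 0.3),
    (("r3-r4-r5", "r6"), 0.1), (("r3-r4-r5", "r7"), 0.2),
]
totals = accumulate(combo_scores)

# The first line of a yasmet file is the number of classes.
lines = ["2"] + yasmet_events(["w4", "w5"], [totals["r6"], totals["r7"]])
print("\n".join(lines))
```

The `:g` format is used only to hide floating-point rounding noise in the summed scores.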
 
 
=====The following steps apply only to test data=====

=====Step 6=====

We took 100 new sentences from Wikipedia (wiki dumps) to test our system. The new data goes through all the steps except the learning step (YASMET). The learner is used only during training.

=====Step 7=====

Applying the beam search algorithm:

Input:

* beam: the beam size
* slTokens: the source word indices
* ambigInfo: a data structure holding all the ambiguous rules with their corresponding word indices
* classesWeights: the yasmet weights, loaded from the model files into a map

Output:

* beamTree: the highest-weighted translations (at most beam size of them) of the given source words. The tree actually holds the indices of the highest-weighted rules along with the sum of their weights.

Algorithm:

* First, we take a set of ambiguous rules applied to some words, then get the weight of these words for every rule from the yasmet weights.
* We build a new tree for these words. The tree is just a vector of vectors of rule indices along with the sum of their weights. For example, suppose that at some iteration we have this set of rules (r for rule and w for word):
** r1 applied on => w1 w2 w3
** r2 applied on => w1 w2
** r3 applied on => w1
** r4 applied on => w2
** r5 applied on => w3
* We then have 3 different translations for these 3 words, and we build the tree as follows:
** r1 : weight1
** r2 - r5 : weight2
** r3 - r4 - r5 : weight3
* Then we expand our beamTree by the number of entries in the new tree and merge the two trees. If the beamTree holds, say, 6 translations, merging it with the tree above expands it to 6*3 = 18 translations.
* Then we sort those 18 translations in descending order of their sum of weights.
* Then we reduce the beamTree to at most beam size translations. If the beam size is 8, we remove the 10 lowest-weighted translations from the tree.
* We continue until all the ambiguous rules are processed. The final output is a tree with no more than beam size translations.
* After that we take only the best translation and output it.
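The expand, sort and prune loop can be sketched in a few lines of Python. This is only an illustration under assumptions: the weights below are made-up numbers standing in for values read from the yasmet models, and the function name and data layout are hypothetical rather than taken from the C++ implementation.

```python
def beam_search(groups, beam):
    """Keep the `beam` best rule sequences while processing each ambiguous group.

    groups: list of groups; each group is a list of (rules, weight) alternatives.
    Returns a list of (rules, weight_sum) sorted from best to worst.
    """
    tree = [((), 0.0)]  # start with one empty hypothesis
    for alternatives in groups:
        # Expand: combine every kept hypothesis with every alternative.
        expanded = [
            (rules + alt_rules, total + weight)
            for rules, total in tree
            for alt_rules, weight in alternatives
        ]
        # Sort in descending order of summed weight, then prune to the beam size.
        expanded.sort(key=lambda item: item[1], reverse=True)
        tree = expanded[:beam]
    return tree

# Hypothetical weights for two groups of ambiguous rules.
groups = [
    [(("r1",), 0.3), (("r2", "r5"), 0.4), (("r3", "r4", "r5"), 0.3)],
    [(("r6",), 0.4), (("r7",), 0.6)],
]
result = beam_search(groups, beam=2)
best_rules, best_weight = result[0]
```

With a beam of 2, only the two best partial sequences survive the first group, so the second group expands 2*2 = 4 hypotheses instead of 3*2 = 6.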
 

Latest revision as of 13:19, 17 May 2019


==The Idea==

The idea is to allow old Apertium transfer rules to be ambiguous, i.e., to allow a set of rules to match the same general input pattern. This is more effective than the existing situation, in which the first rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage, often leading to inaccurate translation. To achieve this, the transfer module uses a set of predefined or pre-trained, more specific weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches multiple transfer rules, the rule with the largest weight for that pattern is applied.

If no weighted pattern matches, the first rule in the XML transfer file that matches the general pattern is still considered the default and is applied.

The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.
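The selection logic described above can be pictured with a minimal Python sketch. The rule names, pattern string and weights are hypothetical, and the real module is implemented in C++, so this is an illustration of the idea rather than the module's code.

```python
from typing import Dict, List, Optional

def choose_rule(ambiguous_rules: List[str],
                pattern_weights: Dict[str, Dict[str, float]],
                matched_pattern: Optional[str]) -> str:
    """Pick the rule to apply for one group of ambiguous rules.

    ambiguous_rules is in transfer-file order, so its first entry is the default.
    pattern_weights maps a specific pattern to {rule name: weight}.
    """
    weights = pattern_weights.get(matched_pattern) if matched_pattern is not None else None
    if not weights:
        # No weighted pattern matched: fall back to the first rule in the file.
        return ambiguous_rules[0]
    # Otherwise apply the rule with the largest weight for that pattern.
    return max(ambiguous_rules, key=lambda rule: weights.get(rule, float("-inf")))

# Hypothetical group of two rules and one weighted pattern.
rules = ["rule-a", "rule-b"]
weights = {"noun adj": {"rule-a": 0.3, "rule-b": 0.7}}
weighted_choice = choose_rule(rules, weights, "noun adj")
default_choice = choose_rule(rules, weights, None)
```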

==Configure, build and install==

<code>cd</code> to <b>apertium-ambiguous</b> before running the commands shown below:

<pre>
./autogen.sh
./configure
make
</pre>

==How to use apertium-ambiguous for your language pair==

For this tutorial, we will be using the language pair apertium-kaz-tur.

===Download a Wikimedia dump===

Download a Wikipedia dump from http://dumps.wikimedia.org:

<pre>
$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
</pre>

To use any other language, simply replace the occurrences of 'kk' with the 2-letter code of your language.

Next, extract the text using the WikiExtractor script:

<pre>
$ git clone https://github.com/apertium/WikiExtractor.git
$ cd WikiExtractor
$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2
</pre>

The extracted text is written to <b>wiki.txt</b> in the current working directory. This file is used in the following steps.

===Install segmenter===

Install the Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh

To use pragmatic_segmenter, do the following:

# Install Ruby (the commands below use <code>ruby2.3</code>), for example by running <code>sudo apt-get install ruby-full</code>
# Run <code>gem install pragmatic_segmenter</code>

The following script uses the segmenter to split a corpus file into sentences and write them to a file. It is <b>sentenceTokenizer.rb</b>, located at https://github.com/sevilaybayatli/apertium-ambiguous/blob/master/scripts

<pre>
require 'pragmatic_segmenter'

# Usage: ruby2.3 sentenceTokenizer.rb langCode inputFile outputFile
File.open(ARGV[1]).each do |line1|
    # Delete characters that make the Apertium tools stop or malfunction.
    line1.delete!('\\\(\)\[\]\{\}\<\>\|\$\/\'\"')
    # Split the cleaned line into sentences for the given language code.
    ps = PragmaticSegmenter::Segmenter.new(text: line1, language: ARGV[0], doc_type: 'txt')
    sentences = ps.segment

    # Append the sentences to the output file, one per line.
    File.open(ARGV[2], "a") do |line2|
        sentences.each { |sentence| line2.puts sentence }
    end
end
</pre>

Break the corpus into sentences with the Ruby program <b>sentenceTokenizer.rb</b>, which is built on pragmatic_segmenter:

<pre>
ruby2.3 sentenceTokenizer.rb $langCode $inputFile sentences.txt

For example:
ruby2.3 sentenceTokenizer.rb kk wiki.txt sentences.txt
</pre>

Here langCode is the language code (<b>kk</b> for Kazakh), inputFile is the <b>Kazakh corpus</b>, and sentences.txt receives the <b>segmented sentences</b>.

===Apertium language pairs modules===

You need Apertium and the language pair installed to use the language modules. The steps below show how to run the Apertium modules to get the output that apertium-ambiguous needs. In the commands, <b>$pairPar</b> is the parent directory of the language pair directory (<b>apertium-kaz-tur</b>); if the pair is in your home directory, this is <b>$HOME</b>. <b>$pairCode</b> is the code of the language pair (<b>kaz-tur</b> here).

To apply the Apertium tool <b>biltrans</b> to the segmented sentences:

<pre>
apertium -d $pairPar/apertium-$pairCode $pairCode-biltrans sentences.txt biltrans.txt

For example
apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans sentences.txt biltrans.txt
</pre>

To apply the Apertium tool <b>lextor</b> to the output of biltrans:

<pre>
lrx-proc -m $pairPar/apertium-$pairCode/$pairCode.autolex.bin biltrans.txt > lextor.txt

For example
lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin biltrans.txt > lextor.txt
</pre>

To run the <b>rules-applier</b> program:

<pre>
./rules-applier localeId transferFile.t1x sentences.txt lextor.txt rulesOut.txt

For example
./rules-applier kk_KZ $HOME/transferFile.t1x sentences.txt lextor.txt rulesOut.txt
</pre>

Here localeId is the ICU locale ID of the source language, sentences.txt holds the source-language sentences, and rulesOut.txt is the output file for the results.

To apply the Apertium tool <b>interchunk</b> to the rulesOut.txt file:

<pre>
apertium-interchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t2x $pairPar/apertium-$pairCode/$pairCode.t2x.bin rulesOut.txt interchunk.txt

For example
apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin rulesOut.txt interchunk.txt
</pre>

To apply the Apertium tool <b>postchunk</b> to the <b>interchunk</b> output file:

<pre>
apertium-postchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t3x $pairPar/apertium-$pairCode/$pairCode.t3x.bin interchunk.txt postchunk.txt

For example
apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin interchunk.txt postchunk.txt
</pre>

To apply the Apertium tool <b>transfer</b> to the <b>postchunk</b> output file:

# INPUT: Output of the postchunk module
# OUTPUT: Morphologically generated sentences in the target language

<pre>
apertium-transfer -n $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t4x $pairPar/apertium-$pairCode/$pairCode.t4x.bin postchunk.txt | lt-proc -g $pairPar/apertium-$pairCode/$pairCode.autogen.bin | lt-proc -p $pairPar/apertium-$pairCode/$pairCode.autopgen.bin > transfer.txt

For example
apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin postchunk.txt | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > transfer.txt
</pre>
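The interchunk and postchunk commands above follow one path pattern, which a small Python helper can make explicit. This is a sketch only: `chunk_command` is a hypothetical helper that builds the command line as a string from the pair's parent directory and pair code, and it does not run anything.

```python
def chunk_command(tool, stage, pair_par, pair_code, infile, outfile):
    """Build an apertium-interchunk / apertium-postchunk command line.

    tool: "apertium-interchunk" or "apertium-postchunk"
    stage: rule file extension used by that tool ("t2x" or "t3x")
    """
    pair_dir = f"{pair_par}/apertium-{pair_code}"
    rules = f"{pair_dir}/apertium-{pair_code}.{pair_code}.{stage}"
    binary = f"{pair_dir}/{pair_code}.{stage}.bin"
    return " ".join([tool, rules, binary, infile, outfile])

# Reproduce the two example commands for the kaz-tur pair in $HOME.
interchunk = chunk_command("apertium-interchunk", "t2x", "$HOME", "kaz-tur",
                           "rulesOut.txt", "interchunk.txt")
postchunk = chunk_command("apertium-postchunk", "t3x", "$HOME", "kaz-tur",
                          "interchunk.txt", "postchunk.txt")
print(interchunk)
print(postchunk)
```

For another language pair, only the pair's parent directory and pair code change.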

===Install and build kenlm===

Download and install kenlm by following the steps under 'USAGE' at https://kheafield.com/code/kenlm/

Download a big Turkish corpus from wikidumps:

<pre>
$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2
</pre>

To train the language model, follow these steps:

# <b>Estimating:</b> run <code>bin/lmplz -o 5 <text >text.arpa</code>
# <b>Querying:</b> run <code>bin/build_binary text.arpa text.binary</code> to build the binary file that is used for querying

The Python script <b>score-sentences.py</b> scores the target-language sentences with the language model. It can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts.

To compute the model weights for the target-language sentences, run:

<pre>
python score-sentences.py arpa_or_binary_LM_file target_lang_file weights_file

For example
python2 score-sentences.py text.binary target-sentences.txt weights.txt
</pre>
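The project normalizes the scores of the candidate translations of a sentence so that they sum to 1. One way this could be done is sketched below in Python, assuming the raw scores are log10 probabilities, as KenLM reports them. The function name is hypothetical, and score-sentences.py may normalize differently.

```python
def normalize_scores(log10_scores):
    """Turn log10 language-model scores into weights that sum to 1."""
    # Subtract the best score first so the powers of ten stay in a safe range.
    top = max(log10_scores)
    probabilities = [10 ** (score - top) for score in log10_scores]
    total = sum(probabilities)
    return [p / total for p in probabilities]

# Hypothetical scores for three candidate translations of one sentence.
weights = normalize_scores([-12.0, -12.0, -13.0])
```

Candidates with equal language-model scores get equal weights, and a candidate that is one log10 unit worse gets a tenth of the weight.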

===Install and build yasmet===

Download and compile yasmet as follows.

Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc

To compile and run it:

# <code>g++ -o yasmet yasmet.cc</code>
# <code>./yasmet</code>

If the compilation doesn't work, try <code>g++ -o yasmet yasmet.cc -std=gnu++98</code>.

Then move the binary into the apertium-ambiguous directory:

<pre>
$ mv yasmet ./apertium-ambiguous
</pre>

Running yasmet-formatter prepares the yasmet datasets. It also generates the analysis output file, along with the best translations according to the model weights (scored with the language model) in the file <b>modelWeight.txt</b>, and random translations (with the applied rule chosen at random from the transfer file) in the file <b>randomWeight.txt</b>.

To run the yasmet-formatter program:

<pre>
./yasmet-formatter $localeId transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt $outputFile $datasets

For example
./yasmet-formatter kk_KZ transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt sentences.out datasets
</pre>

===Training and Testing apertium-ambiguous===

<b>Training</b>

<pre>
./yasmet-formatter icu-locale-id transfer-file-path sentences-file-path lextor-file-path transfer-out-file-path(postchunk2-out) model-weights output-file-path datasets-folder-name

For example
./yasmet-formatter kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt weights.txt test.out datasets
</pre>

<b>Generate-yasmet-models.sh</b>

Generate the yasmet models from the yasmet datasets either with the bash script or manually, by running one of the following commands:

<pre>
bash generate-yasmet-models.sh datasets models
</pre>

or

<pre>
./yasmet < dataset-path > model-path
</pre>

<b>Testing</b>

Run beam search on the sentences file for each given beam size k. The results for each k are written to the file "BeamSearch-k.txt".

<pre>
./beam-search localeId transferFile.t1x sentencesFile lextorFile transferOutFile modelsFolder k1 k2 k3 ...

For example
./beam-search kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt models 2 4 8 10
</pre>

Here test.txt is the test text in the source language (Kazakh), the output files are BeamSearch-2.txt, BeamSearch-4.txt, and so on, and k1 k2 k3 ... are the beam sizes (2 4 8 10 in the example, or any other numbers).

Enjoy using apertium-ambiguous :)