Using weights for ambiguous rules

==The Idea==

The idea is to allow Apertium transfer rules to be ambiguous, i.e. to allow a set of rules to match the same general input pattern. This is more effective than the existing situation, in which the first rule in the XML transfer file takes exclusive precedence and blocks out all of its ambiguous peers during the transfer precompilation stage, often leading to inaccurate translations. To achieve this, the transfer module uses a set of predefined or pre-trained (more specific) weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches multiple transfer rules, the rule with the largest weight for that pattern is applied.

If no weighted pattern matches, the first rule in the XML transfer file that matches the general pattern is still considered the default and is applied.
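To make the selection logic concrete, here is a minimal Python sketch of choosing among ambiguous rules with weighted patterns. The rule names, patterns, and weights are hypothetical illustrations of the idea, not the module's actual data structures or code.

<pre>
# Minimal illustrative sketch of weighted rule selection.
# Rule names, patterns, and weights below are hypothetical.

# Two rules match the same general pattern ("adj noun"); a more
# specific weighted pattern decides which one actually applies.
rules_by_general_pattern = {
    ("adj", "noun"): ["rule-adj-noun-default", "rule-adj-noun-agree"],
}

weighted_patterns = {
    # specific (lemma-level) pattern -> weight per candidate rule
    ("kok", "teniz"): {"rule-adj-noun-agree": 0.8, "rule-adj-noun-default": 0.2},
}

def select_rule(general_pattern, specific_pattern):
    candidates = rules_by_general_pattern[general_pattern]
    weights = weighted_patterns.get(specific_pattern)
    if weights:
        # a weighted pattern matched: apply the rule with the largest weight
        return max(candidates, key=lambda rule: weights.get(rule, 0.0))
    # no weighted pattern matched: fall back to the first (default) rule
    return candidates[0]

print(select_rule(("adj", "noun"), ("kok", "teniz")))    # rule-adj-noun-agree
print(select_rule(("adj", "noun"), ("unseen", "pair")))  # rule-adj-noun-default
</pre>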

The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.

==Configure, build and install==

<code>cd</code> to the <b>apertium-ambiguous</b> directory before you run the commands shown below:

<pre>
./autogen.sh
./configure
make
</pre>

==How to use apertium-ambiguous for your language pair==

For this tutorial, we will be using the language pair apertium-kaz-tur.

===Download a Wikimedia dump===

Download a Wikipedia dump from http://dumps.wikimedia.org:

<pre>
$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
</pre>

To use any other language, simply replace the occurrences of 'kk' with the 2-letter code of your language.

Next, extract the text using the WikiExtractor script:

<pre>
$ git clone https://github.com/apertium/WikiExtractor.git
$ cd WikiExtractor
$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2
</pre>

The extracted text is saved as <b>wiki.txt</b> in the current working directory; it is used in the later steps of the project.

===Install segmenter===

Install the Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh

To use pragmatic_segmenter, you need to do the following:

# Install Ruby 2.3 by running <code>sudo apt-get install ruby-full</code>
# Run <code>gem install pragmatic_segmenter</code>

The Ruby script <b>sentenceTokenizer.rb</b>, located at https://github.com/sevilaybayatli/apertium-ambiguous/blob/master/scripts, uses the segmenter to split a corpus file into sentences and write them to an output file:

<pre>
require 'pragmatic_segmenter'

# ARGV[0]: language code, ARGV[1]: input corpus, ARGV[2]: output file
File.open(ARGV[1]).each do |line1|
  # strip characters that would confuse later processing
  line1.delete! ('\\\(\)\[\]\{\}\<\>\|\$\/\'\"')
  ps = PragmaticSegmenter::Segmenter.new(text: line1, language: ARGV[0], doc_type: 'txt')
  sentences = ps.segment

  # append each segmented sentence to the output file
  File.open(ARGV[2], "a") do |line2|
    sentences.each { |sentence| line2.puts sentence }
  end
end
</pre>

Break the corpus into sentences using the Ruby program <b>sentenceTokenizer.rb</b>, built on the pragmatic segmenter:

<pre>
ruby2.3 sentenceTokenizer.rb $langCode $inputFile sentences.txt
</pre>

For example:

<pre>
ruby2.3 sentenceTokenizer.rb kk wiki.txt sentences.txt
</pre>

Here <code>$langCode</code> is <b>kk</b> for Kazakh, <code>$inputFile</code> is the Kazakh corpus, and <b>sentences.txt</b> receives the segmented sentences.

===Apertium language pairs modules===

You need Apertium and the language pair installed in order to use the language modules. The steps below show how to run the Apertium modules that produce the output used by apertium-ambiguous. <code>$pairPar</code> stands for the path to the parent directory of the Apertium pair (<b>apertium-kaz-tur</b>); if the pair is in your home directory, this is <code>$HOME</code>.

To apply the Apertium tool <b>biltrans</b> to the segmented sentences:

<pre>
apertium -d $pairPar/apertium-$pairCode $pairCode-biltrans sentences.txt biltrans.txt
</pre>

For example:

<pre>
apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans sentences.txt biltrans.txt
</pre>

To apply the Apertium tool <b>lextor</b> to the output of biltrans:

<pre>
lrx-proc -m $pairPar/apertium-$pairCode/$pairCode.autolex.bin inFilePath > outFilePath
</pre>

For example:

<pre>
lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin biltrans.txt > lextor.txt
</pre>

To run the <b>rules-applier</b> program:

<pre>
./rules-applier localeId transferFile.t1x sentences.txt lextor.txt rulesOut.txt
</pre>

For example:

<pre>
./rules-applier kk_KZ $HOME/transferFile.t1x sentences.txt lextor.txt rulesOut.txt
</pre>

Here <code>localeId</code> is the ICU locale ID of the source language, <b>sentences.txt</b> contains the source-language sentences, and <b>rulesOut.txt</b> is the output file of your results.

To apply the Apertium tool <b>interchunk</b> to the rulesOut.txt file:

<pre>
apertium-interchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t2x $pairPar/apertium-$pairCode/$pairCode.t2x.bin rulesOut.txt interchunk.txt
</pre>

For example:

<pre>
apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin rulesOut.txt interchunk.txt
</pre>

To apply the Apertium tool <b>postchunk</b> to the interchunk output file:

<pre>
apertium-postchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t3x $pairPar/apertium-$pairCode/$pairCode.t3x.bin interchunk.txt postchunk.txt
</pre>

For example:

<pre>
apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin interchunk.txt postchunk.txt
</pre>

To apply the Apertium tool <b>transfer</b> to the postchunk output file:

# INPUT: output of the postchunk module
# OUTPUT: morphologically generated sentences in the target language

<pre>
apertium-transfer -n $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t4x $pairPar/apertium-$pairCode/$pairCode.t4x.bin postchunk.txt | lt-proc -g $pairPar/apertium-$pairCode/$pairCode.autogen.bin | lt-proc -p $pairPar/apertium-$pairCode/$pairCode.autopgen.bin > transfer.txt
</pre>

For example:

<pre>
apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin postchunk.txt | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > transfer.txt
</pre>
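If you would rather drive these steps from a single script, a small wrapper can run the same commands in order. This is only a convenience sketch, not part of apertium-ambiguous: it assumes the pair is installed at $HOME/apertium-kaz-tur, that it is run from the apertium-ambiguous directory, and that the file names match the steps above.

<pre>
#!/usr/bin/env python3
# Hypothetical convenience wrapper: run the biltrans -> lextor ->
# rules-applier -> interchunk -> postchunk -> transfer steps described
# above via subprocess. Assumes the pair lives at $HOME/apertium-kaz-tur.
import os
import subprocess

PAIR = os.path.expanduser("~/apertium-kaz-tur")

def run(cmd, stdin=None, stdout=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, stdin=stdin, stdout=stdout, check=True)

run(["apertium", "-d", PAIR, "kaz-tur-biltrans", "sentences.txt", "biltrans.txt"])

with open("biltrans.txt") as fin, open("lextor.txt", "w") as fout:
    run(["lrx-proc", "-m", f"{PAIR}/kaz-tur.autolex.bin"], stdin=fin, stdout=fout)

run(["./rules-applier", "kk_KZ", "transferFile.t1x",
     "sentences.txt", "lextor.txt", "rulesOut.txt"])

run(["apertium-interchunk", f"{PAIR}/apertium-kaz-tur.kaz-tur.t2x",
     f"{PAIR}/kaz-tur.t2x.bin", "rulesOut.txt", "interchunk.txt"])

run(["apertium-postchunk", f"{PAIR}/apertium-kaz-tur.kaz-tur.t3x",
     f"{PAIR}/kaz-tur.t3x.bin", "interchunk.txt", "postchunk.txt"])

# The final transfer step is a shell pipeline, so run it through the shell.
subprocess.run(
    f"apertium-transfer -n {PAIR}/apertium-kaz-tur.kaz-tur.t4x "
    f"{PAIR}/kaz-tur.t4x.bin postchunk.txt "
    f"| lt-proc -g {PAIR}/kaz-tur.autogen.bin "
    f"| lt-proc -p {PAIR}/kaz-tur.autopgen.bin > transfer.txt",
    shell=True, check=True)
</pre>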

===Install and build kenlm===

Download and install kenlm by following the steps under 'USAGE' at https://kheafield.com/code/kenlm/

Download a large Turkish corpus from the Wikimedia dumps:

<pre>
$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2
</pre>

For training, you should follow these steps:

# <b>Estimating:</b> run <code>bin/lmplz -o 5 <text >text.arpa</code>
# <b>Querying:</b> run <code>bin/build_binary text.arpa text.binary</code>


The Python script <b>score-sentences.py</b>, which can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts, scores the target-language sentences with the language model.

Run the model-weighting program on the transfer output with this command:

<pre>
python score-sentences.py arpa_or_binary_LM_file target_lang_file weights_file
</pre>

For example:

<pre>
python2 score-sentences.py text.binary target-sentences.txt weights.txt
</pre>
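Under the hood, this kind of scoring amounts to querying the language model once per target sentence. As a rough illustration (a sketch, not the actual score-sentences.py), the kenlm Python module, installable with <code>pip install kenlm</code>, can be used like this:

<pre>
# Sketch of language-model scoring with the kenlm Python module
# (pip install kenlm). Illustrative only; not the real score-sentences.py.
import sys

import kenlm

lm_file, target_file, weights_file = sys.argv[1:4]
model = kenlm.Model(lm_file)  # accepts ARPA or binary LM files

with open(target_file) as fin, open(weights_file, "w") as fout:
    for sentence in fin:
        # total log10 probability of the sentence, with sentence
        # boundary markers added at both ends
        score = model.score(sentence.strip(), bos=True, eos=True)
        fout.write(f"{score}\n")
</pre>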

===Install and build yasmet===

To install yasmet, do the following.

Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc

To build and compile, follow the steps below:

# <code>g++ -o yasmet yasmet.cc</code>
# <code>./yasmet</code>

If the compilation fails, try <code>g++ -o yasmet yasmet.cc -std=gnu++98</code> instead.

Move the resulting binary into the apertium-ambiguous directory:

<pre>
$ mv yasmet ./apertium-ambiguous
</pre>

Run <b>yasmet-formatter</b> to prepare the yasmet datasets. This also generates the analysis output file, along with the best model-weighted translations (scored with the language model) in <b>modelWeight.txt</b> and random translations (applying a randomly chosen rule from the transfer file) in <b>randomWeight.txt</b>.

To apply the <b>yasmet-formatter</b> program:

<pre>
./yasmet-formatter $localeId transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt $outputFile $datasets
</pre>

For example:

<pre>
./yasmet-formatter kk_KZ transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt sentences.out datasets
</pre>

===Training and Testing apertium-ambiguous===

<b>Training</b>


<pre>
./yasmet-formatter icu-locale-id transfer-file-path sentences-file-path lextor-file-path transfer-out-file-path(postchunk2-out) model-weights output-file-path datasets-folder-name
</pre>

For example:

<pre>
./yasmet-formatter kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt weights.txt test.out datasets
</pre>


<b>generate-yasmet-models.sh</b>

Generate the yasmet models from the yasmet datasets either by using the bash script or manually, by running one of the following commands.

Either

<pre>
bash generate-yasmet-models.sh datasets models
</pre>

or

<pre>
./yasmet < dataset-path > model-path
</pre>
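The per-dataset command is also easy to script yourself. The sketch below mirrors what generate-yasmet-models.sh is described as doing, one <code>./yasmet</code> run per dataset file; the models/ directory layout and the .model suffix are assumptions made for illustration.

<pre>
# Sketch mirroring generate-yasmet-models.sh: run ./yasmet once per
# dataset file, writing models/NAME.model for datasets/NAME.
# Directory layout and the .model suffix are assumed, not prescribed.
import os
import subprocess

os.makedirs("models", exist_ok=True)
for name in os.listdir("datasets"):
    with open(os.path.join("datasets", name)) as fin, \
         open(os.path.join("models", name + ".model"), "w") as fout:
        subprocess.run(["./yasmet"], stdin=fin, stdout=fout, check=True)
</pre>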

<b>Testing</b>

Run the beam search with beam size k on the sentences file; the results for each k are written to the file <b>BeamSearch-k.txt</b>.


<pre>
./beam-search localeId transferFile.t1x sentencesFile lextorFile transferOutFile modelsFolder k1 k2 k3 ...
</pre>

For example:

<pre>
./beam-search kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt models 2 4 8 10
</pre>


Here <b>test.txt</b> is the source-language (Kazakh) test text, the output files are <b>BeamSearch-2.txt</b>, <b>BeamSearch-4.txt</b>, and so on, and the beam sizes k (here 2 4 8 10) can be any numbers.
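For background, beam search keeps only the k best-scoring combinations of ambiguous rule choices as it advances through a sentence, instead of enumerating every combination. The following generic sketch illustrates that pruning idea with toy rule ids and scores; it is not the beam-search binary's actual code.

<pre>
# Generic beam-search sketch: keep only the k best-scoring sequences of
# rule choices per sentence. Rule ids and the scoring function are toys.

def beam_search(choices_per_position, score, k):
    """choices_per_position: candidate rule ids at each ambiguous position;
    score: maps a partial sequence of choices to a float (higher is better)."""
    beam = [()]  # start from the empty sequence of choices
    for choices in choices_per_position:
        expanded = [seq + (choice,) for seq in beam for choice in choices]
        beam = sorted(expanded, key=score, reverse=True)[:k]  # prune to k best
    return beam

# Toy example: three ambiguous positions with two candidate rules each.
positions = [["r1", "r2"], ["r1", "r2"], ["r1", "r2"]]
toy_score = lambda seq: 0.7 * seq.count("r1") + 0.3 * seq.count("r2")
print(beam_search(positions, toy_score, k=2))
</pre>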

Enjoy using apertium-ambiguous :)

[[Category:Documentation in English]]