Using weights for ambiguous rules

Revision as of 18:57, 12 November 2018


The Idea

The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e. to allow a set of rules to match the same general input pattern. This contrasts with the existing situation, in which the first matching rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage. To decide which rule applies, the transfer module uses a set of predefined or pre-trained (more specific) weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches multiple transfer rules, the rule with the highest weight for that pattern is applied.

The first rule in the XML transfer file that matches the general pattern is still considered the default one and is applied if no weighted pattern matches.
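The selection logic described above can be sketched in a few lines of Ruby. This is only an illustration under assumed names: `RuleGroup`, `select_rule`, the rule ids and the pattern strings are hypothetical and do not come from apertium-ambiguous, whose internal data structures are not described here.

```ruby
# Hypothetical data model: a group of ambiguous rules in transfer-file order,
# plus weighted patterns mapping a specific pattern to per-rule weights.
RuleGroup = Struct.new(:rule_ids, :weighted_patterns)

# Return the id of the rule to apply for a specific input pattern.
def select_rule(group, pattern)
  weights = group.weighted_patterns[pattern]
  # No weighted pattern matched: fall back to the first (default) rule.
  return group.rule_ids.first if weights.nil? || weights.empty?

  # Otherwise pick the rule with the highest weight for this pattern.
  weights.max_by { |_rule_id, weight| weight }.first
end

group = RuleGroup.new(
  %w[rule-1 rule-2 rule-3],
  { 'adj n<pl>' => { 'rule-1' => 0.2, 'rule-3' => 0.7 } }
)

puts select_rule(group, 'adj n<pl>') # a weighted pattern matches
puts select_rule(group, 'adj n<sg>') # nothing matches, default rule is used
```

The first call picks the rule with the highest weight for the matched pattern; the second falls back to the first rule in the group.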

The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.

How to use apertium-ambiguous for your language pair

Download a Wikimedia dump

Download a Wikipedia dump from http://dumps.wikimedia.org:

$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2

Next, extract the text using the WikiExtractor script:

$ git clone https://github.com/apertium/WikiExtractor.git
$ cd WikiExtractor
$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2  

Copy the wiki.txt file that has just been extracted into the project directory.

Install segmenter

Install the Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh

To use pragmatic_segmenter, do the following:

  1. Install Ruby (the original setup used Ruby 2.3) by running sudo apt-get install ruby-full
  2. Run gem install pragmatic_segmenter

The following script uses the segmenter to split a corpus file into sentences and write them to an output file. In kazSentenceTokenizer.rb, change the two-letter code of the source language to the language desired. Here "kk" is the code for Kazakh.

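require 'pragmatic_segmenter'

# Usage: ruby kazSentenceTokenizer.rb input_file output_file
# Segments every line of input_file and appends the sentences to output_file.
File.open(ARGV[1], 'a') do |out|
  File.foreach(ARGV[0]) do |line|
    # 'kk' is the language code for Kazakh; change it for other languages.
    ps = PragmaticSegmenter::Segmenter.new(text: line, language: 'kk', doc_type: 'txt')
    ps.segment.each { |sentence| out.puts sentence }
  end
end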

Install and build kenlm

Download and install kenlm from https://kheafield.com/code/kenlm/

Download a large Turkish corpus from the Wikimedia dumps:

$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2

To train the language model, follow these steps:

  1. Estimate the model by running bin/lmplz -o 5 <text >text.arpa
  2. Convert it to a binary file for querying by running bin/build_binary text.arpa text.binary
  3. Put text.binary inside the scripts subdirectory

The Python scripts (exampleken1, kenlm.pyx, genalltra.py) used to score sentences can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts. These scripts run automatically.

Install and build yasmet

The next step is to download and compile yasmet:

Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc

To build and run it, follow the steps below:

  1. g++ -o yasmet yasmet.cc
  2. ./yasmet

Apertium language pairs modules

You need Apertium and the language pair installed in order to use the language modules inside the code. The steps below only show how the remaining Apertium modules are used inside the code.

In CLExec.cpp, change the language pair file name to the desired pair in the paths of the Apertium tools (biltrans, lextor, interchunk, postchunk, transfer). The paths themselves can also be changed. Here the pair is kaz-tur and the path is the home directory ($HOME).

Apply the Apertium tool "biltrans" to the segmented sentences.

apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans input_file output_file

Apply the Apertium tool "lextor" to the output of biltrans.

lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin inFilePath > outFilePath

Apply the Apertium tool "interchunk" to that file.

apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin input_file output_file

Apply the Apertium tool "postchunk" to the "interchunk" output file.

apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin input_file output_file

Apply the Apertium tool "transfer" to the "postchunk" output file.

apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin input_file \
  | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin \
  | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > output_file
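To see which parts of these commands depend on the language pair, the following Ruby sketch rebuilds them from a pair name and a base directory. The helper `pipeline_commands` is hypothetical and is not part of CLExec.cpp; it assumes every pair follows the same file naming as kaz-tur, which may not hold for your pair.

```ruby
# Build the command lines shown above for a given pair (e.g. 'kaz-tur').
# Hypothetical helper for illustration; file names mirror the kaz-tur example.
def pipeline_commands(pair, base_dir = Dir.home)
  dir = File.join(base_dir, "apertium-#{pair}")
  rules = "#{dir}/apertium-#{pair}.#{pair}" # prefix of the .t2x/.t3x/.t4x files
  bin = "#{dir}/#{pair}"                    # prefix of the compiled .bin files
  {
    biltrans: "apertium -d #{dir} #{pair}-biltrans input_file output_file",
    lextor: "lrx-proc -m #{bin}.autolex.bin input_file > output_file",
    interchunk: "apertium-interchunk #{rules}.t2x #{bin}.t2x.bin input_file output_file",
    postchunk: "apertium-postchunk #{rules}.t3x #{bin}.t3x.bin input_file output_file",
    transfer: "apertium-transfer -n #{rules}.t4x #{bin}.t4x.bin input_file" \
              " | lt-proc -g #{bin}.autogen.bin" \
              " | lt-proc -p #{bin}.autopgen.bin > output_file"
  }
end

# Print the commands for the pair used on this page.
pipeline_commands('kaz-tur').each { |step, cmd| puts "#{step}: #{cmd}" }
```

Only the pair name and the base directory vary, which matches the two things the text says to change in CLExec.cpp.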

Configure, build and install

cd to the project directory (apertium-kaz-tur-mt) before you run the commands shown below:

./autogen.sh
./configure
make

Training and Testing apertium-ambiguous

The compiled program has four modes, selected by the parameters passed to it.

  • Yasmet dataset mode (with output file). Processes the input wiki file, generates its yasmet datasets and writes the output (analysis) of that input file.

./machine-translation input_file_name output_file_name

  • Yasmet dataset mode (without output file). Processes the input wiki file and generates its yasmet datasets, but does not write the output (analysis) of that input file.

./machine-translation input_file_name

  • Yasmet training models mode. Generates the yasmet models from the yasmet datasets by running the command "./yasmet yasmet_data yasmet_data.model" on every yasmet file in the datasets folder.

./machine-translation

  • Beam search mode. Runs beam search with beam = beam_number on the input file, writing its results to the file "beamResults" and the output analysis to the file "output_file_name".

./machine-translation input_file_name output_file_name beam_number
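As a rough picture of what the training models mode does, the Ruby sketch below lists one yasmet command per dataset file, using the command form quoted above. The helper name `yasmet_training_commands` and the assumption that every non-".model" file in the folder is a dataset are illustrative only; the program does this itself, so you do not need this script.

```ruby
# Sketch only: build the yasmet command for each dataset file in a folder.
# Assumes every regular file that is not already a ".model" file is a dataset.
def yasmet_training_commands(datasets_dir)
  data_files = Dir.glob(File.join(datasets_dir, '*')).sort.select do |path|
    File.file?(path) && !path.end_with?('.model')
  end
  # Same shape as the command quoted above: ./yasmet <data> <data>.model
  data_files.map { |data| ['./yasmet', data, "#{data}.model"] }
end

# Print the commands; swapping `puts` for `system(*cmd)` would run them.
yasmet_training_commands('datasets').each { |cmd| puts cmd.join(' ') }
```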


Training should be done by running

  • ./machine-translation input-file output-file

Testing can be done by running

  • ./machine-translation input-file output-file number-of-beam

Here input-file is the source-language (Kazakh) file, output-file is the target-language (Turkish) file, and number-of-beam is the beam size, for example 8.
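For readers unfamiliar with the beam parameter, here is a generic beam search sketch in Ruby. It is not the project's implementation: the function `beam_search`, the rule labels and the additive log-scores are assumptions chosen only to show how a beam size limits the number of rule combinations kept at each step.

```ruby
# Generic beam search over per-position candidates.
# choices: one entry per ambiguous position, each a list of [label, log_score].
# Returns up to beam_size hypotheses as [labels, total_score], best first.
def beam_search(choices, beam_size)
  beam = [[[], 0.0]] # start with one empty hypothesis
  choices.each do |candidates|
    # Extend every kept hypothesis with every candidate at this position.
    expanded = beam.flat_map do |labels, score|
      candidates.map { |label, s| [labels + [label], score + s] }
    end
    # Keep only the beam_size highest-scoring hypotheses.
    beam = expanded.max_by(beam_size) { |_labels, score| score }
  end
  beam
end

# Two ambiguous positions with two hypothetical candidate rules each.
choices = [
  [['rule-1', -0.1], ['rule-2', -1.2]],
  [['rule-3', -0.7], ['rule-4', -0.3]]
]
beam_search(choices, 2).each { |labels, score| puts "#{labels.join(' ')} #{score.round(2)}" }
```

A larger beam keeps more alternatives alive at each position at the cost of more scoring work; a beam of 1 reduces to greedy selection.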

Note: You can find the final result inside results/beamResults.txt.

Enjoy using our project :)
