Using weights for ambiguous rules

From Apertium

Revision as of 09:39, 31 October 2018


The Idea

The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e., to allow a set of rules to match the same general input pattern. In the existing situation, the first matching rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage. To decide which rule applies, the transfer module uses a set of predefined or pretrained, more specific, weighted patterns provided for each group of ambiguous rules. If a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in the XML transfer file that matches the general pattern is still considered the default one and is applied if no weighted pattern matches.
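
The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in apertium-kaz-tur-mt: the function name, the rule identifiers and the dictionary layout of the weights are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def choose_rule(
    ambiguous_rules: List[str],
    pattern_weights: Dict[str, Dict[str, float]],
    matched_pattern: Optional[str],
) -> str:
    """Pick one rule from a group of rules matching the same general pattern.

    ambiguous_rules: rule ids in transfer-file order (hypothetical ids).
    pattern_weights: specific pattern -> {rule id -> weight} (assumed layout).
    matched_pattern: the specific weighted pattern that matched, or None.
    """
    weights = pattern_weights.get(matched_pattern) if matched_pattern else None
    if not weights:
        # No weighted pattern matched: fall back to the default rule,
        # i.e. the first one listed in the transfer file.
        return ambiguous_rules[0]
    # Otherwise apply the rule with the highest weight for that pattern.
    return max(ambiguous_rules, key=lambda rule: weights.get(rule, float("-inf")))
```

For example, with rules ["rule-a", "rule-b"] and a weighted pattern that gives rule-b the higher weight, rule-b is chosen when that pattern matches and rule-a (the default) is chosen otherwise.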

apertium-kaz-tur-mt is hosted on GitHub: https://github.com/sevilaybayatli/apertium-kaz-tur-mt

How to use apertium-kaz-tur-mt for your language pair

Downloading wikimedia dump

Download a Wikipedia dump from http://dumps.wikimedia.org, for example the Kazakh one:

$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2

Extract the text using WikiExtractor:

$ wget https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

$ python3 WikiExtractor.py --infn kkwiki-latest-pages-articles.xml.bz2  
  • Copy the extracted wiki.txt into the project directory.

Install segmenter

Install Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh

To use pragmatic_segmenter:

  1. Install Ruby 2.3.
  2. Run gem install pragmatic_segmenter

The following script uses the segmenter to split a corpus file into sentences and writes them, one per line, to an output file:

require 'pragmatic_segmenter'

# ARGV[0] is the input corpus, ARGV[1] is the output file (opened once, in append mode).
File.open(ARGV[1], 'a') do |out|
  # Segment the corpus line by line and write one sentence per output line.
  File.foreach(ARGV[0]) do |line|
    ps = PragmaticSegmenter::Segmenter.new(text: line, language: 'kk', doc_type: 'txt')
    ps.segment.each { |sentence| out.puts sentence }
  end
end

Install and build kenlm

Download and install kenlm from https://kheafield.com/code/kenlm/

Download a large Turkish corpus from the Wikimedia dumps:

$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2

To train the language model, follow these steps:

  1. Estimate the model by running bin/lmplz -o 5 <text >text.arpa
  2. Generate the binary file used for querying with bin/build_binary text.arpa text.binary
  3. Add the path of text.binary inside exampleken1
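
The first two steps above can also be driven from Python. The sketch below only wraps the two commands shown in the list; the function names and the idea of passing the kenlm bin directory as a parameter are assumptions for illustration, and the repository may organise this differently.

```python
import subprocess
from pathlib import Path
from typing import List


def lmplz_command(kenlm_bin: Path, order: int = 5) -> List[str]:
    """Step 1: lmplz reads the corpus on stdin and writes the ARPA file to stdout."""
    return [str(kenlm_bin / "lmplz"), "-o", str(order)]


def build_binary_command(kenlm_bin: Path, arpa: Path, binary: Path) -> List[str]:
    """Step 2: convert the ARPA file into the binary file used for querying."""
    return [str(kenlm_bin / "build_binary"), str(arpa), str(binary)]


def train(kenlm_bin: Path, corpus: Path, arpa: Path, binary: Path, order: int = 5) -> None:
    """Run both steps in sequence; raises CalledProcessError if either tool fails."""
    with open(corpus, "rb") as src, open(arpa, "wb") as dst:
        subprocess.run(lmplz_command(kenlm_bin, order), stdin=src, stdout=dst, check=True)
    subprocess.run(build_binary_command(kenlm_bin, arpa, binary), check=True)
```

Calling train(Path("bin"), Path("text"), Path("text.arpa"), Path("text.binary")) from the kenlm build directory would correspond to the two commands in the list.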

The Python scripts used to score sentences (exampleken1, kenlm.pyx, genalltra.py) are at https://github.com/sevilaybayatli/apertium-kaz-tur-mt/tree/master/scripts. These scripts run automatically.
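
To give a rough idea of what scoring sentences with the trained model looks like, here is a minimal sketch. It is not the content of the repository's scripts: rank_candidates and kenlm_scorer are hypothetical helpers, and the only kenlm calls used are kenlm.Model(path) and Model.score(sentence, bos=True, eos=True) from the kenlm Python module, which returns a log10 probability.

```python
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    candidates: Iterable[str], score: Callable[[str], float]
) -> List[Tuple[float, str]]:
    """Return (score, sentence) pairs sorted from best to worst score."""
    scored = [(score(sentence), sentence) for sentence in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def kenlm_scorer(binary_path: str) -> Callable[[str], float]:
    """Build a scoring function backed by a kenlm binary model such as text.binary."""
    import kenlm  # imported lazily so rank_candidates works without kenlm installed

    model = kenlm.Model(binary_path)
    # Higher (less negative) log10 probability means a more fluent sentence.
    return lambda sentence: model.score(sentence, bos=True, eos=True)
```

With kenlm installed, rank_candidates(translations, kenlm_scorer("text.binary")) would put the candidate translation the language model considers most fluent first.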

Install and build yasmet

The next step is to download and compile yasmet.

Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc

To build and run it, follow the steps below:

  1. Compile with g++ -o yasmet yasmet.cc
  2. Run with ./yasmet

Apertium language pairs modules

You need Apertium and the language pair installed, because the code calls the language pair's modules. The steps below only show how these Apertium modules are used inside the code.

Applying apertium tool "biltrans" on the segmented sentences.

apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans  input_file output_file

Applying the lexical selection step ("lextor", run here with lrx-proc) on the output of the biltrans.

lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin inFilePath > outFilePath

Applying apertium tool "interchunk" to that file.

apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin input_file output_file

Applying apertium tool "postchunk" to the "interchunk" output file.

apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin input_file output_file
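
The interchunk and postchunk commands follow the same shape (tool, rule file, compiled rule file, input, output), so they can be built by one helper. The sketch below is an assumption about how one might wrap them in Python; stage_command and run_chunk_stages are hypothetical names, and the file paths simply mirror the two commands shown above.

```python
import subprocess
from typing import List


def stage_command(
    tool: str, pair_dir: str, ext: str, input_file: str, output_file: str
) -> List[str]:
    """Build the argument list for one chunk-level stage (ext is e.g. 't2x' or 't3x')."""
    rules = f"{pair_dir}/apertium-kaz-tur.kaz-tur.{ext}"
    compiled = f"{pair_dir}/kaz-tur.{ext}.bin"
    return [tool, rules, compiled, input_file, output_file]


def run_chunk_stages(
    pair_dir: str, lextor_output: str, interchunk_output: str, postchunk_output: str
) -> None:
    """Run interchunk on the lextor output, then postchunk on the interchunk output."""
    subprocess.run(
        stage_command("apertium-interchunk", pair_dir, "t2x", lextor_output, interchunk_output),
        check=True,
    )
    subprocess.run(
        stage_command("apertium-postchunk", pair_dir, "t3x", interchunk_output, postchunk_output),
        check=True,
    )
```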

Applying apertium tool "transfer" to the "postchunk" output file.

apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin input_file | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > output_file

Configure, build and install

  1. ./autogen.sh
  2. ./configure
  3. make

Training and Testing apertium-kaz-tur-mt

Training is done with:

  • ./machine-translation input-file output-file

Testing is done with:

  • ./machine-translation input-file output-file number-of-beam

Here input-file is the source-language (Kazakh) file, output-file is the target-language (Turkish) file, and number-of-beam is the beam size (8 or any other number).

Note: You can find the final result inside results/beamResults.txt.

Enjoy using our project :)