Using weights for ambiguous rules
Latest revision as of 13:19, 17 May 2019
The Idea
The idea is to allow existing Apertium transfer rules to be ambiguous, i.e. to allow a set of rules to match the same general input pattern. In the existing situation, the first rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage, which often leads to inaccurate translation. To avoid this, the transfer module would use a set of predefined or pre-trained (more specific) weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches multiple transfer rules, the rule with the largest weight for that pattern is applied.
If no weighted pattern matches, the first rule in the XML transfer file that matches the general pattern is still treated as the default and is applied.
The module (apertium-ambiguous) can be found at https://github.com/sevilaybayatli/apertium-ambiguous.
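The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from apertium-ambiguous: the function name choose_rule, the rule ids, and the dictionary layout of the weights are all hypothetical.

```python
from typing import Dict, List, Optional


def choose_rule(matching_rules: List[str],
                pattern_weights: Optional[Dict[str, float]]) -> str:
    """Pick one rule from a group of ambiguous rules.

    matching_rules: rule ids in the order they appear in the transfer file.
    pattern_weights: weights for the matched specific pattern, or None
    when no weighted pattern matched (hypothetical data layout).
    """
    if not matching_rules:
        raise ValueError("no rule matches the input pattern")
    if not pattern_weights:
        # No weighted pattern matched: fall back to the first rule in file order.
        return matching_rules[0]
    # A rule without a weight for this pattern never beats a weighted one.
    # max() returns the first maximal element, so ties go to file order.
    return max(matching_rules,
               key=lambda rule: pattern_weights.get(rule, float("-inf")))


# Weighted pattern matched: the heaviest rule wins.
print(choose_rule(["rule-1", "rule-2"], {"rule-1": 0.2, "rule-2": 0.8}))
# No weighted pattern matched: the first rule is the default.
print(choose_rule(["rule-1", "rule-2"], None))
```

Keeping the first-rule fallback inside the same function means a pair without any trained weights behaves exactly as it does today.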
Configure, build and install
cd to apertium-ambiguous before you run the commands shown below:
./autogen.sh
./configure
make
How to use apertium-ambiguous for your language pair
For this tutorial, we will be using the language pair apertium-kaz-tur.
Download a Wikimedia dump
Download a Wikipedia dump from http://dumps.wikimedia.org:
$ wget https://dumps.wikimedia.org/kkwiki/latest/kkwiki-latest-pages-articles.xml.bz2
To use any other language, simply replace the occurrences of 'kk' with the 2-letter code of your language.
Next, extract the text using the WikiExtractor script:
$ git clone https://github.com/apertium/WikiExtractor.git
$ cd WikiExtractor
$ python3 WikiExtractor.py --infn ../kkwiki-latest-pages-articles.xml.bz2
The extracted text is written to wiki.txt in the current working directory. The later steps of the project use this file.
Install segmenter
Install Kazakh segmenter from https://github.com/diasks2/pragmatic_segmenter/tree/kazakh
To use pragmatic_segmenter, do the following steps:
- Install Ruby 2.3 by running sudo apt-get install ruby-full
- Run gem install pragmatic_segmenter
The script sentenceTokenizer.rb, located at https://github.com/sevilaybayatli/apertium-ambiguous/blob/master/scripts, uses the segmenter to segment a corpus file and write the segmented sentences to a file:
require 'pragmatic_segmenter'
# ARGV[0] = language code, ARGV[1] = input corpus, ARGV[2] = output file
File.open(ARGV[1]).each do |line1|
  # Strip brackets, quotes and other symbols before segmenting
  line1.delete!('\\\(\)\[\]\{\}\<\>\|\$\/\'\"')
  ps = PragmaticSegmenter::Segmenter.new(text: line1, language: ARGV[0], doc_type: 'txt')
  sentences = ps.segment
  # Append one sentence per line to the output file
  File.open(ARGV[2], "a") do |line2|
    sentences.each { |sentence| line2.puts sentence }
  end
end
Break the corpus into sentences using the Ruby program sentenceTokenizer.rb, which is built on the pragmatic segmenter:
ruby2.3 sentenceTokenizer.rb $langCode $inputFile sentences.txt
For example:
ruby2.3 sentenceTokenizer.rb kk wiki.txt sentences.txt
Here kk is the language code for Kazakh, inputFile is the Kazakh corpus, and sentences.txt receives the segmented sentences.
Apertium language pairs modules
You need Apertium and the language pair installed to use the language modules. The steps below show how to run the Apertium modules to get the output that apertium-ambiguous will use. In the commands, $pairPar is the parent directory of the language pair directory (apertium-kaz-tur) and $pairCode is the pair code (kaz-tur). If the pair directory is in your home directory, $pairPar is $HOME.
To apply the apertium tool biltrans on the segmented sentences:
apertium -d $pairPar/apertium-$pairCode $pairCode-biltrans sentences.txt biltrans.txt
For example:
apertium -d $HOME/apertium-kaz-tur kaz-tur-biltrans sentences.txt biltrans.txt
To apply the apertium tool lextor to the biltrans output:
lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin inFilePath > outFilePath
For example:
lrx-proc -m $HOME/apertium-kaz-tur/kaz-tur.autolex.bin biltrans.txt > lextor.txt
To run the rules-applier program:
./rules-applier localeId transferFile.t1x sentences.txt lextor.txt rulesOut.txt
For example:
./rules-applier kk_KZ $HOME/transferFile.t1x sentences.txt lextor.txt rulesOut.txt
Here localeId is the ICU locale ID of the source language, sentences.txt contains the source-language sentences, and rulesOut.txt is the output file for the results.
To apply the apertium tool interchunk to the rulesOut.txt file:
apertium-interchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t2x $pairPar/apertium-$pairCode/$pairCode.t2x.bin rulesOut.txt interchunk.txt
For example:
apertium-interchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t2x $HOME/apertium-kaz-tur/kaz-tur.t2x.bin rulesOut.txt interchunk.txt
To apply the apertium tool postchunk to the interchunk output file:
apertium-postchunk $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t3x $pairPar/apertium-$pairCode/$pairCode.t3x.bin interchunk.txt postchunk.txt
For example:
apertium-postchunk $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t3x $HOME/apertium-kaz-tur/kaz-tur.t3x.bin interchunk.txt postchunk.txt
To apply the apertium tool transfer to the postchunk output file:
- INPUT: Output of the postchunk module
- OUTPUT: Morphologically generated sentences in the target language
apertium-transfer -n $pairPar/apertium-$pairCode/apertium-$pairCode.$pairCode.t4x $pairPar/apertium-$pairCode/$pairCode.t4x.bin postchunk.txt | lt-proc -g $pairPar/apertium-$pairCode/$pairCode.autogen.bin | lt-proc -p $pairPar/apertium-$pairCode/$pairCode.autopgen.bin > transfer.txt
For example:
apertium-transfer -n $HOME/apertium-kaz-tur/apertium-kaz-tur.kaz-tur.t4x $HOME/apertium-kaz-tur/kaz-tur.t4x.bin postchunk.txt | lt-proc -g $HOME/apertium-kaz-tur/kaz-tur.autogen.bin | lt-proc -p $HOME/apertium-kaz-tur/kaz-tur.autopgen.bin > transfer.txt
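Since the same paths recur in every step, it may help to generate the command lines for a given pair. The Python sketch below only builds the command strings from this section and does not run anything. The function name pipeline_commands and its parameters are hypothetical, and the paths are expected to be already expanded (for example /home/user rather than $HOME), because shlex.quote would otherwise prevent shell expansion.

```python
import shlex
from typing import List


def pipeline_commands(pair_par: str, pair_code: str,
                      locale_id: str, t1x_file: str) -> List[str]:
    """Return the commands of this section, in order, for one language pair."""
    q = shlex.quote
    pair_dir = f"{pair_par}/apertium-{pair_code}"
    src = f"{pair_dir}/apertium-{pair_code}.{pair_code}"  # prefix of .t2x/.t3x/.t4x
    bin_ = f"{pair_dir}/{pair_code}"                      # prefix of compiled .bin files
    return [
        # biltrans on the segmented sentences
        f"apertium -d {q(pair_dir)} {q(pair_code + '-biltrans')} sentences.txt biltrans.txt",
        # lextor on the biltrans output
        f"lrx-proc -m {q(bin_ + '.autolex.bin')} biltrans.txt > lextor.txt",
        # rules-applier from apertium-ambiguous
        f"./rules-applier {q(locale_id)} {q(t1x_file)} sentences.txt lextor.txt rulesOut.txt",
        # interchunk, postchunk, then transfer plus generation
        f"apertium-interchunk {q(src + '.t2x')} {q(bin_ + '.t2x.bin')} rulesOut.txt interchunk.txt",
        f"apertium-postchunk {q(src + '.t3x')} {q(bin_ + '.t3x.bin')} interchunk.txt postchunk.txt",
        f"apertium-transfer -n {q(src + '.t4x')} {q(bin_ + '.t4x.bin')} postchunk.txt"
        f" | lt-proc -g {q(bin_ + '.autogen.bin')}"
        f" | lt-proc -p {q(bin_ + '.autopgen.bin')} > transfer.txt",
    ]


for command in pipeline_commands("/home/user", "kaz-tur", "kk_KZ",
                                 "/home/user/transferFile.t1x"):
    print(command)
```

The printed lines can be pasted into a shell or saved as a script. Intermediate file names are fixed to the ones used in this tutorial.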
Install and build kenlm
Download and install kenlm by following the steps under 'USAGE' at https://kheafield.com/code/kenlm/
Download a big Turkish corpus from the Wikimedia dumps:
$ wget https://dumps.wikimedia.org/trwikinews/20181020/trwikinews-20181020-pages-articles.xml.bz2
For training, follow these steps:
- Estimating: run bin/lmplz -o 5 <text >text.arpa
- Querying: run bin/build_binary text.arpa text.binary
The Python script score-sentences.py scores the target-language sentences with the language model. It can be found at https://github.com/sevilaybayatli/apertium-ambiguous/tree/master/scripts.
To run the model weight program on the transfer file, as is done in the bash script, use this command:
python score-sentences.py arpa_or_binary_LM_file target_lang_file weights_file
For example:
python2 score-sentences.py text.binary target-sentences.txt weights.txt
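The scoring step amounts to assigning one language-model score to each target-language sentence. The sketch below shows that shape in Python. It is not the code of score-sentences.py, which is not reproduced on this page. The names score_sentences and toy_score are hypothetical, and the toy scorer is only a stand-in so the example runs without a trained model.

```python
from typing import Callable, Iterable, List


def score_sentences(sentences: Iterable[str],
                    score_fn: Callable[[str], float]) -> List[float]:
    """Return one score per sentence, in input order."""
    return [score_fn(sentence.strip()) for sentence in sentences]


def toy_score(sentence: str) -> float:
    """Stand-in scorer: longer sentences get lower scores.

    With the kenlm Python module installed (an assumption), a real scorer
    could be kenlm.Model("text.binary").score, which returns a log10
    probability for a whitespace-tokenised sentence.
    """
    return -float(len(sentence.split()))


target_sentences = ["bu bir deneme\n", "merhaba\n"]
for weight in score_sentences(target_sentences, toy_score):
    print(weight)
```

Passing the scorer as a parameter keeps the file handling independent of the language model, so an ARPA or binary model can be swapped in without touching the loop.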
Install and build yasmet
Download and compile yasmet as follows:
Download yasmet either from https://www-i6.informatik.rwth-aachen.de/web/Software/YASMET.html or from https://github.com/apertium/apertium-lex-tools/blob/master/yasmet.cc
To build and compile, follow the steps below:
g++ -o yasmet yasmet.cc
./yasmet
If the compilation doesn't work, try:
g++ -o yasmet yasmet.cc -std=gnu++98
Then move the binary into the project directory:
$ mv yasmet ./apertium-ambiguous
Run yasmet-formatter to prepare the yasmet datasets. This also generates the analysis output file, along with the best model-weighted translations (scored with the language model) in the file modelWeight.txt and random translations (applying a randomly chosen rule from the transfer file) in the file randomWeight.txt.
To run the yasmet-formatter program:
./yasmet-formatter $localeId transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt $outputFile $datasets
For example:
./yasmet-formatter kk_KZ transferFile.t1x sentences.txt lextor.txt transfer.txt weights.txt sentences.out datasets
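The difference between the two extra output files can be illustrated with a small Python sketch: for one source sentence, the model-weighted translation is the candidate with the highest language-model score, while the random translation is any candidate chosen at random. This is an assumption about the behaviour described above, not the yasmet-formatter code. The function name and the sample data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def pick_best_and_random(candidates: Sequence[str],
                         scores: Sequence[float],
                         rng: random.Random) -> Tuple[str, str]:
    """Return (best-scoring candidate, randomly chosen candidate)."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need one score per candidate translation")
    # Highest language-model score wins; ties go to the earlier candidate.
    best = max(zip(candidates, scores), key=lambda pair: pair[1])[0]
    # Baseline: ignore the scores and pick any candidate.
    return best, rng.choice(list(candidates))


rng = random.Random(0)  # fixed seed so repeated runs pick the same baseline
best, baseline = pick_best_and_random(
    ["translation A", "translation B", "translation C"],
    [-12.5, -9.0, -15.25],
    rng,
)
print(best)
print(baseline)
```

Comparing the two outputs gives a rough idea of how much the language-model weighting helps over an uninformed choice.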
Training and Testing apertium-ambiguous
Training
./yasmet-formatter icu-locale-id transfer-file-path sentences-file-path lextor-file-path transfer-out-file-path(postchunk2-out) model-weights output-file-path datasets-folder-name
For example:
./yasmet-formatter kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt weights.txt test.out datasets
Generate-yasmet-models.sh
Generate the yasmet models from the yasmet datasets, either with the bash script or manually, by running one of the commands below.
Either
bash generate-yasmet-models.sh datasets models
or
./yasmet < dataset-path > model-path
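Doing it manually means repeating the second command once per dataset file. The Python sketch below builds those command lines for a list of dataset file names. It is an illustration, not the contents of generate-yasmet-models.sh. The function name and the ".model" suffix for the model files are assumptions.

```python
from typing import Iterable, List


def yasmet_training_commands(dataset_names: Iterable[str],
                             datasets_dir: str = "datasets",
                             models_dir: str = "models") -> List[str]:
    """Build one './yasmet < dataset > model' command per dataset file."""
    commands = []
    for name in sorted(dataset_names):
        dataset_path = f"{datasets_dir}/{name}"
        # Naming the model after its dataset is an assumption of this sketch.
        model_path = f"{models_dir}/{name}.model"
        commands.append(f"./yasmet < {dataset_path} > {model_path}")
    return commands


for command in yasmet_training_commands(["rule-group-2", "rule-group-1"]):
    print(command)
```

In practice the file names could come from os.listdir("datasets"), and the models directory has to exist before the commands are run.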
Testing
Run beam search on the sentencesFile with each given beam size k. The results for beam size k are written to the file "BeamSearch-k.txt".
./beam-search localeId transferFile.t1x sentencesFile lextorFile transferOutFile modelsFolder k1 k2 k3 ...
For example:
./beam-search kk_KZ transferFile.t1x test.txt lextor.txt transfer.txt models 2 4 8 10
Here test.txt is the test text (source-language text, Kazakh), the output files are BeamSearch-2.txt, BeamSearch-4.txt and so on, and the beam sizes k are 2 4 8 10 or any other numbers.
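For readers unfamiliar with beam search, the following Python sketch shows the general technique on a toy input: a sentence with several ambiguous positions, each offering weighted rule choices, where only the k best partial combinations are kept after each position. It is a generic illustration under simplified assumptions (independent, additive weights), not the beam-search program itself. All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Hypothesis = Tuple[float, Tuple[str, ...]]  # (total weight, chosen rules)


def beam_search(choices: Sequence[Sequence[Tuple[str, float]]],
                k: int) -> List[Hypothesis]:
    """Keep the k highest-weight rule combinations, position by position."""
    if k < 1:
        raise ValueError("beam size must be at least 1")
    beam: List[Hypothesis] = [(0.0, ())]
    for options in choices:
        # Extend every surviving hypothesis with every rule at this position.
        expanded = [(score + weight, rules + (rule,))
                    for score, rules in beam
                    for rule, weight in options]
        # Keep only the k best; the sort is stable, so ties keep their order.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:k]
    return beam


# Two ambiguous positions with two candidate rules each.
choices = [
    [("r1", 2.0), ("r2", 1.0)],
    [("r3", 0.25), ("r4", 1.5)],
]
for score, rules in beam_search(choices, k=2):
    print(score, rules)
```

A larger k explores more combinations at a higher cost, which is why the command above accepts several beam sizes for comparison.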
Enjoy using apertium-ambiguous :)