Difference between revisions of "Using weights for ambiguous rules"

From Apertium
Jump to navigation Jump to search
Line 14: Line 14:
First of all, we take that sentence and give it to the rest of apertium tools biltrans and lextor to get a string of tokens (words) each with its translations and part of speech tags.
First of all, we take that sentence and give it to the rest of apertium tools biltrans and lextor to get a string of tokens (words) each with its translations and part of speech tags.
Now this is will be the real input to our program, we first split these strings into source and target tokens along with there tags, then we try to match these tags with categories from the transfer file as these matches will help us match the tokens to the rules. Second, it was to apply these rules on the matched tokens. If different rules are applied to one token, then we have ambiguity with that word, so we must decide which one to use. And if many tokens have ambiguities that makes the whole sentence has much more ambiguity, as all the possible combinations are equal the multiplication of each number of ambiguous rules of each token.
Now this is will be the real input to our program, we first split these strings into source and target tokens along with there tags, then we try to match these tags with categories from the transfer file as these matches will help us match the tokens to the rules. Second, it was to apply these rules on the matched tokens. If different rules are applied to one token, then we have ambiguity with that word, so we must decide which one to use. And if many tokens have ambiguities that makes the whole sentence has much more ambiguity, as all the possible combinations are equal the multiplication of each number of ambiguous rules of each token.
Our output for this step was to output all the possible combinations of translations of the sentence along with their analysis (output of the rules). the final translation of every combination and finally the weight of each combination (their sum = 1) by using KenLM Language Model Toolkit.
Our output for this step was to output all the possible combinations of translations of the sentence along with their analysis (output of the rules).




Line 20: Line 20:
After get all possible translations of every combination we scored them(their sum = 1) by using language model. In this project we have used KenLM Language Model Toolkit https://kheafield.com/code/kenlm/. Language Model applied on target language Turkish.
After get all possible translations of every combination we scored them(their sum = 1) by using language model. In this project we have used KenLM Language Model Toolkit https://kheafield.com/code/kenlm/. Language Model applied on target language Turkish.
=====Step 4=====
=====Step 4=====
There was a required format to obtain to use it as dataset for an unsupervised machine learner (YASMET). Every dataset will be for a certain pattern that ambiguous rules applied to, where the features will be the different words matched with these patterns, along with the rules number and their precalculated weight. The challenges for making that format was the need to modify and introduce new data-structures into the old code, which was not an easy task. But after finishing it successfully, we now are in the step of translating that data into a table and then feed it to the learner.


=====Step 5=====
=====Step 5=====


=====Step 6=====
=====Step 6=====
Applying beam Search algorithm:

Input :
- beam : beam size
- slTokens : source words indices
- ambigInfo : A data structure has all the ambiguous rules with their corresponding words indices.
- classesWeights : yasmet weights loaded from the model files onto a map.

Output :
- beamTree : the highest weights (beam size) translations of the given source words. Actually the tree has the highest rules indices along with their weights sum.

Algorithm :
- At first we get a set of ambiguous rules applied to some words , then we get the weight of these words for every rule from the yasmet weights.

- We build a new tree for these new words. The tree is just a vector of vectors of rules indices along with their weights sum.

For example let at any iteration we have a set of rules (r for rules and w for word) :
r1 applied on => w1 w2 w3
r2 applied on => w1 w2
r3 applied on => w1
r4 applied on => w2
r5 applied on => w3

We then have 3 different translations for these 3 words.

We then build the tree as follows :
--------> r1 : weight1
------> --------> r2 - r5 : weight2
--------> r3 - r4 - r5 : weight3

- Then we expand our beamTree by the number of the rules we have and then merge the two tree.
So if we have a beam tree say with 6 translations, then with the above tree we just built, we will expand our beamTree to have 6*3 = 18 translations and then merge the tree we just built with beam tree.

- Then we sort those 18 translations descendingly by their sum of weights.

- Then if we reduce our beamTree to have no more the beam size translations. So if the beam size = 8 , we will remove the least 10 translations from our tree.

- We then continue until we finish all the ambiguous rules and the output will be at last a tree with no more than the beam size translations.

- After that we get only the best translation and output it.

====Evaluation====
====Evaluation====

Revision as of 22:19, 24 October 2018

The Idea

The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e., allow a set of rules to match the same general input pattern, as opposed to the existed situation when the first rule in xml transfer file takes exclusive precedence and blocks out all its ambiguous peers during transfer precompilation stage. To decide which rule applies, transfer module would use a set of predefined or pretrained — more specific — weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in xml transfer file that matches the general pattern is still considered the default one and is applied if no weighted patterns matched.

Implementation

We have created transfer-module by using the old transfer-module and rest of apertium tools such as morphological analyser, morphological disambiguator, lexical transfer, lexical selection, morphological generator, and reformattor. We made a module by using c++ that translate texts from Kazakh to Turkish. This module try to give the best Turkish translation for Kazakh by applying advanced algorithms and methods.

step 1

We have a very big corpuses (wiki dumps) with size 640 MB and 320 MB of wiki texts. Since our application takes a sentence as input, we must split our corpus into sentences. First, we process the corpus if it precedes a sentence with capital letter and remove the latin alphabets from the corpus. We then applied a rule-based sentence boundary detection tool called “pragmatic segmenter”https://github.com/diasks2/pragmatic_segmenter/tree/kazakh.

Step 2

First of all, we take that sentence and give it to the rest of apertium tools biltrans and lextor to get a string of tokens (words) each with its translations and part of speech tags. Now this is will be the real input to our program, we first split these strings into source and target tokens along with there tags, then we try to match these tags with categories from the transfer file as these matches will help us match the tokens to the rules. Second, it was to apply these rules on the matched tokens. If different rules are applied to one token, then we have ambiguity with that word, so we must decide which one to use. And if many tokens have ambiguities that makes the whole sentence has much more ambiguity, as all the possible combinations are equal the multiplication of each number of ambiguous rules of each token. Our output for this step was to output all the possible combinations of translations of the sentence along with their analysis (output of the rules).


Step 3

After get all possible translations of every combination we scored them(their sum = 1) by using language model. In this project we have used KenLM Language Model Toolkit https://kheafield.com/code/kenlm/. Language Model applied on target language Turkish.

Step 4

There was a required format to obtain to use it as dataset for an unsupervised machine learner (YASMET). Every dataset will be for a certain pattern that ambiguous rules applied to, where the features will be the different words matched with these patterns, along with the rules number and their precalculated weight. The challenges for making that format was the need to modify and introduce new data-structures into the old code, which was not an easy task. But after finishing it successfully, we now are in the step of translating that data into a table and then feed it to the learner.

Step 5
Step 6

Applying beam Search algorithm:

Input : - beam : beam size - slTokens : source words indices - ambigInfo : A data structure has all the ambiguous rules with their corresponding words indices. - classesWeights : yasmet weights loaded from the model files onto a map.

Output : - beamTree : the highest weights (beam size) translations of the given source words. Actually the tree has the highest rules indices along with their weights sum.

Algorithm : - At first we get a set of ambiguous rules applied to some words , then we get the weight of these words for every rule from the yasmet weights.

- We build a new tree for these new words. The tree is just a vector of vectors of rules indices along with their weights sum.

For example let at any iteration we have a set of rules (r for rules and w for word) : r1 applied on => w1 w2 w3 r2 applied on => w1 w2 r3 applied on => w1 r4 applied on => w2 r5 applied on => w3

We then have 3 different translations for these 3 words.

We then build the tree as follows :

           --------> r1 : weight1

> --------> r2 - r5 : weight2

           --------> r3 - r4 - r5 : weight3

- Then we expand our beamTree by the number of the rules we have and then merge the two tree. So if we have a beam tree say with 6 translations, then with the above tree we just built, we will expand our beamTree to have 6*3 = 18 translations and then merge the tree we just built with beam tree.

- Then we sort those 18 translations descendingly by their sum of weights.

- Then if we reduce our beamTree to have no more the beam size translations. So if the beam size = 8 , we will remove the least 10 translations from our tree.

- We then continue until we finish all the ambiguous rules and the output will be at last a tree with no more than the beam size translations.

- After that we get only the best translation and output it.

Evaluation