Using weights for ambiguous rules

From Apertium
Revision as of 22:11, 24 October 2018 by Purplemoon (talk | contribs)
Jump to navigation Jump to search

The Idea

The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e., allow a set of rules to match the same general input pattern, as opposed to the existed situation when the first rule in xml transfer file takes exclusive precedence and blocks out all its ambiguous peers during transfer precompilation stage. To decide which rule applies, transfer module would use a set of predefined or pretrained — more specific — weighted patterns provided for each group of ambiguous rules. This way, if a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in xml transfer file that matches the general pattern is still considered the default one and is applied if no weighted patterns matched.

Implementation

We have created transfer-module by using the old transfer-module and rest of apertium tools such as morphological analyser, morphological disambiguator, lexical transfer, lexical selection, morphological generator, and reformattor. We made a module by using c++ that translate texts from Kazakh to Turkish. This module try to give the best Turkish translation for Kazakh by applying advanced algorithms and methods.

step 1

We have a very big corpuses (wiki dumps) with size 640 MB and 320 MB of wiki texts. Since our application takes a sentence as input, we must split our corpus into sentences. First, we process the corpus if it precedes a sentence with capital letter and remove the latin alphabets from the corpus. We then applied a rule-based sentence boundary detection tool called “pragmatic segmenter”https://github.com/diasks2/pragmatic_segmenter/tree/kazakh.

Step 2

First of all, we take that sentence and give it to the rest of apertium tools biltrans and lextor to get a string of tokens (words) each with its translations and part of speech tags. Now this is will be the real input to our program, we first split these strings into source and target tokens along with there tags, then we try to match these tags with categories from the transfer file as these matches will help us match the tokens to the rules. Second, it was to apply these rules on the matched tokens. If different rules are applied to one token, then we have ambiguity with that word, so we must decide which one to use. And if many tokens have ambiguities that makes the whole sentence has much more ambiguity, as all the possible combinations are equal the multiplication of each number of ambiguous rules of each token. Our output for this step was to output all the possible combinations of translations of the sentence along with their analysis (output of the rules). the final translation of every combination and finally the weight of each combination (their sum = 1) by using KenLM Language Model Toolkit.


Step 3

After get all possible translations of every combination we scored them(their sum = 1) by using language model. In this project we have used KenLM Language Model Toolkit https://kheafield.com/code/kenlm/. Language Model applied on target language Turkish.

Step 4
Step 5
Step 6

Evaluation