Using weights for ambiguous rules

From Apertium
Revision as of 21:36, 24 October 2018 by Purplemoon (talk | contribs)

The Idea

The idea is to allow Old-Apertium transfer rules to be ambiguous, i.e., to allow a set of rules to match the same general input pattern. This is in contrast to the existing situation, where the first rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers during the transfer precompilation stage.

To decide which rule applies, the transfer module would use a set of predefined or pretrained weighted patterns, more specific than the general pattern, provided for each group of ambiguous rules. This way, if a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in the XML transfer file that matches the general pattern is still considered the default one and is applied if no weighted pattern matches.
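The selection logic described above can be sketched roughly as follows. This is only an illustration, not the module's actual code: the function name `select_rule`, the rule identifiers, the placeholder lemmas and the representation of a weighted pattern as an exact tuple of lemmas are all assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical representation: a weighted pattern is a tuple of lemmas
# plus a weight for each ambiguous rule it has an opinion about.
WeightedPattern = Tuple[Tuple[str, ...], Dict[str, float]]


def select_rule(rules_in_file_order: Sequence[str],
                weighted_patterns: Sequence[WeightedPattern],
                lemmas: Sequence[str]) -> str:
    """Pick one rule from a group of ambiguous rules.

    rules_in_file_order lists the ambiguous rules as they appear in the
    transfer file, so the first entry is the default rule.
    """
    for pattern, weights in weighted_patterns:
        # Assumption: a specific pattern matches when the lemmas are identical.
        if tuple(lemmas) == pattern:
            candidates = [r for r in rules_in_file_order if r in weights]
            if candidates:
                # Highest weight wins; on a tie, max() keeps the earlier rule.
                return max(candidates, key=lambda r: weights[r])
    # No weighted pattern matched: fall back to the default (first) rule.
    return rules_in_file_order[0]


# Made-up rule names and lemmas, for illustration only.
rules = ["rule-a", "rule-b"]
patterns = [(("w1", "w2"), {"rule-a": 0.2, "rule-b": 0.8})]

chosen_specific = select_rule(rules, patterns, ["w1", "w2"])
chosen_default = select_rule(rules, patterns, ["w3"])
```

In this sketch the first call picks the rule with the highest weight for the matching pattern, while the second call falls back to the default rule because no weighted pattern matches.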


Implementation

We have created a new transfer module based on the old transfer module and the rest of the Apertium tools: the morphological analyser, morphological disambiguator, lexical transfer, lexical selection, morphological generator, and reformatter. The module is written in C++ and translates text from Kazakh to Turkish. It tries to give the best Turkish translation of the Kazakh input by choosing among the ambiguous rules.

Step 1

First we take the source sentence and pass it to the Apertium tools biltrans and lextor to get a string of tokens (words), each with its translations and part-of-speech tags. This is the real input to our program. We split these strings into source and target tokens along with their tags, then try to match the tags with categories from the transfer file, as these matches help us match the tokens to the rules.

The second step is to apply these rules to the matched tokens. If different rules apply to one token, that word is ambiguous and we must decide which rule to use. If many tokens are ambiguous, the whole sentence becomes much more ambiguous, because the number of possible combinations equals the product of the numbers of ambiguous rules of each token.

The output of this phase is all the possible combinations of translations of the sentence, along with their analysis (the output of the rules), the final translation of every combination, and the weight of each combination, computed using the KenLM Language Model Toolkit (the weights sum to 1).
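As a rough illustration of the combination count and the normalised weights, here is a minimal sketch. The rule names and raw scores are made up, and `normalize` is a simple stand-in that divides by the total; it does not call KenLM and is not necessarily how the module derives its weights.

```python
from itertools import product
from math import prod
from typing import List, Sequence, Tuple


def enumerate_combinations(rules_per_token: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Return every way of choosing one rule for each token."""
    return list(product(*rules_per_token))


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale positive scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical sentence of three tokens with 2, 1 and 3 candidate rules.
rules_per_token = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]
combinations = enumerate_combinations(rules_per_token)

# The number of combinations is the product of the per-token counts: 2 * 1 * 3.
expected_count = prod(len(rules) for rules in rules_per_token)

# Made-up positive scores, one per combination, standing in for the
# language model scores of the resulting translations.
raw_scores = [1.0, 3.0, 2.0, 2.0, 1.0, 1.0]
weights = normalize(raw_scores)
```

Here `combinations` holds six rule assignments, and `weights` assigns each one a share of the total, so the weights sum to 1 as described above.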


Step 2
Step 3
Step 4
Step 5

Evaluation