Weighted transfer rules at GSoC 2016

From Apertium
Revision as of 13:37, 23 August 2016 by Nikita Medyankin (talk | contribs)

This page serves as the final submission page for the Weighted transfer rules project conducted by Nikita Medyankin at Google Summer of Code 2016.

The Idea

The idea is to allow Apertium transfer rules to be ambiguous, i.e., to allow a set of rules to match the same general input pattern. At present, the first rule in the XML transfer file takes exclusive precedence and blocks out all its ambiguous peers at the transfer precompilation stage.

To decide which rule applies, the transfer module uses a set of predefined or pretrained weighted patterns, more specific than the general pattern, provided for each group of ambiguous rules. If a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in the XML transfer file that matches the general pattern is still considered the default one and is applied if no weighted pattern matches. The transfer weights file can thus be seen as specifying lexicalized or partially lexicalized exceptions to the default rule.
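
The selection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the transfer module (which is written in C++); the function name `choose_rule` and the dictionary layout are hypothetical.

```python
# Hypothetical sketch of rule selection among ambiguous rules.
# weights maps a specific pattern (a tuple of (lemma, tags) pairs)
# to a dictionary {rule_id: weight}; this layout is an assumption.

def choose_rule(rule_ids, pattern, weights):
    """Pick the rule to apply for an ambiguous match.

    rule_ids lists the ambiguous rules in transfer file order,
    so rule_ids[0] is the default rule.
    """
    scores = weights.get(pattern, {})
    # Only rules that belong to this ambiguous group are candidates.
    candidates = {r: w for r, w in scores.items() if r in rule_ids}
    if not candidates:
        # No weighted pattern matched: fall back to the default rule.
        return rule_ids[0]
    # Otherwise the rule with the highest weight for this pattern wins.
    return max(candidates, key=candidates.get)
```

With a pattern listed in the weights, the highest-weighted rule is returned; for any other input the first rule of the group is used, which mirrors the "exceptions to the default rule" reading of the weights file.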

Example language pair

An example language pair was set up for testing and evaluation. The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium-en-es/

The difference between this and the trunk version is that the t1x transfer file has three additional rules, which are ambiguous counterparts of the three rules that define the interaction of an adjacent adjective and noun. In all three original rules, the noun and the adjective are swapped on output, as is usual for Spanish. In the additional rules they are not swapped, which also happens sometimes and is known to depend on the lexical patterns involved.

Transfer module

The code can be located in the weighted-transfer branch, namely at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/

Changes were made to transfer.cc and transfer.h to read the weights file and to use the weights to choose which rule applies to ambiguous input. The wrapper apertium-transfer.cc was also modified to recognize the weights file name provided with the -w option.

Since transfer.cc and transfer.h originally came with little to no comments, comments were added to the crucial parts of the transfer code as well as to the code that deals directly with transfer weights.

This version of the transfer module understands fully lexicalized patterns (i.e., patterns in which every item has a lemma and a full set of tags) as well as partially delexicalized patterns (i.e., patterns in which some tokens lack lemmas while retaining the full set of tags). However, it does not support wildcards in tags, only full tag patterns.
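
As a rough illustration of the two kinds of patterns, the Python sketch below matches a pattern against a sequence of tokens. Representing a missing lemma as `None` and comparing tags as exact strings are assumptions made for this example; the actual matching is done inside transfer.cc and may differ in detail.

```python
# Hypothetical sketch: matching fully lexicalized and partially
# delexicalized patterns. A token is a (lemma, tags) pair; a pattern
# item is the same, except that its lemma may be None (assumed here
# to stand for "lemma not specified").

def item_matches(item, token):
    lemma, tags = item
    token_lemma, token_tags = token
    # Tags must match in full: no wildcards are supported.
    if tags != token_tags:
        return False
    # A missing lemma matches any lemma; otherwise lemmas must be equal.
    return lemma is None or lemma == token_lemma

def pattern_matches(pattern, tokens):
    # The pattern must cover the token sequence item by item.
    return len(pattern) == len(tokens) and all(
        item_matches(item, token) for item, token in zip(pattern, tokens)
    )
```

Under these assumptions, the pattern "this / (any adjective with tags adj.sint) / software" matches "this new software", whereas a pattern with the shortened tag string "adj" does not, since tags are compared as whole strings.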

Weights file format

A DTD specification was developed for the weights file format and added to the branch. Below is a small example of a weights file. All mutually ambiguous rules are listed as subelements of the same 'rule-group' element. Each rule copies its 'comment' and 'id' attributes from the XML transfer file. The 'id' attribute was introduced specifically to match a rule in the weights file with the same rule in the transfer file. It is optional, unique, and needs to be added only to the ambiguous rules. Each rule in the weights file also has an 'md5' attribute, which is added during weights learning and is the md5 sum of the original rule text with whitespace removed. Its purpose is to make it possible to check, during language pair installation, whether the weights file actually corresponds to the transfer file.

Each rule, in turn, has any number of 'pattern' subelements with a 'weight' attribute, which specify particular patterns for the rule. In the example below, the same pattern is listed under both rules, with a weight of ~0.05 for the first rule and ~0.95 for the second, so the second rule in the group is preferred for this pattern.

<?xml version='1.0' encoding='UTF-8'?>
<transfer-weights>
  <rule-group>
    <rule comment="REGLA: DET ADJ NOM" id="det-adj-nom" md5="897a67e4ffadec9b7fd515ce0a8d453b">
      <pattern weight="0.05124922803710481">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
    <rule comment="REGLA: DET ADJ NOM no-swap-version" id="det-adj-nom-ns" md5="13f1c5ed0615ae8f9d3142aed7a3855f">
      <pattern weight="0.9487507719628953">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
  </rule-group>
</transfer-weights>
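
To show how a file of this shape can be consumed, here is a minimal Python sketch that reads it with the standard `xml.etree.ElementTree` module. The function name `load_weights` and the returned data layout are hypothetical; the transfer module itself reads the file in C++.

```python
import xml.etree.ElementTree as ET

def load_weights(xml_text):
    """Return one dictionary per rule-group: {pattern: {rule_id: weight}}.

    A pattern is a tuple of (lemma, tags) pairs taken from the
    pattern-item elements. This layout is an assumption of the sketch.
    """
    root = ET.fromstring(xml_text)
    groups = []
    for group in root.iter("rule-group"):
        table = {}
        for rule in group.findall("rule"):
            rule_id = rule.get("id")
            for pattern in rule.findall("pattern"):
                key = tuple(
                    (item.get("lemma"), item.get("tags"))
                    for item in pattern.findall("pattern-item")
                )
                table.setdefault(key, {})[rule_id] = float(pattern.get("weight"))
        groups.append(table)
    return groups

# A shortened version of the example above (md5 and comment attributes omitted).
SAMPLE = """<transfer-weights>
  <rule-group>
    <rule id="det-adj-nom">
      <pattern weight="0.05124922803710481">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
    <rule id="det-adj-nom-ns">
      <pattern weight="0.9487507719628953">
        <pattern-item lemma="this" tags="det.dem.sg"/>
        <pattern-item lemma="new" tags="adj.sint"/>
        <pattern-item lemma="software" tags="n.sg"/>
      </pattern>
    </rule>
  </rule-group>
</transfer-weights>"""

groups = load_weights(SAMPLE)
```

Grouping the weights by pattern rather than by rule makes it straightforward to compare the weights of all ambiguous rules for one matched pattern.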

The DTD for transfer rules was modified to add the 'id' attribute to the 'rule' element; this attribute is used in the corresponding weights file to identify the rules.

Weights learning script

A python3 script was written to learn rule weights from a corpus. It works in two modes, monolingual and parallel. Its source code is located at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium-weights-learner/

In monolingual mode, it requires a t1x file with ambiguous rules, a source language corpus, and a pretrained language model of the target language. The target language model may be trained on a target language corpus that does not have to be related to the source language corpus in any way. The tools folder contains a number of simple helper scripts for preparing the language model, and the README file contains instructions. The workflow of the script in monolingual mode is as follows:

  1. Tag source language corpus.
  2. For each sentence, calculate its LRLM (left-to-right longest match) coverage by the transfer rules.
  3. If there are any ambiguous chunks in the coverage, segment the sentence into parts containing one ambiguous chunk each.
    1. For each sentence segment, translate it in the default way.
    2. For each sentence segment, translate it in all possible ways and concatenate each variant with the default translation of the other segments. Store the results.
  4. Score all variants of all sentences against the language model and normalize the scores for the variants of each sentence obtained for the same segment. Store them as scores for the corresponding ambiguous chunk patterns.
  5. Sum up the scores for each ambiguous chunk pattern and make weights xml.
  6. Prune the weights xml.
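
Steps 3 to 5 of the monolingual workflow can be sketched as follows. This is a simplified illustration, not the learner script itself: the function names are hypothetical, the language model scores are assumed to be non-negative numbers (for example probabilities), and translation and scoring are taken as already done.

```python
from collections import defaultdict

def build_variants(defaults, alternatives):
    """Step 3: combine each variant of one segment with the default
    translations of all other segments.

    defaults[i] is the default translation of segment i;
    alternatives[i] maps rule ids to translations of segment i.
    Yields (segment_index, rule_id, full_sentence).
    """
    for i, options in enumerate(alternatives):
        for rule_id, text in options.items():
            parts = defaults[:i] + [text] + defaults[i + 1:]
            yield i, rule_id, " ".join(parts)

def normalize(scores):
    """Step 4: scale the scores of the variants of one segment to sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        return {}
    return {rule_id: score / total for rule_id, score in scores.items()}

def accumulate(observations):
    """Step 5: sum normalized scores per ambiguous chunk pattern.

    observations yields (pattern, {rule_id: language_model_score}),
    one entry per ambiguous segment found in the corpus.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for pattern, scores in observations:
        for rule_id, share in normalize(scores).items():
            totals[pattern][rule_id] += share
    return totals
```

Normalizing per segment before summing keeps long sentences, whose raw language model scores are much smaller, from being outweighed by short ones.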

In parallel mode, the weights learning script requires a parallel corpus stored in two text files that match line by line. The workflow of the script in parallel mode is as follows:

  1. Tag source language corpus.
  2. For each sentence, calculate its LRLM coverage by the transfer rules.
  3. If there are any ambiguous chunks in the coverage, translate them and look them up in the corresponding target language sentence. If the translation is found, score the chunk pattern with 1.
  4. Sum up the scores for each ambiguous chunk pattern and make weights xml.
  5. Prune the weights xml.
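
Steps 2 and 3 of the parallel workflow can be sketched in Python as below. Everything here is a simplification under stated assumptions: rules are reduced to sequences of category names, tokens are given as a list of categories, and the lookup in the target sentence is a case-insensitive substring test. The learner script may handle all three differently.

```python
from collections import defaultdict

def lrlm_coverage(categories, rules):
    """Left-to-right longest match coverage of a tagged sentence.

    categories: one category name per token (an assumed simplification).
    rules: {rule_id: tuple of category names}.
    Returns a list of (start, end, rule_ids); rule_ids is empty for a
    token not covered by any rule and has several entries for an
    ambiguous chunk.
    """
    coverage = []
    pos = 0
    while pos < len(categories):
        best_len, best_rules = 0, []
        for rule_id, pattern in rules.items():
            n = len(pattern)
            if n == 0 or tuple(categories[pos:pos + n]) != tuple(pattern):
                continue
            if n > best_len:
                best_len, best_rules = n, [rule_id]
            elif n == best_len:
                best_rules.append(rule_id)
        if best_len == 0:
            coverage.append((pos, pos + 1, []))
            pos += 1
        else:
            coverage.append((pos, pos + best_len, best_rules))
            pos += best_len
    return coverage

def score_chunk(pattern, translations, target_sentence, totals):
    """Add 1 to the score of every rule whose translation of the chunk
    occurs in the target sentence (assumed: substring, ignoring case)."""
    target = target_sentence.lower()
    for rule_id, text in translations.items():
        if text.lower() in target:
            totals[pattern][rule_id] += 1.0

totals = defaultdict(lambda: defaultdict(float))
```

The same coverage step is used in monolingual mode; only the way ambiguous chunks are scored differs between the two modes.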

More information can be found in the README file.

For now, the weights learning script can only learn fully lexicalized patterns, i.e., patterns in which every item has a lemma and a full set of tags. However, partially delexicalized patterns (i.e., patterns in which some tokens lack lemmas while still retaining the full set of tags) can be added to the resulting weights file manually.

Evaluation

A simple script for evaluating the resulting weights file was written and is located at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium-weights-learner/testing/

To be done

The following issues should be addressed in further work:

  • Add an option to learn the weights for partially delexicalized patterns to the weights learning script.
  • Extensively test the impact of the weighted rules on the overall quality and speed of translation, using large corpora for training and evaluation.
  • Add md5 sum verification.
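
The md5 verification in the last item could look roughly like the Python sketch below, given that the 'md5' attribute is the md5 sum of the original rule text with whitespace removed. How exactly the rule text is serialized and encoded before hashing is an assumption here (UTF-8 text of the rule element), so the sums produced by this sketch are not guaranteed to equal those written by the learner script. The function names are hypothetical.

```python
import hashlib

def rule_md5(rule_text):
    """md5 sum of the rule text with all whitespace removed.

    Assumes the text is hashed as UTF-8; the learner script may
    serialize the rule differently.
    """
    stripped = "".join(rule_text.split())
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()

def find_mismatches(transfer_rules, weights_md5):
    """Return ids of rules whose stored md5 does not match the transfer file.

    transfer_rules: {rule_id: rule text from the transfer file}
    weights_md5: {rule_id: md5 attribute from the weights file}
    """
    return sorted(
        rule_id
        for rule_id, expected in weights_md5.items()
        if rule_id not in transfer_rules
        or rule_md5(transfer_rules[rule_id]) != expected
    )
```

Because whitespace is removed before hashing, reindenting a rule in the transfer file does not invalidate the weights file, while any change to the rule's content does.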