Weighted transfer rules at GSoC 2016

This page serves as the final submission page for the Weighted transfer rules project carried out by Nikita Medyankin at Google Summer of Code 2016.

The Idea

The idea is to allow Apertium transfer rules to be ambiguous, i.e., to allow a set of rules to match the same general input pattern. To decide which rule applies, the transfer module uses a set of predefined or pretrained weighted patterns, more specific than the general one, provided for each group of ambiguous rules. This way, if a specific pattern matches, the rule with the highest weight for that pattern is applied.

The first rule in the transfer file that matches the general pattern is still considered the default one and is applied if no weighted pattern matches. This way, the transfer weights file can be seen as specifying lexicalized or partially lexicalized exceptions to the default rule.
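To illustrate, the selection logic amounts to roughly the following (a minimal Python sketch with assumed data structures; this is not the actual transfer module code):

  def select_rule(chunk_pattern, rule_group):
      """rule_group is a list of (rule_id, {specific_pattern: weight}) pairs
      whose rules all match the same general pattern; by convention the
      first entry is the default rule."""
      best_rule, best_weight = None, float("-inf")
      for rule_id, weighted_patterns in rule_group:
          weight = weighted_patterns.get(chunk_pattern)
          if weight is not None and weight > best_weight:
              best_rule, best_weight = rule_id, weight
      # Fall back to the first (default) rule if no weighted pattern matched.
      return best_rule if best_rule is not None else rule_group[0][0]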

Example language pair

An example language pair was put up for the purposes of testing and evaluation. The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium-en-es/

The difference from the trunk version is that the t1x transfer file has three additional rules, which are ambiguous counterparts to the three rules defining the interaction of an adjacent adjective and noun. In all three original rules, the noun and adjective are swapped on output, as is usual for Spanish (e.g., "a red car" becomes "un coche rojo"). In the additional rules, they are not swapped, which also happens in Spanish (e.g., "a great man" becomes "un gran hombre") and is known to depend on the lexical patterns involved.

Weights file format

The format is defined by the DTD at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/apertium/transfer-weights.dtd

tbd: some explanation
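Pending that explanation, the general shape of a weights file can be sketched roughly as follows (a hypothetical example only: element and attribute names are assumptions based on the description on this page and have not been checked against the DTD):

  <?xml version="1.0" encoding="UTF-8"?>
  <transfer-weights>
    <rule-group>
      <!-- First rule of the group: the default adjective-noun rule. -->
      <rule id="adj-nom">
        <pattern weight="0.9">
          <pattern-item lemma="red" tags="adj"/>
          <pattern-item lemma="car" tags="n.sg"/>
        </pattern>
      </rule>
      <!-- Ambiguous counterpart that keeps the source word order. -->
      <rule id="adj-nom-noswap">
        <pattern weight="0.1">
          <pattern-item lemma="red" tags="adj"/>
          <pattern-item lemma="car" tags="n.sg"/>
        </pattern>
      </rule>
    </rule-group>
  </transfer-weights>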

Transfer module

The code can be found at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium/

This version of the transfer module understands fully lexicalized patterns (i.e., patterns in which every item has a lemma and a full set of tags) as well as partially delexicalized patterns (i.e., patterns in which some items lack a lemma while retaining a full set of tags). However, it does not support wildcards in tags, only full tag patterns.
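For example, in schematic notation, a fully lexicalized pattern and a partially delexicalized counterpart could look like this (illustrative examples, not taken from the actual test data):

  great<adj> man<n><sg>    (fully lexicalized: every item has a lemma and full tags)
  <adj> man<n><sg>         (partially delexicalized: the adjective item has no lemma)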

Weights learning script

A Python 3 script was written to enable learning rule weights from a corpus. Its source code is located at https://svn.code.sf.net/p/apertium/svn/branches/weighted-transfer/apertium-weights-learner/ and it works in two modes, monolingual and parallel.

In monolingual mode, it requires a t1x file with ambiguous rules, a source language corpus, and a pretrained language model for the target language. The target language model may be trained on a target language corpus that need not be related to the source language corpus in any way. A number of simple helper scripts for preparing the language model are located in the tools folder, along with instructions in the README file. The workflow of the script in monolingual mode is as follows (a sketch of the scoring step is given after the list):

  1. Tag the source language corpus.
  2. For each sentence, calculate its LRLM (left-to-right, longest-match) coverage by the transfer rules.
  3. If there are any ambiguous chunks in the coverage, segment the sentence into parts containing one ambiguous chunk each.
    1. For each sentence segment, translate it in the default way.
    2. For each sentence segment, translate it in all possible ways and concatenate each variant with the default translations of the other segments. Store the results.
  4. Score all variants of all sentences against the language model and normalize the scores of the variants obtained for the same segment. Store them as scores for the corresponding ambiguous chunk patterns.
  5. Sum up the scores for each ambiguous chunk pattern and generate the weights XML.
  6. Prune the weights XML.
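The scoring and accumulation steps (items 4 and 5) amount to roughly the following (a minimal Python sketch; lm_score() is a stand-in for the real pretrained language model, and all names are assumptions rather than the actual script code):

  from collections import defaultdict
  from math import exp

  def lm_score(sentence):
      # Stand-in for a real target language model score (assumption: the
      # actual script queries a pretrained LM for a log-probability).
      return -2.0 * len(sentence.split())

  # weights[(rule_id, chunk_pattern)] accumulates normalized scores
  # over the whole corpus.
  weights = defaultdict(float)

  def score_chunk(chunk_pattern, variants):
      """variants maps rule_id to the full sentence translated with that rule
      applied to the ambiguous chunk (other segments translated by default)."""
      raw = {rule: exp(lm_score(text)) for rule, text in variants.items()}
      total = sum(raw.values())
      for rule, score in raw.items():
          weights[(rule, chunk_pattern)] += score / total

  # Example: one ambiguous chunk of one sentence, two candidate rules.
  score_chunk("great<adj> man<n><sg>",
              {"adj-nom": "fue un hombre grande",
               "adj-nom-noswap": "fue un gran hombre"})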

In parallel mode, the weights learning script requires a parallel corpus stored in two text files that match line by line. The workflow of the script in parallel mode is as follows (again, a sketch of the scoring step is given after the list):

  1. Tag the source language corpus.
  2. For each sentence, calculate its LRLM coverage by the transfer rules.
  3. If there are any ambiguous chunks in the coverage, translate them and look them up in the corresponding target language sentence. If the translation is found, score the chunk pattern with 1.
  4. Sum up the scores for each ambiguous chunk pattern and generate the weights XML.
  5. Prune the weights XML.
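The lookup-based scoring (item 3) can be sketched along the same lines (names and example data are again assumptions; the weights dictionary is the one from the previous sketch):

  def score_chunk_parallel(chunk_pattern, variants, reference):
      """variants maps rule_id to the translation of the ambiguous chunk with
      that rule; reference is the corresponding target language sentence."""
      for rule, translation in variants.items():
          # A rule scores 1 for this pattern if its translation of the
          # chunk is found in the reference sentence.
          if translation in reference:
              weights[(rule, chunk_pattern)] += 1.0

  # Example: only the non-swapping variant occurs in the reference.
  score_chunk_parallel("great<adj> man<n><sg>",
                       {"adj-nom": "hombre grande",
                        "adj-nom-noswap": "gran hombre"},
                       "fue un gran hombre")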

More information can be found in the README file.

For now, the weights learning script only supports learning fully lexicalized patterns, i.e., only items with a lemma and a full set of tags are allowed in patterns. However, partially delexicalized patterns (i.e., with some tokens missing lemmas while still retaining a full set of tags) can be added to the obtained weights file manually.
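For instance, in terms of the hypothetical format sketched above, this would amount to adding a pattern item with an empty lemma attribute (again an assumption, not verified against the DTD):

  <pattern-item lemma="" tags="adj"/>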

Evaluation

tbd