Ideas for Google Summer of Code/Weighted transfer rules

At the moment, Apertium transfer rules are combinations of fixed-length patterns and actions. When more than one rule matches the same input, the conflict is "solved" by selecting the first rule that matches. The idea of this task is to allow rules to conflict and to use weights to select the most adequate rules for a given input.

Unlexicalised rule weights
Depend only on the category of the input words.
Lexicalised rule weights
Depend on one or more of the lemmas of the input words. For example:
ID | Rule | Input | Output | Frequency
1 | de | memoria de traducción | translation memory | 90
2 | de 's | memoria de traducción | translation's memory | 0
3 | de of | memoria de traducción | memory of translation | 0

So here we would have something like:

  • Rule 1 (x=memoria, y=traducción, weight=1.0)
  • Rule 2 (x=memoria, y=traducción, weight=0.0)
  • Rule 3 (x=memoria, y=traducción, weight=0.0)
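
The weights above are consistent with simply normalising the frequencies of the conflicting rules for one lemma pair. A minimal sketch of that normalisation follows; the function name `weights_from_frequencies` and the uniform fallback for unseen lemma pairs are assumptions made for illustration, not part of Apertium.

```python
def weights_from_frequencies(freqs):
    """Normalise raw frequencies of conflicting rules into weights summing to 1."""
    total = sum(freqs.values())
    if total == 0:
        # No evidence for this lemma pair: fall back to uniform weights
        # (an assumption; the page does not say what should happen here).
        return {rule: 1.0 / len(freqs) for rule in freqs}
    return {rule: count / total for rule, count in freqs.items()}


# Frequencies from the table above, for x=memoria, y=traducción.
freqs = {1: 90, 2: 0, 3: 0}
weights = weights_from_frequencies(freqs)
print(weights)  # {1: 1.0, 2: 0.0, 3: 0.0}
```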

Example

Transfer rules:

ID | Rule | Input | Output
1 | de | memoria de traducción | translation memory
2 | de 's | la hermana de mi novia | my girlfriend's sister
3 | de of | el estado de la cuestión | the state of the art

Training

  • Take a big corpus.
  • For each sentence:
    • Apply the transfer rules.
    • For each possible combination of conflicting transfer rules:
      • Translate the sentence and score the translation on a target-language language model.
    • Each sentence gets a total count of 1. This count is shared between the transfer rules according to the scores of the translations they produce.
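
The "each possible combination" step can be sketched as a Cartesian product over the places in the sentence where rules conflict. This is only an illustration: the function name `rule_combinations` and the input layout (one list of candidate rule IDs per ambiguous match) are assumptions, not an existing Apertium interface.

```python
from itertools import product


def rule_combinations(conflicts):
    """Return one tuple of rule IDs per way of resolving every conflict.

    `conflicts` holds, for each ambiguous pattern match in the sentence,
    the IDs of the rules that could apply there (hypothetical layout).
    """
    return list(product(*conflicts))


# Two ambiguous matches in the sentence, three candidate rules for each.
conflicts = [[1, 2, 3], [1, 2, 3]]
combos = rule_combinations(conflicts)
print(len(combos))  # 9
print(combos[:3])   # [(1, 1), (1, 2), (1, 3)]
```

The number of combinations grows exponentially with the number of ambiguous matches, which is one reason the question of scoring one rule at a time comes up later on this page.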

Example

La canciller se reúne hoy con el presidente de EE UU para limar asperezas y preparar la cumbre del miércoles con Putin.

Rule (1st match) | Rule (2nd match) | Translation | LM score | Fractional count (%)
1 | 1 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday summit] with Putin. | -74.55 | 0.39
2 | 1 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday summit] with Putin. | -69.51 | 60.71
3 | 1 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday summit] with Putin. | -74.47 | 0.43
1 | 2 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday's summit] with Putin. | -75.02 | 0.25
2 | 2 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday's summit] with Putin. | -69.98 | 37.94
3 | 2 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday's summit] with Putin. | -74.94 | 0.27
1 | 3 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.88 | 0.0
2 | 3 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -77.84 | 0.01
3 | 3 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.80 | 0.0
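
The last column of the example appears to be the sentence's count of 1 (shown as a percentage) divided between the nine candidates in proportion to the exponentiated language-model score, that is, a softmax over natural-log scores. The sketch below works under that assumption and also sums the shares per rule and match position; the names `fractional_counts` and `per_rule` are illustrative only.

```python
import math
from collections import defaultdict

# (rule at first match, rule at second match, language-model score),
# copied from the example above.
rows = [
    (1, 1, -74.55), (2, 1, -69.51), (3, 1, -74.47),
    (1, 2, -75.02), (2, 2, -69.98), (3, 2, -74.94),
    (1, 3, -82.88), (2, 3, -77.84), (3, 3, -82.80),
]


def fractional_counts(scores):
    """Share a count of 1 between candidates in proportion to exp(score)."""
    best = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - best) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


shares = fractional_counts([score for _, _, score in rows])

# Sum the shares per (match position, rule ID): each rule collects the
# shares of every candidate translation it took part in.
per_rule = defaultdict(float)
for (first, second, _), share in zip(rows, shares):
    per_rule[(0, first)] += share
    per_rule[(1, second)] += share

print([round(100 * s, 2) for s in shares])
print({key: round(value, 4) for key, value in sorted(per_rule.items())})
```

Under this reading, rule 2 collects almost all of the count at the first match, while at the second match rule 1 gets a little over 60% and rule 2 most of the rest.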

You can then feed the fractional counts to some supervised machine learning program to get appropriate weights.

Questions

  • How to calculate the paths?
    • With optimal coverage, or by just taking the left-to-right longest match (LRLM) and only calculating paths for rules that conflict?
  • For lexicalised weights:
    • What is the function assigning cost to each lexical combination of N1 and N2?
  • Could we score one rule at a time, by keeping the other parts fixed?

Tasks

  • Implement in C++ and integrate into Apertium.

Coding challenge

  • Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages using left-to-right longest match (so a "det" rule and a "noun" rule won't match "det noun" input if there are "det noun" rules).
  • Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages, including alternatives where a combination of shorter rules matches the same input as a longer rule (so a "det" rule and a "noun" rule will be included in the combinations even if there are "det noun" rules).
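
The difference between the two kinds of coverage can be sketched on toy data. The code below does not read the XML transfer format: patterns are given directly as tuples of category names, each pattern is assumed to be unique, and a word that no pattern matches is treated as a chunk of its own. All names (`matches_at`, `lrlm_coverage`, `all_coverages`) are hypothetical.

```python
def matches_at(tags, i, patterns):
    """Return the patterns that match the tag sequence starting at position i."""
    return [p for p in patterns if tuple(tags[i:i + len(p)]) == p]


def lrlm_coverage(tags, patterns):
    """Left-to-right longest match: at each position take the longest pattern."""
    coverage, i = [], 0
    while i < len(tags):
        # An unmatched word becomes a one-word chunk (assumption).
        candidates = matches_at(tags, i, patterns) or [(tags[i],)]
        longest = max(candidates, key=len)
        coverage.append(longest)
        i += len(longest)
    return coverage


def all_coverages(tags, patterns, i=0):
    """Return every way to cover tags[i:] with consecutive pattern matches."""
    if i == len(tags):
        return [[]]
    candidates = matches_at(tags, i, patterns) or [(tags[i],)]
    return [[p] + rest
            for p in candidates
            for rest in all_coverages(tags, patterns, i + len(p))]


patterns = [("det",), ("noun",), ("det", "noun")]
tags = ["det", "noun"]
print(lrlm_coverage(tags, patterns))  # [('det', 'noun')]
print(all_coverages(tags, patterns))  # [[('det',), ('noun',)], [('det', 'noun')]]
```

A full solution to the first challenge would additionally list every rule that shares the longest pattern at a position, since those are the rules that conflict under LRLM.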
