Ideas for Google Summer of Code/Weighted transfer rules

At the moment, Apertium transfer rules are combinations of fixed-length patterns and actions. Conflicts are "solved" by selecting the first rule that matches. The idea of this task is to allow rules to conflict, and to use weights to select the most adequate rule for a given input.

Unlexicalised rule weights
Depend only on the category of the input words.
Lexicalised rule weights
Depend on one or more of the lemmas of the input words, e.g.
ID | Rule | Input | Output | Frequency
1 | $x$ de $y$ → $y$ $x$ | memoria de traducción | translation memory | 90
2 | $x$ de $y$ → $y$ 's $x$ | memoria de traducción | translation's memory | 0
3 | $x$ de $y$ → $x$ of $y$ | memoria de traducción | memory of translation | 0

So here we would have something like:

  • Rule 1 (x=memoria, y=traducción, weight=1.0)
  • Rule 2 (x=memoria, y=traducción, weight=0.0)
  • Rule 3 (x=memoria, y=traducción, weight=0.0)
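As a minimal sketch of how such weights could be derived and used, the following normalises the frequencies from the table per lemma pair and picks the highest-weighted rule. The function names and the dictionary layout are assumptions made for illustration; they are not part of Apertium.

```python
from collections import defaultdict

# Frequencies from the table above, keyed by (rule ID, lemma x, lemma y).
FREQUENCIES = {
    (1, "memoria", "traducción"): 90,
    (2, "memoria", "traducción"): 0,
    (3, "memoria", "traducción"): 0,
}


def lexicalised_weights(frequencies):
    """Normalise frequencies so that the weights of each (x, y) pair sum to 1."""
    totals = defaultdict(float)
    for (_, x, y), freq in frequencies.items():
        totals[(x, y)] += freq
    weights = {}
    for (rule_id, x, y), freq in frequencies.items():
        total = totals[(x, y)]
        # A pair that was never observed gets weight 0.0 for every rule.
        weights[(rule_id, x, y)] = freq / total if total else 0.0
    return weights


def best_rule(weights, x, y):
    """Return the ID of the highest-weighted rule for (x, y), or None if unseen."""
    candidates = {rid: w for (rid, wx, wy), w in weights.items() if (wx, wy) == (x, y)}
    return max(candidates, key=candidates.get) if candidates else None


WEIGHTS = lexicalised_weights(FREQUENCIES)
print(best_rule(WEIGHTS, "memoria", "traducción"))
```

A real implementation would also need a fallback (for example, unlexicalised weights) for lemma pairs that were not seen in training.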

Example

Transfer rules:

ID | Rule | Input | Output
1 | $x$ de $y$ → $y$ $x$ | memoria de traducción | translation memory
2 | $x$ de $y$ → $y$ 's $x$ | la hermana de mi novia | my girlfriend's sister
3 | $x$ de $y$ → $x$ of $y$ | el estado de la cuestión | the state of the art
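To make the conflict concrete, here is a rough sketch in which each of the three rules is a function over the already-translated words x and y. It ignores determiners, agreement and lexical selection, all of which real transfer rules have to handle; the names `RULES` and `candidates` are hypothetical.

```python
# Each rule rewrites an "x de y" pattern, given the target-language
# words for x and y. All three match the same input, so they conflict.
RULES = {
    1: lambda x, y: f"{y} {x}",     # x de y -> y x
    2: lambda x, y: f"{y}'s {x}",   # x de y -> y 's x
    3: lambda x, y: f"{x} of {y}",  # x de y -> x of y
}


def candidates(x, y):
    """Return the output of every conflicting rule, keyed by rule ID."""
    return {rule_id: rule(x, y) for rule_id, rule in RULES.items()}


for rule_id, output in candidates("memory", "translation").items():
    print(rule_id, output)
```

Without weights, only the first matching rule would ever fire; with weights, all three outputs are available and the best one can be chosen for each input.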

Training

  • Take a big corpus
  • For each sentence:
    • Apply transfer rules
    • For each possible combination of transfer rules
      • Translate the sentence and score the result with a language model
      • Each sentence gets a count of 1. This count is shared between the transfer rules according to the scores of the combinations they appear in.
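The enumeration step could be sketched as follows, using the candidate outputs from the example that follows. The sentence template, the `enumerate_candidates` helper and the `toy_score` function are assumptions for illustration; in practice the score would come from a real language model.

```python
from itertools import product

# The target sentence with one placeholder per ambiguous span.
TEMPLATE = ("The chancellor gathers today with {} for mend fences "
            "and prepare {} with Putin.")

# For each ambiguous span: rule ID -> output of that rule.
SPAN_OPTIONS = [
    {1: "the U.S. president",
     2: "the U.S.'s president",
     3: "the president of the U.S."},
    {1: "the Wednesday summit",
     2: "the Wednesday's summit",
     3: "the summit of the Wednesday"},
]


def enumerate_candidates(template, span_options, score_fn):
    """Build one (rule combination, sentence, score) row per combination."""
    rows = []
    for combo in product(*(sorted(options) for options in span_options)):
        outputs = [options[rule_id] for options, rule_id in zip(span_options, combo)]
        sentence = template.format(*outputs)
        rows.append((combo, sentence, score_fn(sentence)))
    return rows


def toy_score(sentence):
    """Placeholder scorer (shorter is better); NOT a language model."""
    return -float(len(sentence.split()))


ROWS = enumerate_candidates(TEMPLATE, SPAN_OPTIONS, toy_score)
print(len(ROWS))  # 3 rules x 3 rules
```

The number of combinations grows exponentially with the number of ambiguous spans, which is one motivation for the questions further down.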
Example
Source sentence: La canciller se reúne hoy con el presidente de EE UU para limar asperezas y preparar la cumbre del miércoles con Putin.

Rule (span 1) | Rule (span 2) | Translation | Language model score | Normalised score (%)
1 | 1 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday summit] with Putin. | -74.55 | 0.39
2 | 1 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday summit] with Putin. | -69.51 | 60.71
3 | 1 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday summit] with Putin. | -74.47 | 0.43
1 | 2 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday's summit] with Putin. | -75.02 | 0.25
2 | 2 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday's summit] with Putin. | -69.98 | 37.94
3 | 2 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday's summit] with Putin. | -74.94 | 0.27
1 | 3 | The chancellor gathers today with [the U.S. president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.88 | 0.0
2 | 3 | The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the summit of the Wednesday] with Putin. | -77.84 | 0.01
3 | 3 | The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the summit of the Wednesday] with Putin. | -82.80 | 0.0
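The last column of the table above appears to be consistent with exponentiating the language model scores (treated as natural-log probabilities) and normalising them over the nine candidates, expressed as a percentage. This is an inference from the numbers, not something the page states. A small sketch that reproduces it:

```python
import math

# (rule for first span, rule for second span) -> language model score,
# copied from the table above.
LM_SCORES = {
    (1, 1): -74.55, (2, 1): -69.51, (3, 1): -74.47,
    (1, 2): -75.02, (2, 2): -69.98, (3, 2): -74.94,
    (1, 3): -82.88, (2, 3): -77.84, (3, 3): -82.80,
}


def normalise(scores):
    """Softmax over log scores; the resulting shares sum to 1."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {combo: math.exp(score - top) for combo, score in scores.items()}
    total = sum(exps.values())
    return {combo: value / total for combo, value in exps.items()}


SHARES = normalise(LM_SCORES)
for combo, share in sorted(SHARES.items(), key=lambda item: -item[1]):
    print(combo, f"{100 * share:.2f}")
```

Subtracting the maximum before exponentiating avoids underflow, since values such as exp(-82.88) are very small.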

You can then feed the fractional counts to some supervised machine learning program to get appropriate weights.
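One plausible reading of "fractional counts" is that each rule, at each ambiguous span, receives the sum of the normalised scores of all combinations in which it was applied. The sketch below computes this from the percentages in the example; the marginalisation itself and the names used are assumptions.

```python
from collections import defaultdict

# (rule for first span, rule for second span) -> normalised score in percent,
# copied from the last column of the example.
SHARES_PERCENT = {
    (1, 1): 0.39, (2, 1): 60.71, (3, 1): 0.43,
    (1, 2): 0.25, (2, 2): 37.94, (3, 2): 0.27,
    (1, 3): 0.0,  (2, 3): 0.01,  (3, 3): 0.0,
}


def fractional_counts(shares_percent):
    """Sum the share of every combination into (span index, rule ID) counts."""
    counts = defaultdict(float)
    for combo, percent in shares_percent.items():
        for span_index, rule_id in enumerate(combo):
            counts[(span_index, rule_id)] += percent / 100.0
    return dict(counts)


COUNTS = fractional_counts(SHARES_PERCENT)
for key in sorted(COUNTS):
    print(key, round(COUNTS[key], 4))
```

Under this reading, the sentence contributes about 0.99 of a count to rule 2 for "presidente de EE UU" and about 0.62 to rule 1 for "cumbre del miércoles"; the counts for each span sum to the sentence's total count of 1.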

Questions

  • How to calculate the paths?
    • With optimal coverage, or by just taking the LRLM (left-to-right longest match) path and only calculating paths for rules which conflict?
  • For lexicalised weights:
    • What is the function assigning cost to each lexical combination of N1 and N2?
  • Could we score one rule at a time, by keeping the rest fixed?
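To illustrate the last question, the sketch below enumerates only those rule combinations that differ from a fixed baseline in at most one span. This is just one possible interpretation of "keeping part fixed"; the function name and the choice of baseline are hypothetical.

```python
def one_at_a_time(baseline, rule_ids):
    """Return the baseline plus every combination that changes exactly one span."""
    combos = [tuple(baseline)]
    for span_index, current in enumerate(baseline):
        for rule_id in rule_ids:
            if rule_id != current:
                variant = list(baseline)
                variant[span_index] = rule_id
                combos.append(tuple(variant))
    return combos


# Two ambiguous spans, three conflicting rules, baseline = rule 1 everywhere.
print(one_at_a_time((1, 1), [1, 2, 3]))
```

With k spans and r rules this scores 1 + k(r - 1) candidates instead of r^k (5 instead of 9 here, 11 instead of 243 for five spans), at the cost of ignoring interactions between rules applied at different spans.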

Tasks

  • Implement weighted arcs in lttoolbox (C++) and integrate into Apertium.
  • Implement weighted sections in lttoolbox.
  • Implement recursive paradigms in lttoolbox.

Coding challenge

  • Set up a pair and train the existing weighted transfer rule code.


See also