Ideas for Google Summer of Code/Weighted transfer rules

ID	Rule	Input	Output	Frequency
1	$x$ de $y$ → $x$ $y$	memoria de traducción	translation memory	90
2	$x$ de $y$ → $y$'s $x$	memoria de traducción	translation's memory	0
3	$x$ de $y$ → $x$ of $y$	memoria de traducción	memory of translation	0

So here we would have something like:

Rule 1 (x=memoria, y=traducción, weight=1.0)
Rule 2 (x=memoria, y=traducción, weight=0.0)
Rule 3 (x=memoria, y=traducción, weight=0.0)
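
The weights above are consistent with each rule's share of the observed frequencies (90 out of 90 for rule 1). The following is a minimal sketch of that normalisation, assuming relative frequency is the intended weighting. The function name `relative_weights` and the uniform fallback for pairs with no observations are illustrative assumptions, not part of Apertium.

```python
def relative_weights(frequencies):
    """Turn per-rule frequencies for one (x, y) pair into weights summing to 1."""
    total = sum(frequencies.values())
    if total == 0:
        # Assumption: with no evidence at all, fall back to uniform weights.
        return {rule: 1.0 / len(frequencies) for rule in frequencies}
    return {rule: count / total for rule, count in frequencies.items()}


# Frequencies from the table above for x=memoria, y=traducción.
weights = relative_weights({1: 90, 2: 0, 3: 0})
print(weights)  # {1: 1.0, 2: 0.0, 3: 0.0}
```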

Example

Transfer rules:

ID	Rule	Input	Output
1	$x$ de $y$ → $x$ $y$	memoria de traducción	translation memory
2	$x$ de $y$ → $y$'s $x$	la hermana de mi novia	my girlfriend's sister
3	$x$ de $y$ → $x$ of $y$	el estado de la cuestión	the state of the art

Training

Take a big corpus
For each sentence:
- Apply transfer rules
- For each possible combination of transfer rules
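  - Translate the sentence and score the translation with a target-language model
  - Each sentence gets a total count of 1. This count is shared between the transfer rules in proportion to the language-model scores of the translations they produce.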
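
The enumeration step can be sketched as below. This is only an outline under stated assumptions: `translate` and `lm_score` are hypothetical callables standing in for the transfer module and a target-language model, and `slots` (one list of applicable rule IDs per ambiguous span) is an assumed input format.

```python
from itertools import product


def score_combinations(slots, translate, lm_score):
    """Score every combination of rules over the ambiguous spans of a sentence.

    slots     -- one list of applicable rule IDs per ambiguous span
    translate -- hypothetical callable: combination -> translated sentence
    lm_score  -- hypothetical callable: sentence -> language-model log score
    """
    scores = {}
    for combination in product(*slots):
        scores[combination] = lm_score(translate(combination))
    return scores


# Toy stand-ins so the sketch runs; a real setup would call the transfer
# module and a target-language model here.
def toy_translate(combination):
    return " ".join(f"rule{rule_id}" for rule_id in combination)


def toy_lm_score(sentence):
    return -float(len(sentence))


scores = score_combinations([[1, 2, 3], [1, 2, 3]], toy_translate, toy_lm_score)
print(len(scores))  # 9 combinations for two spans with three rules each
```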

Example

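Source sentence: La canciller se reúne hoy con el presidente de EE UU para limar asperezas y preparar la cumbre del miércoles con Putin.

The two rule IDs in the first column are the rules applied to the first and second bracketed phrase, respectively.

Rules	Translation	LM score	Fractional count (%)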
1 1	The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday summit] with Putin.	-74.55	0.39
2 1	The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday summit] with Putin.	-69.51	60.71
3 1	The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday summit] with Putin.	-74.47	0.43
1 2	The chancellor gathers today with [the U.S. president] for mend fences and prepare [the Wednesday's summit] with Putin.	-75.02	0.25
2 2	The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the Wednesday's summit] with Putin.	-69.98	37.94
3 2	The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the Wednesday's summit] with Putin.	-74.94	0.27
1 3	The chancellor gathers today with [the U.S. president] for mend fences and prepare [the summit of the Wednesday] with Putin.	-82.88	0.0
2 3	The chancellor gathers today with [the U.S.'s president] for mend fences and prepare [the summit of the Wednesday] with Putin.	-77.84	0.01
3 3	The chancellor gathers today with [the president of the U.S.] for mend fences and prepare [the summit of the Wednesday] with Putin.	-82.80	0.0
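
The last two columns of the example are consistent with exponentiating each language-model score and normalising over all nine candidates, so that the counts add up to 1 (100%). A minimal sketch of that calculation, assuming the scores are natural-log probabilities; `fractional_counts` is an illustrative name:

```python
import math


def fractional_counts(log_scores):
    """Share a count of 1 among candidates in proportion to exp(log score)."""
    best = max(log_scores)
    # Subtract the maximum before exponentiating to avoid underflow.
    unnormalised = [math.exp(score - best) for score in log_scores]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


# Language-model scores from the example above, in row order.
scores = [-74.55, -69.51, -74.47, -75.02, -69.98, -74.94, -82.88, -77.84, -82.80]
counts = fractional_counts(scores)
print([round(100 * c, 2) for c in counts])
```

Run on the scores above, this should reproduce the percentages in the final column up to rounding.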

You can then feed the fractional counts to some supervised machine learning program to get appropriate weights.
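
One possible way to prepare those counts for a learner is to sum, for each ambiguous span, the fractional counts of all combinations that use a given rule there. This is only one reading of how the count is "shared between the transfer rules"; the function name and the `(span index, rule ID)` keys are assumptions made for illustration.

```python
from collections import defaultdict


def per_rule_counts(combination_counts):
    """Sum the fractional count of each combination into (span, rule) totals."""
    totals = defaultdict(float)
    for combination, count in combination_counts.items():
        for span_index, rule_id in enumerate(combination):
            totals[(span_index, rule_id)] += count
    return dict(totals)


# Fractional counts (in percent) from the example, keyed by the rule applied
# to the first and second bracketed phrase.
example = {
    (1, 1): 0.39, (2, 1): 60.71, (3, 1): 0.43,
    (1, 2): 0.25, (2, 2): 37.94, (3, 2): 0.27,
    (1, 3): 0.0, (2, 3): 0.01, (3, 3): 0.0,
}
totals = per_rule_counts(example)
for (span_index, rule_id), count in sorted(totals.items()):
    print(f"span {span_index}, rule {rule_id}: {count:.2f}%")
```

Under this reading, rule 2 receives almost all of the count for the first phrase, while rules 1 and 2 split the count for the second.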

Questions

How to calculate the paths?
- With optimal coverage, or by just taking the left-to-right longest match (LRLM) and only calculating paths for rules which conflict?
For lexicalised weights:
- What is the function assigning a cost to each lexical combination of N1 and N2?
Could we score one rule at a time, keeping the rest fixed?

Tasks

Implement in C++ and integrate into Apertium.

Coding challenge

Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages, using left-to-right longest match (so a "det" rule and a "noun" rule won't match "det noun" input if there are "det noun" rules).
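
As a possible starting point for the matching step only, the sketch below leaves out the XML parsing. It assumes the patterns have already been extracted as tuples of category names and the input as a list of categories; `lrlm_coverage` and the toy rule IDs are hypothetical.

```python
def lrlm_coverage(tokens, rules):
    """Greedy left-to-right longest match.

    tokens -- list of category names, one per input word
    rules  -- dict mapping rule ID to a tuple of category names
    """
    coverage = []
    i = 0
    while i < len(tokens):
        best_id, best_len = None, 0
        for rule_id, pattern in rules.items():
            n = len(pattern)
            # Strictly longer matches win; on ties the earlier rule is kept.
            if n > best_len and tuple(tokens[i:i + n]) == tuple(pattern):
                best_id, best_len = rule_id, n
        if best_id is None:
            # No rule starts here: emit the word on its own, marked "?".
            coverage.append(("?", tokens[i:i + 1]))
            i += 1
        else:
            coverage.append((best_id, tokens[i:i + best_len]))
            i += best_len
    return coverage


rules = {1: ("det",), 2: ("noun",), 3: ("det", "noun")}
result = lrlm_coverage(["det", "noun", "verb"], rules)
print(result)  # [(3, ['det', 'noun']), ('?', ['verb'])]
```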

Write a program (in Python or C++) that reads the XML transfer format patterns and applies them to an input stream, printing out all the possible coverages, including alternatives where a combination of shorter rules matches a longer rule (so a "det" rule and a "noun" rule will be included in the combinations even if there are "det noun" rules):

 (? I) (? think that) (? he) (210 might have finished it) (? yesterday)
 (? I) (? think that) (? he) (? might) (174 have finished) (? it) (? yesterday)
 (? I) (? think that) (? he) (? might) (175 have finished it) (? yesterday)
 (? I) (? think that) (? he) (? might) (? have) (161 finished) (? it) (? yesterday)

where the numbers are rule numbers and ? marks a span to which no rule is applied.
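
The enumeration can be sketched recursively. Everything here is an assumption made for illustration: the category names attached to each word are invented, the patterns for rules 210, 174, 175 and 161 are guesses chosen to fit the example, and XML parsing is omitted. The sketch also assumes, based on the example output, that a word may be left unmatched only when no one-word rule applies to it.

```python
def all_coverages(tokens, rules):
    """List every way of covering the input with rules or unmatched words.

    tokens -- list of (surface form, category) pairs
    rules  -- dict mapping rule ID to a tuple of category names
    """
    cats = [cat for _, cat in tokens]

    def expand(i):
        if i == len(cats):
            yield []
            return
        single_word_rule = False
        for rule_id, pattern in rules.items():
            n = len(pattern)
            if n > 0 and tuple(cats[i:i + n]) == tuple(pattern):
                single_word_rule = single_word_rule or n == 1
                for rest in expand(i + n):
                    yield [(rule_id, i, i + n)] + rest
        if not single_word_rule:
            # Assumption: skipping is allowed only without a one-word rule.
            for rest in expand(i + 1):
                yield [("?", i, i + 1)] + rest

    return list(expand(0))


def show(tokens, coverage):
    """Format a coverage in the bracketed style used above."""
    return " ".join(
        "({} {})".format(rule_id, " ".join(s for s, _ in tokens[a:b]))
        for rule_id, a, b in coverage
    )


# Hypothetical categories and patterns, invented to mirror the example.
tokens = [("I", "prn"), ("think that", "vb_cnj"), ("he", "prn"),
          ("might", "modal"), ("have", "have"), ("finished", "pp"),
          ("it", "obj"), ("yesterday", "adv")]
rules = {210: ("modal", "have", "pp", "obj"), 174: ("have", "pp"),
         175: ("have", "pp", "obj"), 161: ("pp",)}
lines = [show(tokens, coverage) for coverage in all_coverages(tokens, rules)]
print("\n".join(lines))
```

With these toy patterns the sketch yields four coverages, matching the four lines of the example.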
