Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair

From Apertium
Revision as of 12:50, 14 February 2014 by Jimregan (talk | contribs) (I think this is what Mikel intended)

Fuzzy matching in translation memories

Imagine that the new sentence is:

s’ = “Connect the printer to the computer”

And we find a fuzzy match (score 83%):

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”
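The page does not say how the 83% score is computed. One common choice that reproduces it, assumed here, is a word-level edit distance normalised by the length of the longer sentence: one substitution in six words gives 1 - 1/6, or about 83%. The function names below are illustrative only.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(s_new, s_tm):
    """Score in [0, 1]; the normalisation is an assumption, not taken from the page."""
    a, b = s_new.lower().split(), s_tm.lower().split()
    if not a and not b:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b))


score = fuzzy_match_score("Connect the printer to the computer",
                          "Connect the scanner to the computer")
print(f"{score:.0%}")  # 83%
```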

What would be t’, the translation of s’?

Obtain t’ by “patching” or “repairing” t

A way to repair would be:

  • determine what changed from s to s’
  • decide which parts of t correspond to changed parts in s
  • translate what changed from s to s’
  • change the corresponding parts in t to obtain one or more approximate t’

Determine what changed from s to s’

So first we align:

s’ = “Connect the printer to the computer”

s = “Connect the scanner to the computer”

and find the changes: “scanner” in s has been replaced by “printer” in s’.

Then we cut “clippings” around the changes:

  • “the scanner” → “the printer”
  • “the scanner to” → “the printer to”
  • “scanner to” → “printer to”
  • “connect the scanner” → “connect the printer”
  • “scanner to the computer” → “printer to the computer”
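A minimal sketch of this step, assuming whitespace tokenisation and using Python's standard difflib for the alignment. The names changed_span, clippings and max_context are hypothetical. The sketch enumerates every clipping with up to max_context words of context on each side, so it produces a superset of the five clippings listed above.

```python
from difflib import SequenceMatcher


def changed_span(s_tokens, s_new_tokens):
    """Smallest spans (0-based, end-exclusive) covering all differences, or None."""
    sm = SequenceMatcher(a=s_tokens, b=s_new_tokens, autojunk=False)
    ops = [op for op in sm.get_opcodes() if op[0] != "equal"]
    if not ops:
        return None
    return ops[0][1], ops[-1][2], ops[0][3], ops[-1][4]


def clippings(s, s_new, max_context=3):
    """Pairs (clipping of s, clipping of s') around the changed span."""
    a, b = s.split(), s_new.split()
    span = changed_span(a, b)
    if span is None:
        return []
    i1, i2, j1, j2 = span
    pairs = []
    for left in range(min(max_context, i1) + 1):
        for right in range(min(max_context, len(a) - i2) + 1):
            if left == 0 and right == 0:
                continue  # assumption: a clipping carries at least one context word
            pairs.append((" ".join(a[i1 - left:i2 + right]),
                          " ".join(b[j1 - left:j2 + right])))
    return pairs


for old, new in clippings("Connect the scanner to the computer",
                          "Connect the printer to the computer"):
    print(f"“{old}” → “{new}”")
```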

Decide which parts of t correspond to changed parts in s

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”

Translate the s clippings and match them in t.

(The translations given here are from Google)

  • “the scanner” [2,3] → “l’escàner” [2,3]
  • “the scanner to” [2,4] → “l’escàner a” [2,4]
  • “scanner to” [3,4] → “escàner a” [3,4]
  • “connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
  • “scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]

All match! (This may not always be the case).
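Locating a translated clipping in t can be sketched as a token sub-sequence search. The tokeniser below is an assumption: it treats an elided article such as “l’” as a token of its own, which is what the positions in the example imply (“l’escàner” occupies [2,3]). Positions are 1-based and inclusive, as in the list above.

```python
import re


def tokenize(text):
    """Lower-cased tokens; a trailing apostrophe closes a token (l’ + escàner)."""
    return re.findall(r"[^\s’']+[’']?", text.lower())


def find_span(clip_translation, target):
    """Return the (start, end) position of the clipping in target, or None."""
    clip, tokens = tokenize(clip_translation), tokenize(target)
    if not clip:
        return None
    for start in range(len(tokens) - len(clip) + 1):
        if tokens[start:start + len(clip)] == clip:
            return start + 1, start + len(clip)
    return None  # the translated clipping does not occur in t


t = "Connecteu l’escàner a l’ordinador"
print(find_span("l’escàner a", t))    # (2, 4)
print(find_span("la impressora", t))  # None
```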

Translate what changed from s to s’

Translate the s’ clippings that differ from the s clippings.

  • “the printer”→ “la impressora”
  • “the printer to” → “la impressora a”
  • “printer to” → “impressora per”
  • “connect the printer” → “connecteu la impressora”
  • “printer to the computer” → “impressora a l’ordinador”

Pair the translation of each s clipping with the translation of the corresponding s’ clipping to build “repair operators”.

  • “l’escàner” [2,3] → “la impressora”
  • “l’escàner a” [2,4] → “la impressora a”
  • “escàner a” [3,4] → “impressora per”
  • “connecteu l’escàner” [1,3] → “connecteu la impressora”
  • “escàner a l’ordinador” [3,6] → “impressora a l’ordinador”

Some operators share words on both sides (the overlap, for example “a” in “l’escàner a” → “la impressora a”). Overlap is desirable.
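One way to quantify the overlap of an operator, offered only as a sketch, is to count the tokens its two sides share at the start and at the end. The function name and the tokeniser are assumptions, not part of the proposal.

```python
import re


def tokenize(text):
    """Lower-cased tokens; a trailing apostrophe closes a token (l’ + escàner)."""
    return re.findall(r"[^\s’']+[’']?", text.lower())


def overlap(old, new):
    """Return (left, right): tokens shared at the start and at the end of both sides."""
    o, n = tokenize(old), tokenize(new)
    shortest = min(len(o), len(n))
    left = 0
    while left < shortest and o[left] == n[left]:
        left += 1
    right = 0
    # Do not count a token twice as both left and right overlap.
    while right < shortest - left and o[-1 - right] == n[-1 - right]:
        right += 1
    return left, right


print(overlap("l’escàner a", "la impressora a"))                     # (0, 1)
print(overlap("connecteu l’escàner", "connecteu la impressora"))     # (1, 0)
print(overlap("escàner a l’ordinador", "impressora a l’ordinador"))  # (0, 3)
```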

Change the corresponding parts in t to obtain one or more approximate t’

t = “Connecteu l’escàner a l’ordinador”

  • t’(a) = “Connecteu la impressora a l’ordinador”
  • t’(b) = “Connecteu la impressora a l’ordinador”
  • t’(c) = “Connecteu l’impressora per l’ordinador”
  • t’(d) = “Connecteu la impressora a l’ordinador”
  • t’(e) = “Connecteu l’impressora a l’ordinador”
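Applying an operator amounts to replacing the tokens at its position in t with the new translation. The sketch below assumes the same tokenisation as the positions in the example (an elided “l’” is its own token) and restores the initial capital of t; apply_operator is a hypothetical name.

```python
import re


def tokenize(text):
    """Tokens with case preserved; a trailing apostrophe closes a token."""
    return re.findall(r"[^\s’']+[’']?", text)


def detokenize(tokens):
    """Join tokens with spaces, except directly after an apostrophe."""
    out = ""
    for tok in tokens:
        if out and not out.endswith(("’", "'")):
            out += " "
        out += tok
    return out


def apply_operator(target, start, end, replacement):
    """Replace tokens start..end (1-based, inclusive) of target with replacement."""
    tokens = tokenize(target)
    repaired = detokenize(tokens[:start - 1] + tokenize(replacement) + tokens[end:])
    if target[:1].isupper():
        repaired = repaired[:1].upper() + repaired[1:]
    return repaired


t = "Connecteu l’escàner a l’ordinador"
operators = [
    (2, 3, "la impressora"),             # (a)
    (2, 4, "la impressora a"),           # (b)
    (3, 4, "impressora per"),            # (c)
    (1, 3, "connecteu la impressora"),   # (d)
    (3, 6, "impressora a l’ordinador"),  # (e)
]
for start, end, new in operators:
    print(apply_operator(t, start, end, new))
```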

Which are the best repairs?

Probably the best repairs would come from:

  • the longest possible repair operators (even longer than in the example above)
  • those having the most overlap (or context)
  • those having overlaps on both sides
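These criteria could be combined into a sort key. The ordering of the criteria below (overlap on both sides first, then total overlap, then length) is purely an assumption, and the lengths and overlap counts are hand-counted from the five operators of the example.

```python
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    length: int  # tokens replaced in t
    left: int    # tokens shared at the start of both sides
    right: int   # tokens shared at the end of both sides


def rank_key(op):
    """Higher is better; the priority of the criteria is an assumption."""
    both_sides = op.left > 0 and op.right > 0
    return (both_sides, op.left + op.right, op.length)


operators = [
    Operator("a", 2, 0, 0),  # “l’escàner” → “la impressora”
    Operator("b", 3, 0, 1),  # “l’escàner a” → “la impressora a”
    Operator("c", 2, 0, 0),  # “escàner a” → “impressora per”
    Operator("d", 3, 1, 0),  # “connecteu l’escàner” → “connecteu la impressora”
    Operator("e", 4, 0, 3),  # “escàner a l’ordinador” → “impressora a l’ordinador”
]
ranked = sorted(operators, key=rank_key, reverse=True)
print([op.name for op in ranked])  # "e" comes first under this key
```

Under this particular key operator (e) ranks first because of its three-token right overlap, although t’(e) differs from t’(a), t’(b) and t’(d). How the criteria should be weighted is left open.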

When should repairs be used?

Repairs should be attempted only for high fuzzy-match scores. The minimum score could be a parameter when calling this functionality.
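As an illustration of such a parameter in a command-line tool, here is a sketch using Python's argparse. The option name --min-score and its default of 0.8 are hypothetical; the page does not specify either.

```python
import argparse


def build_parser():
    """Command-line options for a hypothetical fuzzy-match repair tool."""
    parser = argparse.ArgumentParser(
        description="Repair fuzzy matches from a translation memory.")
    parser.add_argument(
        "--min-score", type=float, default=0.8,
        help="lowest fuzzy-match score (0 to 1) for which a repair is attempted")
    return parser


def should_repair(score, min_score):
    """Attempt a repair only when the match is close enough."""
    return score >= min_score


args = build_parser().parse_args(["--min-score", "0.8"])
print(should_repair(0.83, args.min_score))  # True
print(should_repair(0.60, args.min_score))  # False
```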