Difference between revisions of "Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair"

From Apertium

Revision as of 13:51, 14 February 2014

Fuzzy matching in translation memories

Imagine that the new sentence is:

s' = “Connect the printer to the computer”

And we find a fuzzy match (score 83%):

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”

What would be t', the translation of s'?

Obtain t’ by “patching” or “repairing” t

A way to repair would be:

  • determine what changed from s to s’
  • decide which parts of t correspond to changed parts in s
  • translate what changed from s to s’
  • change the corresponding parts in t to obtain one or more approximate t’

Determine what changed from s to s’

So first we align s’ and s:

s’ = “Connect the printer to the computer”

s = “Connect the scanner to the computer”

and find the change: “scanner” in s corresponds to “printer” in s’.

Then we cut “clippings” around the changes:

  • “the scanner” → “the printer”
  • “the scanner to” → “the printer to”
  • “scanner to” → “printer to”
  • “connect the scanner” → “connect the printer”
  • “scanner to the computer” → “printer to the computer”
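The alignment and clipping steps can be sketched in Python. This is only an illustration under assumptions that the text does not state: words are split on whitespace, the alignment comes from the standard-library `difflib.SequenceMatcher`, all differences are merged into one changed region, and the helper names `changed_spans` and `clippings` and the `max_context` parameter are invented here.

```python
from difflib import SequenceMatcher


def changed_spans(old_words, new_words):
    """Return the changed region as two 0-based half-open spans (old, new), or None."""
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not ops:
        return None
    # Merge everything from the first to the last difference into one region.
    return (ops[0][1], ops[-1][2]), (ops[0][3], ops[-1][4])


def clippings(words, span, max_context=3):
    """List (start, end, text) subsegments that contain span plus some context.

    start and end are 1-based inclusive word positions.
    """
    start, end = span
    found = []
    for left in range(max_context + 1):
        for right in range(max_context + 1):
            lo, hi = start - left, end + right
            # Require at least one context word and stay inside the sentence.
            if (left or right) and lo >= 0 and hi <= len(words):
                found.append((lo + 1, hi, " ".join(words[lo:hi])))
    return found


s = "Connect the scanner to the computer".split()
s_new = "Connect the printer to the computer".split()
old_span, new_span = changed_spans(s, s_new)
# Both lists are built with the same context widths, so they pair up one to one.
for (i, j, old_text), (_, _, new_text) in zip(clippings(s, old_span), clippings(s_new, new_span)):
    print(f"[{i},{j}] {old_text} -> {new_text}")
```

With `max_context=3` the output includes the five clippings listed above, plus a few others; how much context to allow is a design choice left open here.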

Decide which parts of t correspond to changed parts in s

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”

Translate the s clippings and match them in t.

(The translations given here are from Google)

  • “the scanner” [2,3] → “l’escàner” [2,3]
  • “the scanner to” [2,4] → “l’escàner a” [2,4]
  • “scanner to” [3,4] → “escàner a” [3,4]
  • “connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
  • “scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]

All match! (This may not always be the case).
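Matching a translated clipping in t amounts to finding a word subsequence and reporting its position. The sketch below assumes, as the bracketed positions above suggest, that an elided article such as “l’” counts as a word of its own, and that matching ignores case; `tokenize` and `find_span` are illustrative names, not part of any Apertium tool.

```python
import re


def tokenize(text):
    """Split text into lowercase words; an elided form such as l’ is its own token."""
    return re.findall(r"\w+’?", text.lower().replace("'", "’"))


def find_span(haystack, needle):
    """Return the 1-based inclusive (start, end) of needle in haystack, or None."""
    if not needle:
        return None
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return (i + 1, i + len(needle))
    return None


t = tokenize("Connecteu l’escàner a l’ordinador")
for clipping in ["l’escàner", "l’escàner a", "escàner a",
                 "connecteu l’escàner", "escàner a l’ordinador"]:
    print(clipping, find_span(t, tokenize(clipping)))
```

A clipping whose translation does not occur in t gives `None`, which is the case the remark above warns about.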

Translate what changed from s to s’

Translate the s’ clippings, that is, those that differ from the s clippings.

  • “the printer”→ “la impressora”
  • “the printer to” → “la impressora a”
  • “printer to” → “impressora per”
  • “connect the printer” → “connecteu la impressora”
  • “printer to the computer” → “impressora a l’ordinador”

Match translations of s’ clippings to translations of s clippings to build “repair operators”.

  • “l’escàner” [2,3] → “la impressora”
  • “l’escàner a” [2,4] → “la impressora a”
  • “escàner a” [3,4] → “impressora per”
  • “connecteu l’escàner” [1,3] → “connecteu la impressora”
  • “escàner a l’ordinador” [3,6] → “impressora a l’ordinador”

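The words that appear unchanged on both sides of an operator (for example “connecteu” or “a l’ordinador”) are its overlap with the rest of t; overlap is desirable.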

Change the corresponding parts in t to obtain one or more approximate t’

t = “Connecteu l’escàner a l’ordinador”

  • t’(a) = “Connecteu la impressora a l’ordinador”
  • t’(b) = “Connecteu la impressora a l’ordinador”
  • t’(c) = “Connecteu l’impressora per l’ordinador”
  • t’(d) = “Connecteu la impressora a l’ordinador”
  • t’(e) = “Connecteu l’impressora a l’ordinador”
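Applying a repair operator means replacing the words at its positions in t with its right-hand side. A minimal sketch follows, assuming the same tokenization as the bracketed positions (elided “l’” is a separate word) and restoring the sentence-initial capital as a simplification; `apply_operator` and `detokenize` are illustrative names.

```python
import re


def tokenize(text):
    """Split text into words; an elided form such as l’ is its own token."""
    return re.findall(r"\w+’?", text.replace("'", "’"))


def detokenize(tokens):
    """Join words with spaces, attaching an elided form to the following word."""
    out = ""
    for tok in tokens:
        out += tok if (not out or out.endswith("’")) else " " + tok
    return out


def apply_operator(t_tokens, span, replacement):
    """Replace the 1-based inclusive span of t_tokens with the replacement words."""
    start, end = span
    new = tokenize(replacement)
    # Simplification: keep the capital letter when the first word is replaced.
    if start == 1 and new and t_tokens[0][:1].isupper():
        new[0] = new[0][:1].upper() + new[0][1:]
    return t_tokens[:start - 1] + new + t_tokens[end:]


t = tokenize("Connecteu l’escàner a l’ordinador")
operators = {
    "a": ((2, 3), "la impressora"),
    "b": ((2, 4), "la impressora a"),
    "c": ((3, 4), "impressora per"),
    "d": ((1, 3), "connecteu la impressora"),
    "e": ((3, 6), "impressora a l’ordinador"),
}
candidates = {name: detokenize(apply_operator(t, span, text))
              for name, (span, text) in operators.items()}
for name, sentence in candidates.items():
    print(f"t’({name}) = {sentence}")
```

Each operator is applied to t on its own, so every operator yields one candidate t’.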

Which are the best repairs?

Probably the best repairs would come from:

  • the longest possible repair operators (even longer than in the example above)
  • those having most overlap (or context)
  • those having overlaps on both sides
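One possible way to turn these three criteria into a ranking is sketched below. The order of the criteria (overlap on both sides first, then total overlap, then length) is an assumption made for illustration, as are the names `overlap` and `rank_key`. Operator f is made up to show an overlap on both sides; it does not come from the example above.

```python
import re


def tokenize(text):
    """Split text into lowercase words; an elided form such as l’ is its own token."""
    return re.findall(r"\w+’?", text.lower().replace("'", "’"))


def overlap(old, new):
    """Count the unchanged words at the left and right edges of an operator."""
    limit = min(len(old), len(new))
    left = 0
    while left < limit and old[left] == new[left]:
        left += 1
    right = 0
    while right < limit - left and old[-1 - right] == new[-1 - right]:
        right += 1
    return left, right


def rank_key(old, new):
    """Assumed priority: overlap on both sides, then total overlap, then length."""
    left, right = overlap(old, new)
    return (left > 0 and right > 0, left + right, len(old))


operators = [
    ("a", "l’escàner", "la impressora"),
    ("b", "l’escàner a", "la impressora a"),
    ("c", "escàner a", "impressora per"),
    ("d", "connecteu l’escàner", "connecteu la impressora"),
    ("e", "escàner a l’ordinador", "impressora a l’ordinador"),
    # Hypothetical longer operator, invented for this sketch only.
    ("f", "connecteu l’escàner a", "connecteu la impressora a"),
]
ranked = sorted(operators,
                key=lambda op: rank_key(tokenize(op[1]), tokenize(op[2])),
                reverse=True)
for name, old, new in ranked:
    print(name, old, "->", new)
```

This key is a heuristic ordering only; it says nothing about whether the resulting t’ is a good translation.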

When should repairs be used?

Only for high fuzzy match scores. This could be a parameter when calling this functionality.
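As an illustration of such a parameter, the sketch below computes a fuzzy match score and compares it with a threshold. The page does not say how the score is computed; this assumes a word-level edit distance normalised by the longer sentence, which gives 83% for the example (one substitution in six words). The default threshold of 0.8 and the names `fuzzy_score` and `should_repair` are arbitrary choices for this sketch.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_score(source, candidate):
    """Return a similarity between 0 and 1 (assumed definition, see above)."""
    a, b = source.lower().split(), candidate.lower().split()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def should_repair(score, threshold=0.8):
    """Attempt a repair only when the match is at least as good as the threshold."""
    return score >= threshold


score = fuzzy_score("Connect the printer to the computer",
                    "Connect the scanner to the computer")
print(f"{score:.0%}", should_repair(score))
```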

Coding challenge

Write (in some scripting language of your choice) a command-line program that takes an Apertium language pair, a source-language sentence S, and a target-language sentence T, and outputs the set of pairs of subsegments (s,t) such that s is a subsegment of S, t a subsegment of T and t is the Apertium translation of s or vice-versa (a subsegment is a sequence of whole words).
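A possible starting point for the core of such a program is sketched below. It is not a complete solution: the translator is passed in as a callable, words are split on whitespace, and the toy lookup table merely stands in for a real call to Apertium. The names `subsegments` and `matching_pairs` are made up for this sketch.

```python
def subsegments(sentence):
    """Return every sequence of consecutive whole words in the sentence."""
    words = sentence.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}


def matching_pairs(S, T, translate, translate_back=None):
    """Collect (s, t) pairs where t translates s, or s translates t."""
    src, tgt = subsegments(S), subsegments(T)
    pairs = set()
    for s in src:
        t = translate(s)
        if t in tgt:
            pairs.add((s, t))
    if translate_back is not None:
        for t in tgt:
            s = translate_back(t)
            if s in src:
                pairs.add((s, t))
    return pairs


# Toy stand-in for the translator; a real program would call Apertium here.
toy = {"the scanner": "l’escàner", "to": "a", "the computer": "l’ordinador"}
pairs = matching_pairs("Connect the scanner to the computer",
                       "Connecteu l’escàner a l’ordinador",
                       translate=toy.get)
for s, t in sorted(pairs):
    print(f"({s}, {t})")
```

Translating every subsegment separately costs a number of translator calls that grows quadratically with sentence length, so a real implementation may want to limit subsegment length or batch the calls.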

More info

Ask User:mlforcada for more information