Revision as of 14:20, 14 February 2014

1 Fuzzy matching in translation memories
2 Obtain t’ by “patching” or “repairing” t
3 Determine what changed from s to s’
4 Decide which parts of t correspond to changed parts in s
5 Translate what changed from s to s’
6 Change the corresponding parts in t to obtain one or more approximate t’
7 Which are the best repairs?
8 When should repairs be used?
9 Coding challenge
10 More info

Fuzzy matching in translation memories

Imagine that the new sentence is:

s’ = “Connect the printer to the computer”

And we find a fuzzy match (score 83%):

s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”

What would be t’, the translation of s’?
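The page does not say how the 83% score is computed. A minimal sketch, assuming a word-level edit distance normalised by the length of the longer sentence (the function names here are hypothetical), reproduces that figure for this example:

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(s_new, s_tm):
    """Assumed score: 1 - (word edit distance / length of longer sentence)."""
    a, b = s_new.split(), s_tm.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest


score = fuzzy_match_score("Connect the printer to the computer",
                          "Connect the scanner to the computer")
print(f"{score:.0%}")  # 83%
```

One substituted word out of six gives 1 - 1/6, which rounds to 83%. Real translation memory tools may use other formulas.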

Obtain t’ by “patching” or “repairing” t

A way to repair would be:

determine what changed from s to s’
decide which parts of t correspond to changed parts in s
translate what changed from s to s’
change the corresponding parts in t to obtain one or more approximate t’

Determine what changed from s to s’

So first we align:

s’ = “Connect the printer to the computer”
s = “Connect the scanner to the computer”

and find the change: “scanner” (word 3 of s) is replaced by “printer” in s’.

Then we cut “clippings” around the changes:

“the scanner” → “the printer”
“the scanner to” → “the printer to”
“scanner to” → “printer to”
“connect the scanner” → “connect the printer”
“scanner to the computer” → “printer to the computer”
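The alignment and clipping steps can be sketched with Python's standard difflib module. This is only an illustration: the function name clippings and the max_len limit are assumptions, and the sketch enumerates every clipping that includes at least one word of context, so it produces the five clippings listed above plus a few more.

```python
from difflib import SequenceMatcher


def clippings(s, s_new, max_len=4):
    """Return ((start, end), old, new) for each changed span plus context.

    s and s_new are lists of words; start and end are 1-based, inclusive
    word positions in s. max_len is a hypothetical cap on clipping length.
    """
    ops = SequenceMatcher(a=s, b=s_new, autojunk=False).get_opcodes()
    result = []
    for n, (tag, i1, i2, j1, j2) in enumerate(ops):
        if tag == "equal":
            continue
        # Context is taken only from the unchanged blocks next to the change.
        left_ctx = right_ctx = 0
        if n > 0 and ops[n - 1][0] == "equal":
            left_ctx = ops[n - 1][2] - ops[n - 1][1]
        if n + 1 < len(ops) and ops[n + 1][0] == "equal":
            right_ctx = ops[n + 1][2] - ops[n + 1][1]
        for k in range(left_ctx + 1):
            for m in range(right_ctx + 1):
                if k + m == 0:
                    continue  # require at least one word of context
                old = s[i1 - k:i2 + m]
                new = s_new[j1 - k:j2 + m]
                if len(old) <= max_len and len(new) <= max_len:
                    result.append(((i1 - k + 1, i2 + m),
                                   " ".join(old), " ".join(new)))
    return result


s = "Connect the scanner to the computer".split()
s_new = "Connect the printer to the computer".split()
clips = clippings(s, s_new)
for span, old, new in clips:
    print(span, old, "->", new)
```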

Decide which parts of t correspond to changed parts in s

s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”

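Translate the s clippings and match them in t. Each pair [i,j] gives the positions of the first and last word of a clipping in its sentence; in t, the elided article “l’” counts as a separate word.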

(The translations given here are from Google)

“the scanner” [2,3] → “l’escàner” [2,3]
“the scanner to” [2,4] → “l’escàner a” [2,4]
“scanner to” [3,4] → “escàner a” [3,4]
“connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
“scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]

All match! (This may not always be the case).
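Matching a translated clipping in t amounts to searching for it as a contiguous run of words. A minimal sketch, assuming a tokenizer that splits elided articles such as “l’” into their own token (which is what the positions above imply) and a case-insensitive comparison; tokenize and find_span are hypothetical names:

```python
import re


def tokenize(text):
    """Split into words, keeping an elided form such as "l’" as its own token."""
    return re.findall(r"\w+[’']|\w+", text)


def find_span(phrase, sentence):
    """Return the 1-based (first, last) word positions of phrase in sentence,
    or None if the phrase does not occur as a contiguous word sequence."""
    p = [w.lower() for w in tokenize(phrase)]
    t = [w.lower() for w in tokenize(sentence)]
    if not p:
        return None
    for start in range(len(t) - len(p) + 1):
        if t[start:start + len(p)] == p:
            return (start + 1, start + len(p))
    return None


t = "Connecteu l’escàner a l’ordinador"
print(find_span("l’escàner a", t))    # (2, 4)
print(find_span("la impressora", t))  # None
```

A clipping whose translation returns None here has no match in t and cannot produce a repair operator.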

Translate what changed from s to s’

Translate the s’ clippings that differ from the s clippings.

“the printer” → “la impressora”
“the printer to” → “la impressora a”
“printer to” → “impressora per”
“connect the printer” → “connecteu la impressora”
“printer to the computer” → “impressora a l’ordinador”

Match translations of s’ clippings to translations of s clippings to build “repair operators”.

“l’escàner” [2,3] → “la impressora”
“l’escàner a” [2,4] → “la impressora a”
“escàner a” [3,4] → “impressora per”
“connecteu l’escàner” [1,3] → “connecteu la impressora”
“escàner a l’ordinador” [3,6] → “impressora a l’ordinador”

Overlap means the words that appear unchanged on both sides of an operator, such as “connecteu” or “a l’ordinador”. Overlap is desirable.

Change the corresponding parts in t to obtain one or more approximate t’

t = “Connecteu l’escàner a l’ordinador”

t’(a) = “Connecteu la impressora a l’ordinador”
t’(b) = “Connecteu la impressora a l’ordinador”
t’(c) = “Connecteu l’impressora per l’ordinador” (not correct)
t’(d) = “Connecteu la impressora a l’ordinador”
t’(e) = “Connecteu l’impressora a l’ordinador” (not correct)
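Applying a repair operator means replacing the words of t at the operator's positions with the new translation. The sketch below assumes the same tokenization as the positions above (“l’” is its own word) and rejoins elided articles without a space; apply_operator and the operators table are hypothetical names built from the five operators listed earlier.

```python
import re


def tokenize(text):
    """Split into words, keeping an elided form such as "l’" as its own token."""
    return re.findall(r"\w+[’']|\w+", text)


def detokenize(tokens):
    """Join tokens with spaces, except directly after an apostrophe."""
    out = ""
    for tok in tokens:
        if out and not out.endswith(("’", "'")):
            out += " "
        out += tok
    return out


def apply_operator(t, span, replacement):
    """Replace words span=(first, last), 1-based inclusive, of t."""
    tokens = tokenize(t)
    first, last = span
    patched = tokens[:first - 1] + tokenize(replacement) + tokens[last:]
    return detokenize(patched)


t = "Connecteu l’escàner a l’ordinador"
operators = {
    "a": ((2, 3), "la impressora"),
    "b": ((2, 4), "la impressora a"),
    "c": ((3, 4), "impressora per"),
    "d": ((1, 3), "connecteu la impressora"),
    "e": ((3, 6), "impressora a l’ordinador"),
}
repairs = {name: apply_operator(t, span, new)
           for name, (span, new) in operators.items()}
for name, repaired in repairs.items():
    print(f"t’({name}) = {repaired}")
```

The sketch does not restore capitalisation, so the result for operator (d) starts with a lowercase “connecteu”. Operators (c) and (e) start after the article “l’” and therefore leave it in place, which is why they yield the incorrect “l’impressora”.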

Which are the best repairs?

Probably the best repairs would come from:

the longest possible repair operators (even longer than in the example above)
those having most overlap (or context)
those having overlaps on both sides
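These criteria could be turned into a sort key. The sketch below is one possible reading, not a method given on this page: overlap is measured as the number of words shared at the left and right edges of an operator, and the order of the criteria (two-sided overlap first, then total overlap, then length) is an assumption.

```python
import re


def tokenize(text):
    """Split into words, keeping an elided form such as "l’" as its own token."""
    return re.findall(r"\w+[’']|\w+", text.lower())


def overlap(old, new):
    """Count words shared at the left edge and at the right edge."""
    limit = min(len(old), len(new))
    left = 0
    while left < limit and old[left] == new[left]:
        left += 1
    right = 0
    while right < limit - left and old[-1 - right] == new[-1 - right]:
        right += 1
    return left, right


def rank_key(operator):
    """Assumed priority: overlap on both sides, total overlap, length."""
    _, old_text, new_text = operator
    old, new = tokenize(old_text), tokenize(new_text)
    left, right = overlap(old, new)
    return (left > 0 and right > 0, left + right, len(old))


operators = [
    ("a", "l’escàner", "la impressora"),
    ("b", "l’escàner a", "la impressora a"),
    ("c", "escàner a", "impressora per"),
    ("d", "connecteu l’escàner", "connecteu la impressora"),
    ("e", "escàner a l’ordinador", "impressora a l’ordinador"),
]
ranked = sorted(operators, key=rank_key, reverse=True)
print([name for name, _, _ in ranked])
```

Under this key, operator (e) ranks first because it has the most overlap, yet it produces an incorrect t’ in the example above. No operator in the example has overlap on both sides, so these heuristics should be treated as a rough guide.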

When should repairs be used?

Only when the fuzzy match score is high. The minimum score could be a parameter passed when calling this functionality.

Coding challenge

Write (in a scripting language of your choice) a command-line program that takes an Apertium language pair, a source-language sentence S, and a target-language sentence T. It should output the set of all pairs of subsegments (s,t) such that s is a subsegment of S, t is a subsegment of T, and t is the Apertium translation of s or vice versa. A subsegment is a sequence of whole words.
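As a starting point, the core enumeration might look like the sketch below. It assumes whitespace tokenization, case-insensitive comparison, and translation functions passed in as parameters; all names are hypothetical. The apertium_translator helper assumes the apertium command is installed and reads text from standard input, and it is not exercised by the demo, which uses a small stub dictionary instead. Command-line argument handling is left out.

```python
import subprocess


def subsegments(words):
    """Yield every contiguous sequence of whole words as a string."""
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            yield " ".join(words[i:j])


def matching_pairs(S, T, translate_st, translate_ts):
    """Return all (s, t) where t translates s or s translates t."""
    s_subs = set(subsegments(S.split()))
    t_subs = set(subsegments(T.split()))
    # Translate each subsegment once, then compare case-insensitively.
    fwd = {s: translate_st(s).lower() for s in s_subs}
    bwd = {t: translate_ts(t).lower() for t in t_subs}
    return {(s, t) for s in s_subs for t in t_subs
            if fwd[s] == t.lower() or bwd[t] == s.lower()}


def apertium_translator(pair):
    """Assumed wrapper: run `apertium <pair>` with the text on stdin."""
    def translate(text):
        done = subprocess.run(["apertium", pair], input=text,
                              capture_output=True, text=True, check=True)
        return done.stdout.strip()
    return translate


# Demo with a stub dictionary instead of a real Apertium installation.
stub = {"the scanner": "l’escàner", "to": "a", "the computer": "l’ordinador"}
pairs = matching_pairs(
    "Connect the scanner to the computer",
    "Connecteu l’escàner a l’ordinador",
    translate_st=lambda s: stub.get(s.lower(), ""),
    translate_ts=lambda t: "",
)
for s, t in sorted(pairs):
    print(f"{s}\t{t}")
```

With a real installation, the two translators would be something like apertium_translator("en-ca") and apertium_translator("ca-en"), provided both directions are available for the chosen pair.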

More info

Ask User:mlforcada for more information


Difference between revisions of "Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair"
