Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair
Revision as of 14:20, 14 February 2014
Fuzzy matching in translation memories
Imagine that the new sentence is:
s' = “Connect the printer to the computer”
And we find a fuzzy match (score 83%):
s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”
What would be t', the translation of s'?
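The 83% score above is consistent with a word-level edit distance: one word out of six differs. The following is a minimal sketch of such a score, assuming whitespace tokenisation and case-insensitive comparison; translation-memory tools define their fuzzy-match scores in different ways, so the function names and the exact formula are illustrative assumptions.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(s_new, s_tm):
    """Assumed score: 1 - (word edit distance / length of the longer sentence)."""
    a, b = s_new.lower().split(), s_tm.lower().split()
    if not a and not b:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b))


score = fuzzy_match_score("Connect the printer to the computer",
                          "Connect the scanner to the computer")
print(f"{score:.0%}")  # 83%
```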
Obtain t’ by “patching” or “repairing” t
A way to repair would be:
- determine what changed from s to s’
- decide which parts of t correspond to changed parts in s
- translate what changed from s to s’
- change the corresponding parts in t to obtain one or more approximate t’
Determine what changed from s to s’
So first we align:
s’ = “Connect the printer to the computer”
s = “Connect the scanner to the computer”
and find the change: “scanner” → “printer”.
Then we cut “clippings” around the change:
- “the scanner” → “the printer”
- “the scanner to” → “the printer to”
- “scanner to” → “printer to”
- “connect the scanner” → “connect the printer”
- “scanner to the computer” → “printer to the computer”
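The alignment and clipping step could be sketched as follows. This is only one possible approach: it uses Python's standard difflib to find the changed words, then grows a window of unchanged context words around them. The function name clippings and the max_len limit are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def clippings(s, s_new, max_len=5):
    """Pair each sub-segment of s that covers the change with its s' counterpart.

    Every clipping keeps at least one unchanged context word and has at most
    max_len words on the s side (max_len is an illustrative parameter).
    """
    a, b = s.split(), s_new.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not ops:
        return []  # sentences are identical: nothing to repair
    # Smallest window covering every change, in s (i1:i2) and in s' (j1:j2).
    i1, i2 = ops[0][1], ops[-1][2]
    j1, j2 = ops[0][3], ops[-1][4]
    pairs = []
    for left in range(i1 + 1):                # context words to the left
        for right in range(len(a) - i2 + 1):  # context words to the right
            if left + right == 0:
                continue                      # require some context
            if (i2 + right) - (i1 - left) > max_len:
                continue
            old = " ".join(a[i1 - left:i2 + right])
            new = " ".join(b[j1 - left:j2 + right])
            pairs.append((old, new))
    return pairs


for old, new in clippings("Connect the scanner to the computer",
                          "Connect the printer to the computer"):
    print(f"{old} -> {new}")
```

This enumerates every window up to the length limit, so the five clippings listed above appear among its output together with a few others.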
Decide which parts of t correspond to changed parts in s
s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”
Translate the s clippings and match them in t.
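(The translations given here are from Google. Bracketed numbers are word positions, counting an elided article such as “l’” as a word of its own.)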
- “the scanner” [2,3] → “l’escàner” [2,3]
- “the scanner to” [2,4] → “l’escàner a” [2,4]
- “scanner to” [3,4] → “escàner a” [3,4]
- “connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
- “scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]
All match! (This may not always be the case).
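Locating a translated clipping in t amounts to searching for one word sequence inside another. The sketch below returns 1-based, inclusive positions like those in the list above. It assumes a simple tokeniser that splits an elided article (“l’”) from the following word; tokenize and find_spans are hypothetical helper names.

```python
import re


def tokenize(text):
    """Lower-cased words; an elided form such as “l’” becomes its own token."""
    return re.findall(r"\w+[’']|\w+", text.lower())


def find_spans(sentence, fragment):
    """All [first, last] word positions (1-based, inclusive) of fragment in sentence."""
    words, frag = tokenize(sentence), tokenize(fragment)
    n = len(frag)
    if n == 0:
        return []
    return [(i + 1, i + n)
            for i in range(len(words) - n + 1)
            if words[i:i + n] == frag]


t = "Connecteu l’escàner a l’ordinador"
print(find_spans(t, "l’escàner a"))            # [(2, 4)]
print(find_spans(t, "escàner a l’ordinador"))  # [(3, 6)]
print(find_spans(t, "la impressora"))          # [] (no match in t)
```

An empty result corresponds to the case mentioned above in which a translated clipping does not match in t.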
Translate what changed from s to s’
Translate the s’ clippings that differ from those of s.
- “the printer” → “la impressora”
- “the printer to” → “la impressora a”
- “printer to” → “impressora per”
- “connect the printer” → “connecteu la impressora”
- “printer to the computer” → “impressora a l’ordinador”
Match translations of s’ clippings to translations of s clippings to build “repair operators”.
- “l’escàner” [2,3] → “la impressora”
- “l’escàner a” [2,4] → “la impressora a”
- “escàner a” [3,4] → “impressora per”
- “connecteu l’escàner” [1,3] → “connecteu la impressora”
- “escàner a l’ordinador” [3,6] → “impressora a l’ordinador”
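Building the repair operators combines the two previous steps: translate both sides of each clipping, and keep the pair only if the translation of the old clipping is found in t. In the sketch below the translation system is replaced by a small lookup table (MT, a stand-in filled with the translations quoted above); build_operators and tokenize are hypothetical names.

```python
import re

# Stand-in for a machine translation system; entries copied from the example.
MT = {
    "the scanner": "l’escàner",
    "the printer": "la impressora",
    "the scanner to": "l’escàner a",
    "the printer to": "la impressora a",
    "scanner to": "escàner a",
    "printer to": "impressora per",
    "connect the scanner": "connecteu l’escàner",
    "connect the printer": "connecteu la impressora",
    "scanner to the computer": "escàner a l’ordinador",
    "printer to the computer": "impressora a l’ordinador",
}

PAIRS = [
    ("the scanner", "the printer"),
    ("the scanner to", "the printer to"),
    ("scanner to", "printer to"),
    ("connect the scanner", "connect the printer"),
    ("scanner to the computer", "printer to the computer"),
]


def tokenize(text):
    """Lower-cased words; an elided form such as “l’” becomes its own token."""
    return re.findall(r"\w+[’']|\w+", text.lower())


def build_operators(t, clipping_pairs, translate):
    """Return ((first, last), old_target, new_target) for each clipping found in t."""
    t_words = tokenize(t)
    operators = []
    for old_src, new_src in clipping_pairs:
        old_tgt, new_tgt = translate(old_src), translate(new_src)
        if old_tgt is None or new_tgt is None:
            continue  # no translation available for one side
        frag = tokenize(old_tgt)
        if not frag:
            continue
        for i in range(len(t_words) - len(frag) + 1):
            if t_words[i:i + len(frag)] == frag:
                operators.append(((i + 1, i + len(frag)), old_tgt, new_tgt))
    return operators


for span, old, new in build_operators("Connecteu l’escàner a l’ordinador", PAIRS, MT.get):
    print(span, old, "->", new)
```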
Overlap (emphasized in the original page) is desirable: it gives the repair operator context.
Change the corresponding parts in t to obtain one or more approximate t’
t = “Connecteu l’escàner a l’ordinador”
- t’(a) = “Connecteu la impressora a l’ordinador”
- t’(b) = “Connecteu la impressora a l’ordinador”
- t’(c) = “Connecteu l’impressora per l’ordinador” (not correct)
- t’(d) = “Connecteu la impressora a l’ordinador”
- t’(e) = “Connecteu l’impressora a l’ordinador” (not correct)
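Applying a repair operator means replacing the words of t at the operator's positions with the new translation. The sketch below reproduces the five candidates listed above, including the two incorrect ones. It assumes the same simple tokeniser as before (which drops punctuation, a limitation of this sketch); apply_operator and detokenize are hypothetical names.

```python
import re


def tokenize(text):
    """Words with case preserved; an elided form such as “l’” is its own token."""
    return re.findall(r"\w+[’']|\w+", text)


def detokenize(words):
    """Join words with spaces, re-attaching elided forms to the next word."""
    return " ".join(words).replace("’ ", "’").replace("' ", "'")


def apply_operator(t, span, replacement):
    """Replace words first..last of t (1-based, inclusive) by the replacement."""
    words = tokenize(t)
    first, last = span
    patched = words[:first - 1] + tokenize(replacement) + words[last:]
    out = detokenize(patched)
    if t[:1].isupper():  # keep sentence-initial capitalisation
        out = out[:1].upper() + out[1:]
    return out


t = "Connecteu l’escàner a l’ordinador"
operators = {
    "a": ((2, 3), "la impressora"),
    "b": ((2, 4), "la impressora a"),
    "c": ((3, 4), "impressora per"),
    "d": ((1, 3), "connecteu la impressora"),
    "e": ((3, 6), "impressora a l’ordinador"),
}
for label, (span, new) in operators.items():
    print(f"t’({label}) = {apply_operator(t, span, new)}")
```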
Which are the best repairs?
Probably the best repairs would come from:
- the longest possible repair operators (even longer than in the example above)
- those having the most overlap (or context)
- those having overlaps on both sides
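These preferences could be turned into a sort key. The sketch below is one arbitrary way to do so, not a method prescribed here: it counts the unchanged context words on each side of the change in the source clipping, prefers two-sided context first, then more context, then longer operators. The Candidate fields and the priority order are assumptions.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    label: str
    left_context: int   # unchanged words kept to the left of the change
    right_context: int  # unchanged words kept to the right of the change
    changed: int = 1    # number of changed words covered by the operator


def rank_key(c):
    """Assumed priority: context on both sides, then total context, then length."""
    both_sides = c.left_context > 0 and c.right_context > 0
    overlap = c.left_context + c.right_context
    return (both_sides, overlap, overlap + c.changed)


candidates = [
    Candidate("a", 1, 0),  # “the scanner”
    Candidate("b", 1, 1),  # “the scanner to”
    Candidate("c", 0, 1),  # “scanner to”
    Candidate("d", 2, 0),  # “connect the scanner”
    Candidate("e", 0, 3),  # “scanner to the computer”
]
ranked = sorted(candidates, key=rank_key, reverse=True)
print([c.label for c in ranked])  # ['b', 'e', 'd', 'a', 'c']
```

On the running example this ordering puts t’(b) first but t’(e), marked not correct above, second, so a score of this kind can only serve as a heuristic.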
When should repairs be used?
Only when the fuzzy-match score is high. The minimum score could be a parameter when calling this functionality.
Coding challenge
Write (in a scripting language of your choice) a command-line program that takes an Apertium language pair, a source-language sentence S, and a target-language sentence T, and outputs the set of all pairs of subsegments (s,t) such that s is a subsegment of S, t is a subsegment of T, and t is the Apertium translation of s or vice versa. A subsegment is a sequence of whole words.
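As a starting point, the core of such a program might look like the sketch below. It is deliberately incomplete: the translator is passed in as a callable (here a toy dictionary), whereas a real solution would wrap a call to Apertium for the given language pair and add command-line argument handling. The names subsegments and subsegment_pairs, whitespace tokenisation and case-insensitive comparison are all assumptions.

```python
def subsegments(sentence):
    """All contiguous sequences of whole (whitespace-separated) words."""
    words = sentence.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}


def subsegment_pairs(S, T, translate, translate_back=None):
    """Pairs (s, t) where t is the translation of s, or s the translation of t.

    translate and translate_back are callables returning a string or None;
    in a real program they would call Apertium in each direction.
    """
    src = {seg.lower(): seg for seg in subsegments(S)}
    tgt = {seg.lower(): seg for seg in subsegments(T)}
    pairs = set()
    for s in src.values():
        out = translate(s)
        if out is not None and out.lower() in tgt:
            pairs.add((s, tgt[out.lower()]))
    if translate_back is not None:
        for t in tgt.values():
            out = translate_back(t)
            if out is not None and out.lower() in src:
                pairs.add((src[out.lower()], t))
    return pairs


# Toy translator standing in for Apertium (hypothetical data).
toy = {"the scanner": "l’escàner", "scanner to": "escàner a"}
pairs = subsegment_pairs("Connect the scanner to the computer",
                         "Connecteu l’escàner a l’ordinador",
                         lambda s: toy.get(s.lower()))
print(pairs)  # {('the scanner', 'l’escàner')}
```

With whitespace tokenisation “l’escàner” is a single word, so “escàner a” is not a subsegment of T and the second toy entry yields no pair.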
More info
Ask User:mlforcada for more information