
Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair

Coding challenge

Write, in a scripting language of your choice, a command-line program that takes an Apertium language pair, a source-language sentence S, and a target-language sentence T, and outputs the set of pairs of subsegments (s,t) such that s is a subsegment of S, t is a subsegment of T, and t is the Apertium translation of s or vice versa. A subsegment is a sequence of whole words.
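
One possible starting point for the challenge is sketched below in Python. It is not a reference solution. The helper names (subsegments, apertium_translate, matching_pairs) are made up for this sketch, and it assumes that the apertium command is installed and reads the text to translate from standard input. Matching is done on lower-cased, whitespace-split words, which is a simplification. The demonstration at the end uses a small hand-written table instead of Apertium so that it runs anywhere.

```python
import subprocess


def subsegments(sentence):
    """Return every contiguous sequence of whole words in the sentence."""
    words = sentence.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def apertium_translate(pair, text):
    """Translate text by piping it through the apertium command.

    Assumption: apertium is installed and `pair` (for example "en-ca")
    names an installed translation direction.
    """
    result = subprocess.run(
        ["apertium", pair],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def matching_pairs(S, T, translate_st, translate_ts):
    """Return the (s, t) pairs where t translates s, or s translates t."""
    def norm(x):
        return x.lower().strip()

    src_by_norm = {norm(s): s for s in subsegments(S)}
    tgt_by_norm = {norm(t): t for t in subsegments(T)}
    pairs = set()
    for s in src_by_norm.values():
        t = tgt_by_norm.get(norm(translate_st(s)))
        if t is not None:
            pairs.add((s, t))
    for t in tgt_by_norm.values():
        s = src_by_norm.get(norm(translate_ts(t)))
        if s is not None:
            pairs.add((s, t))
    return pairs


# Stand-in translator (hypothetical table, NOT Apertium output).
FAKE_EN_CA = {
    "the scanner": "l’escàner",
    "to": "a",
    "the computer": "l’ordinador",
}
demo_pairs = matching_pairs(
    "Connect the scanner to the computer",
    "Connecteu l’escàner a l’ordinador",
    lambda s: FAKE_EN_CA.get(s.lower(), ""),
    lambda t: "",
)
for s, t in sorted(demo_pairs):
    print(f"{s!r} <-> {t!r}")
```

To use Apertium itself, the two callables could be replaced by something like lambda s: apertium_translate("en-ca", s) and lambda t: apertium_translate("ca-en", t); the pair names here are assumptions and depend on what is installed.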
 
 
More info

Ask User:mlforcada for more information.
 

Revision as of 13:50, 14 February 2014


Fuzzy matching in translation memories

Imagine that the new sentence is:

s' = “Connect the printer to the computer”

And we find a fuzzy match (score 83%):

s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”
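
The page does not say how the 83% score is computed. One common definition, assumed here, is 1 minus the word-level edit distance divided by the length of the longer sentence: one substituted word out of six gives 1 - 1/6, about 83%. The function names below are illustrative only.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def fuzzy_match_score(s_new, s_tm):
    """Assumed definition: 1 - distance / length of the longer sentence."""
    a, b = s_new.split(), s_tm.split()
    if not a and not b:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b))


score = fuzzy_match_score("Connect the printer to the computer",
                          "Connect the scanner to the computer")
print(f"{score:.0%}")  # prints 83%
```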

What would be t', the translation of s'?

Obtain t’ by “patching” or “repairing” t

A way to repair would be:

  • determine what changed from s to s’
  • decide which parts of t correspond to changed parts in s
  • translate what changed from s to s’
  • change the corresponding parts in t to obtain one or more approximate t’

Determine what changed from s to s’

So first we align:

s’ = “Connect the printer to the computer”
s = “Connect the scanner to the computer”

and find the change: word 3, “scanner” in s, is “printer” in s’.

Then we cut “clippings” around the changes:

  • “the scanner” → “the printer”
  • “the scanner to” → “the printer to”
  • “scanner to” → “printer to”
  • “connect the scanner” → “connect the printer”
  • “scanner to the computer” → “printer to the computer”
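
The alignment and clipping step can be sketched in Python as follows. This is only an illustration: it uses difflib.SequenceMatcher from the standard library as the aligner, and the function name clippings and the max_context parameter are assumptions of the sketch. It enumerates every window that covers the change and carries at least one unchanged context word, so it returns more clippings than the five listed above.

```python
from difflib import SequenceMatcher


def clippings(s_old, s_new, max_context=3):
    """Return (old_clipping, new_clipping, (start, end)) triples.

    start and end are 1-based, inclusive word positions in s_old.
    """
    old, new = s_old.split(), s_new.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        for left in range(max_context + 1):
            for right in range(max_context + 1):
                if left == 0 and right == 0:
                    continue  # keep only clippings that carry some context
                if left > min(i1, j1):
                    continue  # window would start before a sentence
                if right > min(len(old) - i2, len(new) - j2):
                    continue  # window would run past a sentence end
                # The context must be identical in both sentences.
                if old[i1 - left:i1] != new[j1 - left:j1]:
                    continue
                if old[i2:i2 + right] != new[j2:j2 + right]:
                    continue
                result.append((
                    " ".join(old[i1 - left:i2 + right]),
                    " ".join(new[j1 - left:j2 + right]),
                    (i1 - left + 1, i2 + right),
                ))
    return result


found = clippings("Connect the scanner to the computer",
                  "Connect the printer to the computer")
for old_clip, new_clip, span in found:
    print(span, old_clip, "->", new_clip)
```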

Decide which parts of t correspond to changed parts in s

s = “Connect the scanner to the computer”
t = “Connecteu l’escàner a l’ordinador”

Translate the s clippings and match them in t.

(The translations given here are from Google)

  • “the scanner” [2,3] → “l’escàner” [2,3]
  • “the scanner to” [2,4] → “l’escàner a” [2,4]
  • “scanner to” [3,4] → “escàner a” [3,4]
  • “connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
  • “scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]

All match! (This may not always be the case).
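
A minimal Python sketch of the matching step is given below. It assumes that an elided article such as “l’” counts as a word of its own, which is what makes the target positions listed above come out as shown; the names tokenize and find_span are illustrative. The translations are copied from the list above rather than produced by a translation engine.

```python
import re


def tokenize(text):
    """Lower-case and split into words; an elided article such as l’ is one token."""
    return re.findall(r"\w+[’']|\w+", text.lower())


def find_span(sentence, fragment):
    """Return the 1-based inclusive (start, end) of fragment in sentence, or None."""
    tokens, frag = tokenize(sentence), tokenize(fragment)
    if not frag:
        return None
    for start in range(len(tokens) - len(frag) + 1):
        if tokens[start:start + len(frag)] == frag:
            return (start + 1, start + len(frag))
    return None


t = "Connecteu l’escàner a l’ordinador"
translated_clippings = [
    "l’escàner",
    "l’escàner a",
    "escàner a",
    "connecteu l’escàner",
    "escàner a l’ordinador",
]
spans = {clip: find_span(t, clip) for clip in translated_clippings}
for clip, span in spans.items():
    print(clip, span)
```

A clipping whose translation does not occur in t yields None and would simply produce no repair operator.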

Translate what changed from s to s’

Translate the s’ clippings, that is, the ones that differ from the s clippings.

  • “the printer”→ “la impressora”
  • “the printer to” → “la impressora a”
  • “printer to” → “impressora per”
  • “connect the printer” → “connecteu la impressora”
  • “printer to the computer” → “impressora a l’ordinador”

Match translations of s’ clippings to translations of s clippings to build “repair operators”.

  • “l’escàner” [2,3] → “la impressora”
  • “l’escàner a” [2,4] → “la impressora a”
  • “escàner a” [3,4] → “impressora per”
  • “connecteu l’escàner” [1,3] → “connecteu la impressora”
  • “escàner a l’ordinador” [3,6] → “impressora a l’ordinador”

Overlap, meaning the unchanged context words kept around the change (such as “a” in “l’escàner a” → “la impressora a”), is desirable.

Change the corresponding parts in t to obtain one or more approximate t’

t = “Connecteu l’escàner a l’ordinador”

  • t’(a) = “Connecteu la impressora a l’ordinador”
  • t’(b) = “Connecteu la impressora a l’ordinador”
  • t’(c) = “Connecteu l’impressora per l’ordinador”
  • t’(d) = “Connecteu la impressora a l’ordinador”
  • t’(e) = “Connecteu l’impressora a l’ordinador”
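
Applying the five repair operators to t can be sketched in Python as below. The sketch assumes the same word positions as above (an elided article counts as one word), ignores punctuation, and re-capitalizes the first letter of the result; apply_operator and the other names are illustrative.

```python
import re


def tokenize(text):
    """Split into words; an elided article such as l’ is its own token."""
    return re.findall(r"\w+[’']|\w+", text)


def detokenize(tokens):
    """Join tokens with spaces, except directly after an apostrophe."""
    out = ""
    for tok in tokens:
        if out and not out.endswith(("’", "'")):
            out += " "
        out += tok
    return out


def apply_operator(t, span, replacement):
    """Replace the 1-based inclusive word span of t with the replacement."""
    start, end = span
    tokens = tokenize(t)
    repaired = tokens[:start - 1] + tokenize(replacement) + tokens[end:]
    text = detokenize(repaired)
    return text[:1].upper() + text[1:]


t = "Connecteu l’escàner a l’ordinador"
operators = {
    "a": ((2, 3), "la impressora"),
    "b": ((2, 4), "la impressora a"),
    "c": ((3, 4), "impressora per"),
    "d": ((1, 3), "connecteu la impressora"),
    "e": ((3, 6), "impressora a l’ordinador"),
}
repairs = {name: apply_operator(t, span, new)
           for name, (span, new) in operators.items()}
for name, repaired in repairs.items():
    print(f"t’({name}) = {repaired}")
```

Operators (c) and (e) leave the original “l’” in place, which is why they produce “l’impressora”: the article lies outside the span they replace.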

Which are the best repairs?

Probably the best repairs would come from:

  • the longest possible repair operators (even longer than in the example above)
  • those having most overlap (or context)
  • those having overlaps on both sides
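
The three criteria above could be combined into a sort key, as in the Python sketch below. The priority order used here (context on both sides first, then amount of context, then length) is an assumption made for the sketch; the page does not rank the criteria. Context is counted in source-side words, and the Operator class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    old: str            # fragment of t to be replaced
    new: str            # replacement fragment
    left_context: int   # unchanged words kept to the left of the change
    right_context: int  # unchanged words kept to the right of the change

    @property
    def overlap(self):
        return self.left_context + self.right_context


def rank(operators):
    """Sort operators from most to least promising (assumed priority)."""
    def key(op):
        both_sides = op.left_context > 0 and op.right_context > 0
        return (both_sides, op.overlap, len(op.new.split()))
    return sorted(operators, key=key, reverse=True)


operators = [
    Operator("l’escàner", "la impressora", 1, 0),
    Operator("l’escàner a", "la impressora a", 1, 1),
    Operator("escàner a", "impressora per", 0, 1),
    Operator("connecteu l’escàner", "connecteu la impressora", 2, 0),
    Operator("escàner a l’ordinador", "impressora a l’ordinador", 0, 3),
]
ranked = rank(operators)
for op in ranked:
    print(op.old, "->", op.new)
```

With this key, the only operator that has context on both sides, “l’escàner a” → “la impressora a”, comes first.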

When should repairs be used?

Only for high fuzzy match scores. This could be a parameter when calling this functionality.
