
# Ideas for Google Summer of Code/Command-line translation memory fuzzy-match repair


This idea is connected to research being performed at the Universitat d'Alacant by John E. Ortega, Mikel L. Forcada and Felipe Sánchez Martínez.

GSoC 2014 project:

##  Fuzzy matching in translation memories

Imagine that the new sentence is:

s' = “Connect the printer to the computer”

And we find a fuzzy match (score 83%):

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”

What would be t', the translation of s'?

##  Obtain t’ by “patching” or “repairing” t

A way to repair would be:

• determine what changed from s to s’
• decide which parts of t correspond to changed parts in s
• translate what changed from s to s’
• change the corresponding parts in t to obtain one or more approximate t’

##  Determine what changed from s to s’

So first we align:

s’ = “Connect the printer to the computer”

s = “Connect the scanner to the computer”

and find the change: “scanner” in s becomes “printer” in s’; every other word is unchanged.

Then we cut “clippings” around the changes:

• “the scanner” → “the printer”
• “the scanner to” → “the printer to”
• “scanner to” → “printer to”
• “connect the scanner” → “connect the printer”
• “scanner to the computer” → “printer to the computer”
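The alignment and clipping step can be sketched in Python as follows. This is only an illustration under stated assumptions: the alignment uses the standard `difflib.SequenceMatcher` on whitespace-separated words, the function names and the `max_context` parameter are hypothetical, and the sketch enumerates every context combination up to `max_context` words, so it yields a superset of the five clippings listed above.

```python
from difflib import SequenceMatcher


def changed_span(old_words, new_words):
    """Smallest span covering every difference, as 0-based (i1, i2, j1, j2)."""
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    diffs = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not diffs:
        return None  # identical sentences: nothing to repair
    return diffs[0][1], diffs[-1][2], diffs[0][3], diffs[-1][4]


def clippings(s, s_new, max_context=3):
    """Pairs (clipping of s, clipping of s') that share the same context words."""
    old_words, new_words = s.split(), s_new.split()
    span = changed_span(old_words, new_words)
    if span is None:
        return []
    i1, i2, j1, j2 = span
    pairs = []
    for left in range(max_context + 1):
        for right in range(max_context + 1):
            if left == 0 and right == 0:
                continue  # assumption: every clipping carries some context
            if left > i1 or i2 + right > len(old_words):
                continue  # not enough words on that side
            old_clip = " ".join(old_words[i1 - left:i2 + right])
            new_clip = " ".join(new_words[j1 - left:j2 + right])
            pairs.append((old_clip, new_clip))
    return pairs


pairs = clippings("Connect the scanner to the computer",
                  "Connect the printer to the computer")
for old_clip, new_clip in pairs:
    print(f"“{old_clip}” → “{new_clip}”")
```

Extending the context symmetrically in s and s’ works because everything outside the changed span is identical in both sentences.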

##  Decide which parts of t correspond to changed parts in s

s = “Connect the scanner to the computer”

t = “Connecteu l’escàner a l’ordinador”

Translate the s clippings and match them in t.

(The translations given here are from Google)

• “the scanner” [2,3] → “l’escàner” [2,3]
• “the scanner to” [2,4] → “l’escàner a” [2,4]
• “scanner to” [3,4] → “escàner a” [3,4]
• “connect the scanner” [1,3] → “connecteu l’escàner” [1,3]
• “scanner to the computer” [3,6] → “escàner a l’ordinador” [3,6]

All match! (This may not always be the case).
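A minimal sketch of this matching step, assuming a `translate` callable that stands in for whatever MT system is used (here it is just a dictionary holding the translations quoted above). The tokenizer, which treats an elided article such as “l’” as its own word so that positions agree with the [start,end] numbering used here, and all function names are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split into words, keeping an elided article such as "l’" as its own token."""
    return re.findall(r"\w+[’']?", text)


def find_span(sentence, clipping):
    """1-based inclusive (start, end) of the clipping inside the sentence, or None."""
    words = [w.lower() for w in tokenize(sentence)]
    target = [w.lower() for w in tokenize(clipping)]
    if not target:
        return None
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return (i + 1, i + len(target))
    return None


def match_clippings(t, s_clippings, translate):
    """Map each s clipping to (translation, span in t); span is None if unmatched."""
    matches = {}
    for clip in s_clippings:
        translation = translate(clip)
        matches[clip] = (translation, find_span(t, translation))
    return matches


# Stand-in for a real MT call: the translations quoted in the text above.
EXAMPLE_TRANSLATIONS = {
    "the scanner": "l’escàner",
    "the scanner to": "l’escàner a",
    "scanner to": "escàner a",
    "connect the scanner": "connecteu l’escàner",
    "scanner to the computer": "escàner a l’ordinador",
}

t = "Connecteu l’escàner a l’ordinador"
matches = match_clippings(t, EXAMPLE_TRANSLATIONS, EXAMPLE_TRANSLATIONS.get)
for clip, (translation, span) in matches.items():
    print(f"“{clip}” → “{translation}” {span}")
```

A clipping whose translation does not occur in t gets a span of `None` and can simply be dropped from the later steps.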

##  Translate what changed from s to s’

Translate the s’ clippings that differ from the corresponding s clippings.

• “the printer” → “la impressora”
• “the printer to” → “la impressora a”
• “printer to” → “impressora per”
• “connect the printer” → “connecteu la impressora”
• “printer to the computer” → “impressora a l’ordinador”

Match translations of s’ clippings to translations of s clippings to build “repair operators”.

• “l’escàner” [2,3] → “la impressora”
• “l’escàner a” [2,4] → “la impressora a”
• “escàner a” [3,4] → “impressora per”
• “connecteu l’escàner” [1,3] → “connecteu la impressora”
• “escàner a l’ordinador” [3,6] → “impressora a l’ordinador”

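Words that appear on both sides of an operator, such as the “a” in “l’escàner a” → “la impressora a”, overlap with the unchanged part of t; such overlap is desirable.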
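One possible way to represent repair operators in code is sketched below. The `RepairOperator` type, the `build_operators` function and the shape of its inputs are hypothetical; the data are the clippings and translations from the example, with positions given as 1-based word indices in t.

```python
from typing import NamedTuple, Tuple


class RepairOperator(NamedTuple):
    span: Tuple[int, int]  # 1-based word positions in t to be replaced
    old: str               # translation of the s clipping, as found in t
    new: str               # translation of the corresponding s' clipping


def build_operators(clipping_pairs, located, translate_new):
    """clipping_pairs: [(s clip, s' clip)]; located: s clip -> (translation, span or None)."""
    operators = []
    for s_clip, s_new_clip in clipping_pairs:
        old, span = located[s_clip]
        if span is None:
            continue  # the translation of the s clipping was not found in t
        new = translate_new(s_new_clip)
        if new and new != old:
            operators.append(RepairOperator(span, old, new))
    return operators


clipping_pairs = [
    ("the scanner", "the printer"),
    ("the scanner to", "the printer to"),
    ("scanner to", "printer to"),
    ("connect the scanner", "connect the printer"),
    ("scanner to the computer", "printer to the computer"),
]
located = {
    "the scanner": ("l’escàner", (2, 3)),
    "the scanner to": ("l’escàner a", (2, 4)),
    "scanner to": ("escàner a", (3, 4)),
    "connect the scanner": ("connecteu l’escàner", (1, 3)),
    "scanner to the computer": ("escàner a l’ordinador", (3, 6)),
}
# Stand-in for a real MT call on the s' clippings.
new_translations = {
    "the printer": "la impressora",
    "the printer to": "la impressora a",
    "printer to": "impressora per",
    "connect the printer": "connecteu la impressora",
    "printer to the computer": "impressora a l’ordinador",
}

operators = build_operators(clipping_pairs, located, new_translations.get)
for op in operators:
    print(f"“{op.old}” {list(op.span)} → “{op.new}”")
```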

##  Change the corresponding parts in t to obtain one or more approximate t’

t = “Connecteu l’escàner a l’ordinador”

• t’(a) = “Connecteu la impressora a l’ordinador”
• t’(b) = “Connecteu la impressora a l’ordinador”
• t’(c) = “Connecteu l’impressora per l’ordinador” (not correct)
• t’(d) = “Connecteu la impressora a l’ordinador”
• t’(e) = “Connecteu l’impressora a l’ordinador” (not correct)
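Applying the five operators to t can be sketched as follows. The tokenizer, the detokenizer (which glues an elided article back onto the following word) and the rule that keeps the capitalization of the first replaced word are simplifying assumptions for this example; punctuation is ignored.

```python
import re


def tokenize(text):
    """Split into words, keeping an elided article such as "l’" as its own token."""
    return re.findall(r"\w+[’']?", text)


def detokenize(tokens):
    """Join tokens with spaces, except after a token ending in an apostrophe."""
    text = ""
    for token in tokens:
        if text and not text.endswith(("’", "'")):
            text += " "
        text += token
    return text


def apply_operator(t, span, new):
    """Replace the words of t at 1-based positions span = (start, end) with new."""
    start, end = span
    tokens = tokenize(t)
    new_tokens = tokenize(new)
    # Keep the capitalization of the first word being replaced.
    if new_tokens and tokens[start - 1][:1].isupper():
        new_tokens[0] = new_tokens[0][:1].upper() + new_tokens[0][1:]
    return detokenize(tokens[:start - 1] + new_tokens + tokens[end:])


t = "Connecteu l’escàner a l’ordinador"
operators = [
    ((2, 3), "la impressora"),
    ((2, 4), "la impressora a"),
    ((3, 4), "impressora per"),
    ((1, 3), "connecteu la impressora"),
    ((3, 6), "impressora a l’ordinador"),
]
candidates = [apply_operator(t, span, new) for span, new in operators]
for label, candidate in zip("abcde", candidates):
    print(f"t’({label}) = “{candidate}”")
```

Three of the five operators produce the same string, so in practice the candidates would be deduplicated before being shown or scored.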

##  Which are the best repairs?

Probably the best repairs would come from:

• the longest possible repair operators (even longer than in the example above)
• those having most overlap (or context)
• those having overlaps on both sides
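These three criteria could be turned into a sort key, as in the hedged sketch below. Here overlap is taken to mean the words shared at the left and right edges of an operator's two sides, and the priority order (overlap on both sides first, then total overlap, then length) is an assumption, not something the text prescribes. The last operator in the list is hypothetical: it is longer than those in the example and was added only to show an operator with overlap on both sides.

```python
import re


def tokenize(text):
    """Lowercased words, with an elided article such as "l’" as its own token."""
    return [w.lower() for w in re.findall(r"\w+[’']?", text)]


def edge_overlap(old, new):
    """(left, right): number of words the two sides share at each edge."""
    a, b = tokenize(old), tokenize(new)
    limit = min(len(a), len(b))
    left = 0
    while left < limit and a[left] == b[left]:
        left += 1
    right = 0
    while right < limit - left and a[-1 - right] == b[-1 - right]:
        right += 1
    return left, right


def rank_key(operator):
    """Assumed priority: overlap on both sides, then total overlap, then length."""
    old, new = operator
    left, right = edge_overlap(old, new)
    return (left > 0 and right > 0, left + right, len(tokenize(old)))


def rank(operators):
    return sorted(operators, key=rank_key, reverse=True)


operators = [
    ("l’escàner", "la impressora"),
    ("l’escàner a", "la impressora a"),
    ("escàner a", "impressora per"),
    ("connecteu l’escàner", "connecteu la impressora"),
    ("escàner a l’ordinador", "impressora a l’ordinador"),
    # Hypothetical longer operator, not taken from the example above.
    ("connecteu l’escàner a", "connecteu la impressora a"),
]
ranked = rank(operators)
for old, new in ranked:
    print(edge_overlap(old, new), f"“{old}” → “{new}”")
```

None of the five operators from the example has overlap on both sides, which is in line with the suggestion that operators longer than those shown may be needed.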

##  When should repairs be used?

Only for high fuzzy match scores. This could be a parameter when calling this functionality.
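As an illustration of such a parameter, the sketch below computes a fuzzy match score from the word-level edit distance, normalized by the length of the longer sentence. This is one common definition and an assumption here, although it does give 83% for the example pair. The `min_score` parameter and its default of 0.8 are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_match_score(s, s_new):
    """1.0 for identical sentences, 0.0 when every word has to change."""
    a, b = s.split(), s_new.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def should_repair(s, s_new, min_score=0.8):
    """Attempt a repair only when the fuzzy match is good enough."""
    return fuzzy_match_score(s, s_new) >= min_score


score = fuzzy_match_score("Connect the scanner to the computer",
                          "Connect the printer to the computer")
print(f"{score:.0%}")
```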

#  Coding challenge

Write, in a scripting language of your choice, a command-line program that takes an Apertium language pair, a source-language sentence S and a target-language sentence T. It should output the set of all pairs of subsegments (s,t) such that s is a subsegment of S, t is a subsegment of T, and t is the Apertium translation of s or vice versa. A subsegment is a sequence of whole words.

The program should read S, T, and the language pair from the command line. No interaction please.

The program should be capable of dealing with Unicode UTF-8.

Sentences S and T should be delimited with quotes (").

The output pairs should have the format ("....",".....").

No other output should be produced.

Once you've got this working, it could be a nice idea to add an additional option "-r" to use the reverse language pair to discover segments too.
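One possible shape for such a program is sketched below in Python; it is a starting point under stated assumptions, not a reference solution. It assumes the `apertium` command is installed and translates text piped to its standard input when called as `apertium <pair>`, that the reverse pair is obtained by swapping the two codes around the hyphen, and that subsegments are compared case-insensitively. All function names are hypothetical.

```python
import argparse
import subprocess
import sys


def subsegments(sentence):
    """Every contiguous sequence of whole (whitespace-separated) words."""
    words = sentence.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


def apertium_translator(pair):
    """Translate by piping text to `apertium <pair>` (assumed to be installed)."""
    def translate(text):
        result = subprocess.run(["apertium", pair], input=text,
                                capture_output=True, encoding="utf-8", check=True)
        return result.stdout.strip()
    return translate


def reverse_pair(pair):
    """Assumption: pairs are named like "en-ca", with "ca-en" as the reverse."""
    return "-".join(reversed(pair.split("-")))


def index_by_lowercase(segments):
    index = {}
    for segment in segments:
        index.setdefault(segment.lower(), []).append(segment)
    return index


def matching_pairs(S, T, translate, reverse_translate=None):
    """Pairs (s, t) where t translates s, or s translates t with the reverse pair."""
    pairs = set()
    t_index = index_by_lowercase(subsegments(T))
    for s in subsegments(S):
        for t in t_index.get(translate(s).strip().lower(), []):
            pairs.add((s, t))
    if reverse_translate is not None:
        s_index = index_by_lowercase(subsegments(S))
        for t in subsegments(T):
            for s in s_index.get(reverse_translate(t).strip().lower(), []):
                pairs.add((s, t))
    return sorted(pairs)


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("pair")
    parser.add_argument("S")
    parser.add_argument("T")
    parser.add_argument("-r", action="store_true", help="also use the reverse pair")
    args = parser.parse_args(argv)
    reverse = apertium_translator(reverse_pair(args.pair)) if args.r else None
    for s, t in matching_pairs(args.S, args.T, apertium_translator(args.pair), reverse):
        print(f'("{s}","{t}")')


# With no arguments the script does nothing, so it can also be imported.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved under a name of your choice, it would be invoked with the pair followed by the two quoted sentences, for example `en-ca "Connect the scanner to the computer" "Connecteu l’escàner a l’ordinador"`. Starting one `apertium` process per subsegment is slow; batching the subsegments into a single call would be a natural improvement.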

#  More info

Ask User:mlforcada for more information.