User:Francis Tyers/MT

From Apertium
Jump to navigation Jump to search

Schubert, E., Schaffert, S., and Bry, F. (2005). "Structure-preserving difference search for xml documents". Proceedings of the Extreme Markup Languages 2005 Conference, 1-5 August 2005, Montréal, Quebec, Canada

Structure-Preserving Difference Search for XML Documents

The paper presents a strategy for measuring the difference between a pair of documents in XML. The authors report that this is an improvement over the more traditional strategies. The two traditional strategies tested against were longest common subsequence (LCS), as used by GNU diff -- which operates on the level of lines in a file, and shortest edit distance, as used by XMLDiff and similar programs -- which operates on nodes in a document tree. The authors state that the deficiency in these methods lies in the way that they do not represent changes as made by authors. Rather they try to make the "smallest possible" edit script or diff.

The authors present their method of "structure preserving difference", which instead of trying to find the smallest possible edit script, attempts to maximise the size of retained sub-structures maintained in changing one document to another. In order to calculate this, they model the document as a graph, where relations other than simple parent-child can be taken into account, for example ancestor-descendent, and sibling relationships.

Retained sub-structures are calculated as nodeset correspondences (set of mappings between equal nodes) between the original, and modified document. Two nodes are considered equal when they match a node similarity relation, which is user-specified. The most simple node similarity relation would be to consider two nodes equal when the labels are equal. Between two documents, there are many nodeset correspondences, so an algorithm needs to be used to find the optimal set of correspondences which maximises the retained sub-structures.

The algorithm used is Dijkstra's algorithm, where the edge weights are calculated by different cost functions. The authors outline four possible cost functions.


Review

Revision 1 

Reviewer Recommendation Term: 
Revisions
Comments to Editor: 
Review Sheet: General Judgement
============================================================

1. Is the paper acceptable for publication

(a)   in its present form?   
(b)   with minor revisions?   

Should the paper be reconsidered after major revision?


Is it unacceptable for publication?


2. Please list any other general comments or specific suggestions in the separate blind comments to author's box.

Comments to Author: 

Comments on Paper:



A list follows: