Difference between revisions of "User:Francis Tyers/MT"

Revision as of 14:44, 3 April 2008

Structure-Preserving Difference Search for XML Documents

The paper presents a strategy for measuring the difference between a pair of XML documents. The authors report that it improves on the more traditional strategies. The two traditional strategies tested against were longest common subsequence (LCS), as used by GNU diff, which operates on the level of lines in a file, and shortest edit distance, as used by XMLDiff and similar programs, which operates on nodes in a document tree. The authors state that the deficiency of these methods is that they do not represent changes as the document's authors made them; rather, they try to produce the "smallest possible" edit script or diff.
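To make the line-level behaviour concrete, here is a minimal sketch of an LCS computed over the lines of two small XML documents. It is a textbook dynamic-programming LCS written for illustration, not the algorithm GNU diff actually implements, and the function name `lcs_lines` and the sample documents are assumptions, not taken from the paper.

```python
def lcs_lines(a, b):
    """Return one longest common subsequence of two lists of lines."""
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner to recover one LCS.
    common, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical documents: the first paragraph is moved out of <sec>.
old = ["<doc>", "<sec>", "<p>one</p>", "<p>two</p>", "</sec>", "</doc>"]
new = ["<doc>", "<p>one</p>", "<sec>", "<p>two</p>", "</sec>", "</doc>"]

common = lcs_lines(old, new)
deleted = len(old) - len(common)   # lines a line-based diff would remove
inserted = len(new) - len(common)  # lines a line-based diff would add
print(common, deleted, inserted)
```

On this input the edit script is as small as it can be (one line removed, one line added), but it says nothing about a paragraph having left its section, which is the kind of authorial change the summary above says these methods fail to represent.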

The authors present their method of "structure-preserving difference", which, instead of trying to find the smallest possible edit script, attempts to maximise the size of the sub-structures maintained in changing one document into another. To calculate this, they model the document as a graph in which relations other than simple parent-child can be taken into account, for example ancestor-descendant and sibling relationships.
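The relations mentioned above can be illustrated with a short sketch. This is not the authors' algorithm: it only extracts parent-child, ancestor-descendant and sibling pairs from two XML trees and counts how many survive the change. The helper names `label` and `relations`, the choice to identify a node by its tag and text, and the sample documents are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def label(node):
    # Assumed node identity for this sketch: tag plus stripped text.
    return (node.tag, (node.text or "").strip())


def relations(root):
    """Collect three relation sets over the labelled nodes of a tree."""
    parent_child, ancestor_desc, siblings = set(), set(), set()

    def walk(node, ancestors):
        # Every node on the path from the root is an ancestor of this node.
        for ancestor in ancestors:
            ancestor_desc.add((label(ancestor), label(node)))
        if ancestors:
            parent_child.add((label(ancestors[-1]), label(node)))
        # Ordered sibling pairs among this node's children.
        kids = list(node)
        for i, left in enumerate(kids):
            for right in kids[i + 1:]:
                siblings.add((label(left), label(right)))
        for kid in kids:
            walk(kid, ancestors + [node])

    walk(root, [])
    return parent_child, ancestor_desc, siblings


# Hypothetical documents: the first paragraph is moved out of <sec>.
old = ET.fromstring("<doc><sec><p>one</p><p>two</p></sec></doc>")
new = ET.fromstring("<doc><p>one</p><sec><p>two</p></sec></doc>")

old_pc, old_ad, old_sib = relations(old)
new_pc, new_ad, new_sib = relations(new)

kept_parent_child = old_pc & new_pc
kept_ancestor_desc = old_ad & new_ad
kept_siblings = old_sib & new_sib
print(len(kept_parent_child), len(kept_ancestor_desc), len(kept_siblings))
```

In this example the moved paragraph loses its parent-child link to the section but is still a descendant of the document root, so a comparison that also looks at ancestor-descendant pairs sees more preserved structure than one restricted to parent-child pairs.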