Difference between revisions of "User:Francis Tyers/MT"
(16 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
Schubert, E., Schaffert, S., and Bry, F. (2005). "Structure-preserving difference search for xml documents". ''Proceedings of the Extreme Markup Languages 2005 Conference, 1-5 August 2005, Montréal, Quebec, Canada'' |
Schubert, E., Schaffert, S., and Bry, F. (2005). "Structure-preserving difference search for xml documents". ''Proceedings of the Extreme Markup Languages 2005 Conference, 1-5 August 2005, Montréal, Quebec, Canada'' |
||
==Summary== |
|||
;Structure-Preserving Difference Search for XML Documents |
;Structure-Preserving Difference Search for XML Documents |
||
Line 12: | Line 14: | ||
The authors present their method of "structure preserving difference", which instead of trying to find |
The authors present their method of "structure preserving difference", which instead of trying to find |
||
the smallest possible edit script, attempts to maximise the size of retained sub-structures |
the smallest possible edit script, attempts to maximise the size of retained sub-structures (that is |
||
changing one document to another. In order to calculate |
sub-trees of the XML document tree) maintained in changing one document to another. In order to calculate |
||
where relations other than simple parent-child can be taken into |
this, they model the document as a graph, where relations other than simple parent-child can be taken into |
||
and sibling relationships. |
account, for example ancestor-descendent, and sibling relationships. |
||
Retained sub-structures are calculated as nodeset correspondences (set of mappings between |
Retained sub-structures are calculated as nodeset correspondences (set of mappings between equivalent nodes) |
||
between the original, and modified document. Two nodes are considered equal when they match a node |
between the original, and modified document. Two nodes are considered equal when they match a node |
||
similarity relation, which is user-specified. The most simple node similarity relation would be to consider |
similarity relation, which is user-specified. The most simple node similarity relation would be to consider |
||
Line 23: | Line 25: | ||
so an algorithm needs to be used to find the optimal set of correspondences which maximises the retained |
so an algorithm needs to be used to find the optimal set of correspondences which maximises the retained |
||
sub-structures. |
sub-structures. |
||
The algorithm used is Dijkstra's algorithm, where the edge weights of the search graph are calculated by |
|||
different cost functions, where each node is a nodeset correspondence and each edge is a transition which |
|||
adds adds a new mapping to the nodeset correspondene. The authors outline four possible cost functions. The |
|||
first and most basic simply takes into account the number of sub-structures not retained in a mapping between |
|||
nodes. Others provide for credits for unavoidably dropped mappings and calculation of incomplete mappings. |
|||
The nodeset correspondence is considered maximal if there are no other nodeset correspondences which contain |
|||
mappings between more nodes. |
|||
The authors provide some details on experimental results in the form of comparative examples from both their |
|||
prototype and from Logilab xmldiff. They comment that presenting detailed results is difficult as it would |
|||
involve the inclusion of long XML documents, impractical in such a paper. They state that the output from their |
|||
prototype is more likely to reflect changes made by humans than xmldiff. |
|||
They conclude the paper by giving some prospects for further research and development, including methods of improving |
|||
the efficiency of the algorithm (search tree pruning and divide and conquer strategies) and taking advantage of |
|||
other available information (attributes and DTDs). |
|||
==Review== |
|||
<pre> |
|||
Revision 1 |
|||
"Structure-preserving difference search for XML documents" |
|||
Reviewer Recommendation Term: |
|||
Revisions |
|||
Comments to Editor: |
|||
Review Sheet: General Judgement |
|||
============================================================ |
|||
1. Is the paper acceptable for publication |
|||
(a) in its present form? NO |
|||
(b) with minor revisions? YES |
|||
Should the paper be reconsidered after major revision? |
|||
Is it unacceptable for publication? |
|||
2. Please list any other general comments or specific suggestions in the separate blind comments to author's box. |
|||
Comments to Author: |
|||
There are numerous minor errors in the English, some of which will be outlined below. The paper |
|||
would benefit from a thorough proof-reading. |
|||
Comments on Paper: |
|||
The introduction clearly lays out the problem. The state of the art section is well written, with a couple of |
|||
minor quibbles that should be taken care of. The explanation of the methodology is adequate. |
|||
The description of the data structures used (document graphs) and the quality measures (retained relations, node |
|||
similarity etc.) is good except the part describing the search graph, which could do with some clarification. Perhaps |
|||
it would be worth describing the "basic cost function" before describing the search graph and then outlining the remaining |
|||
cost functions afterwards. |
|||
In the results section I find that the paper focusses too much on the formats of the output rather than the actual |
|||
character of the output itself. |
|||
Regarding formatting, the style is less than optimal, equations are typeset too small, code fragments overlap the |
|||
margins of the page, the text is not justified, simple dashes are used where mdashes should be used. Sometimes |
|||
figures span over pages, which makes them hard to read. |
|||
A list follows: |
|||
* p.1 ... from _text oriented documents over languages aimed at exchanging messages between peers_ to ... |
|||
What the underlined section means is not clear. Perhaps a "from X to Y and Z" would be a better |
|||
structure for this section. |
|||
* p.1 ... is so-called difference search ... |
|||
is so-called → is the so-called |
|||
* p.1 ... must both be ... and be ... |
|||
→ must be both ... and ... |
|||
* p.2 The idea of "user-defined relations" could use some more explanation. |
|||
* p.2 ... allows to add extensions easily. |
|||
→ allows extensions to be easily added. |
|||
* p.2 It is stated that "three different cost functions" will be suggested, when in fact, there are four. |
|||
* p.2 The output described is not the default output of the diff program, which uses less-than and greater-than |
|||
symbol, but rather the 'unified diff', -u, output. |
|||
* p.3 In giving the deficiencies of 'diff', it is stated that whitespace is significant. This is not entirely |
|||
correct, the -w and -b options give the option of ignoring white space, although as the authors point out extra |
|||
blank lines are still significant. This should be clarified. |
|||
* p.3 It isn't clear in Figure 3 how the tags are mismatched. |
|||
* p.4 An example of _this kind of systems_ ... |
|||
→ these kinds of systems ... |
|||
* p.4 ... that allows to include ... |
|||
→ that allows the inclusion of ... |
|||
* p.5 What the "solution" may be is not defined and not immediately obvious. |
|||
* p.5 ... actual difference _are_ obviously ... |
|||
→ is |
|||
* p.6 Having (document graph of the) in parentheses twice in the same sentence is a bit difficult to read, |
|||
this could benefit from being re-worded. |
|||
* p.6 ... we _shall usually_ ... |
|||
→ will [typically] |
|||
* p.6 ... also allows _to model_ ... |
|||
→ "us to model" or "the modelling of" |
|||
* p.6 In the equation S1 ⊂ ~, '~' is undefined. The next page defines it, but this definition should be |
|||
given before the usage, or at least on the same page. |
|||
* p.7 The <em> and <i> tags are not as far as I know semantically equal. They belong to different groups of |
|||
of tags in XHTML (text and presentational respectively), and the semantics will vary depending on |
|||
style-sheet. Emphasising (<em>) something could mean putting it in bold face, it does not necessarily |
|||
mean italicising it. |
|||
* p.7 ... the node similarity relation can in most cases not be derived ... |
|||
→ ... in most cases, the node similarity relation cannot be derived ... |
|||
→ ... the node similarity relation can in most cases neither be derived from the data, nor ... |
|||
* p.7 Re-write the part in parentheses, "To a certain degree, X could be an interesting solution" |
|||
* p.7 Complexity of the task ... |
|||
→ The complexity of the task ... |
|||
* p.7 If every of these nodes is similar to every node in ... |
|||
Re-write this sentence, the confusion between "every", "all" and "each" makes it difficult to |
|||
understand. Check verb agreement "nodes is" → "nodes are". |
|||
* p.8 Comments about Dijkstra's algorithm, e.g. named, described, taught etc. are probably superfluous. |
|||
* p.9 It isn't entirely clear how the search graph is constructed, it would benefit from a diagram, such |
|||
as those describing "retained relations" and "maximal nodeset correspondences". |
|||
* p.9 "credits" data → "credit" data |
|||
* p.10 The sentence starting "Informally..." is too long, split it into two, probably around the "and". |
|||
* p.10 Move the part in parentheses starting "other node ..." to a footnote, as it disrupts the flow of |
|||
the sentence. |
|||
* p.11 ... credits value ... |
|||
→ credit value ... |
|||
* p.12 The part about installation is a bit orphaned and not particularly relevant. It could be removed. |
|||
* p.12 Does "optimal result" here refer to "shortest edit distance", if so this should be replaced, if |
|||
not it should be explained. Change "it (now) uses an" to "recently it has begun to use an" for |
|||
ease of reading. |
|||
* p.12 It would be nice with the first example to display the output of GNU diff for comparison. |
|||
---- |
|||
$ diff -uw /tmp/1~ /tmp/2~ |
|||
--- /tmp/1~ 2008-04-05 20:30:14.000000000 +0100 |
|||
+++ /tmp/2~ 2008-04-05 20:30:10.000000000 +0100 |
|||
@@ -11,10 +11,10 @@ |
|||
<title2> |
|||
and the Order of the Phoenix |
|||
</title2> |
|||
- <edition>Hardcover</edition> |
|||
+ <edition>Paperback</edition> |
|||
<author>J. K. Rowling</author> |
|||
- <price>19.79</price> |
|||
- <isbn>043935806X</isbn> |
|||
+ <price>8.49</price> |
|||
+ <isbn>0439358078</isbn> |
|||
</book> |
|||
<book> |
|||
<rank>2</rank> |
|||
@@ -24,9 +24,9 @@ |
|||
<title2> |
|||
and the Order of the Phoenix |
|||
</title2> |
|||
- <edition>Paperback</edition> |
|||
+ <edition>Hardcover</edition> |
|||
<author>J. K. Rowling</author> |
|||
- <price>8.99</price> |
|||
- <isbn>0439358078</isbn> |
|||
+ <price>17.99</price> |
|||
+ <isbn>043935806X</isbn> |
|||
</book> |
|||
</charts> |
|||
---- |
|||
* p.13 Remove exclamation mark, for emphasis, use bold or italic face. |
|||
* p.13 ISBN is an acronym, should it not be in upper case? The "N" stands for "Number", so "ISBN number" |
|||
is redundant. |
|||
* p.14 A link should be given to libxml, and the version required specified. |
|||
* p.14 The xmldiff output looks considerably easier to read than the XUpdate syntax, as the examples are |
|||
contrived anyway, it might be better to convert the XUpdate syntax to the logilab syntax for |
|||
comparative purposes. This would probably serve better than "human readable" descriptions. |
|||
* p.16 It might be interesting to note that Figure 16 is quite similar to what would be expected from a |
|||
traditional "diff" using diffutils. |
|||
</pre> |
Latest revision as of 17:39, 8 June 2008
Schubert, E., Schaffert, S., and Bry, F. (2005). "Structure-preserving difference search for xml documents". Proceedings of the Extreme Markup Languages 2005 Conference, 1-5 August 2005, Montréal, Quebec, Canada
Summary[edit]
- Structure-Preserving Difference Search for XML Documents
The paper presents a strategy for measuring the difference between a pair of documents in XML. The authors report that this is an improvement over the more traditional strategies. The two traditional strategies tested against were longest common subsequence (LCS), as used by GNU diff -- which operates on the level of lines in a file, and shortest edit distance, as used by XMLDiff and similar programs -- which operates on nodes in a document tree. The authors state that the deficiency in these methods lies in the way that they do not represent changes as made by authors. Rather they try to make the "smallest possible" edit script or diff.
The authors present their method of "structure preserving difference", which instead of trying to find the smallest possible edit script, attempts to maximise the size of retained sub-structures (that is sub-trees of the XML document tree) maintained in changing one document to another. In order to calculate this, they model the document as a graph, where relations other than simple parent-child can be taken into account, for example ancestor-descendent, and sibling relationships.
Retained sub-structures are calculated as nodeset correspondences (set of mappings between equivalent nodes) between the original, and modified document. Two nodes are considered equal when they match a node similarity relation, which is user-specified. The most simple node similarity relation would be to consider two nodes equal when the labels are equal. Between two documents, there are many nodeset correspondences, so an algorithm needs to be used to find the optimal set of correspondences which maximises the retained sub-structures.
The algorithm used is Dijkstra's algorithm, where the edge weights of the search graph are calculated by different cost functions, where each node is a nodeset correspondence and each edge is a transition which adds adds a new mapping to the nodeset correspondene. The authors outline four possible cost functions. The first and most basic simply takes into account the number of sub-structures not retained in a mapping between nodes. Others provide for credits for unavoidably dropped mappings and calculation of incomplete mappings.
The nodeset correspondence is considered maximal if there are no other nodeset correspondences which contain mappings between more nodes.
The authors provide some details on experimental results in the form of comparative examples from both their prototype and from Logilab xmldiff. They comment that presenting detailed results is difficult as it would involve the inclusion of long XML documents, impractical in such a paper. They state that the output from their prototype is more likely to reflect changes made by humans than xmldiff.
They conclude the paper by giving some prospects for further research and development, including methods of improving the efficiency of the algorithm (search tree pruning and divide and conquer strategies) and taking advantage of other available information (attributes and DTDs).
Review[edit]
Revision 1 "Structure-preserving difference search for XML documents" Reviewer Recommendation Term: Revisions Comments to Editor: Review Sheet: General Judgement ============================================================ 1. Is the paper acceptable for publication (a) in its present form? NO (b) with minor revisions? YES Should the paper be reconsidered after major revision? Is it unacceptable for publication? 2. Please list any other general comments or specific suggestions in the separate blind comments to author's box. Comments to Author: There are numerous minor errors in the English, some of which will be outlined below. The paper would benefit from a thorough proof-reading. Comments on Paper: The introduction clearly lays out the problem. The state of the art section is well written, with a couple of minor quibbles that should be taken care of. The explanation of the methodology is adequate. The description of the data structures used (document graphs) and the quality measures (retained relations, node similarity etc.) is good except the part describing the search graph, which could do with some clarification. Perhaps it would be worth describing the "basic cost function" before describing the search graph and then outlining the remaining cost functions afterwards. In the results section I find that the paper focusses too much on the formats of the output rather than the actual character of the output itself. Regarding formatting, the style is less than optimal, equations are typeset too small, code fragments overlap the margins of the page, the text is not justified, simple dashes are used where mdashes should be used. Sometimes figures span over pages, which makes them hard to read. A list follows: * p.1 ... from _text oriented documents over languages aimed at exchanging messages between peers_ to ... What the underlined section means is not clear. Perhaps a "from X to Y and Z" would be a better structure for this section. * p.1 ... is so-called difference search ... is so-called → is the so-called * p.1 ... must both be ... and be ... → must be both ... and ... * p.2 The idea of "user-defined relations" could use some more explanation. * p.2 ... allows to add extensions easily. → allows extensions to be easily added. * p.2 It is stated that "three different cost functions" will be suggested, when in fact, there are four. * p.2 The output described is not the default output of the diff program, which uses less-than and greater-than symbol, but rather the 'unified diff', -u, output. * p.3 In giving the deficiencies of 'diff', it is stated that whitespace is significant. This is not entirely correct, the -w and -b options give the option of ignoring white space, although as the authors point out extra blank lines are still significant. This should be clarified. * p.3 It isn't clear in Figure 3 how the tags are mismatched. * p.4 An example of _this kind of systems_ ... → these kinds of systems ... * p.4 ... that allows to include ... → that allows the inclusion of ... * p.5 What the "solution" may be is not defined and not immediately obvious. * p.5 ... actual difference _are_ obviously ... → is * p.6 Having (document graph of the) in parentheses twice in the same sentence is a bit difficult to read, this could benefit from being re-worded. * p.6 ... we _shall usually_ ... → will [typically] * p.6 ... also allows _to model_ ... → "us to model" or "the modelling of" * p.6 In the equation S1 ⊂ ~, '~' is undefined. The next page defines it, but this definition should be given before the usage, or at least on the same page. * p.7 The <em> and <i> tags are not as far as I know semantically equal. They belong to different groups of of tags in XHTML (text and presentational respectively), and the semantics will vary depending on style-sheet. Emphasising (<em>) something could mean putting it in bold face, it does not necessarily mean italicising it. * p.7 ... the node similarity relation can in most cases not be derived ... → ... in most cases, the node similarity relation cannot be derived ... → ... the node similarity relation can in most cases neither be derived from the data, nor ... * p.7 Re-write the part in parentheses, "To a certain degree, X could be an interesting solution" * p.7 Complexity of the task ... → The complexity of the task ... * p.7 If every of these nodes is similar to every node in ... Re-write this sentence, the confusion between "every", "all" and "each" makes it difficult to understand. Check verb agreement "nodes is" → "nodes are". * p.8 Comments about Dijkstra's algorithm, e.g. named, described, taught etc. are probably superfluous. * p.9 It isn't entirely clear how the search graph is constructed, it would benefit from a diagram, such as those describing "retained relations" and "maximal nodeset correspondences". * p.9 "credits" data → "credit" data * p.10 The sentence starting "Informally..." is too long, split it into two, probably around the "and". * p.10 Move the part in parentheses starting "other node ..." to a footnote, as it disrupts the flow of the sentence. * p.11 ... credits value ... → credit value ... * p.12 The part about installation is a bit orphaned and not particularly relevant. It could be removed. * p.12 Does "optimal result" here refer to "shortest edit distance", if so this should be replaced, if not it should be explained. Change "it (now) uses an" to "recently it has begun to use an" for ease of reading. * p.12 It would be nice with the first example to display the output of GNU diff for comparison. ---- $ diff -uw /tmp/1~ /tmp/2~ --- /tmp/1~ 2008-04-05 20:30:14.000000000 +0100 +++ /tmp/2~ 2008-04-05 20:30:10.000000000 +0100 @@ -11,10 +11,10 @@ <title2> and the Order of the Phoenix </title2> - <edition>Hardcover</edition> + <edition>Paperback</edition> <author>J. K. Rowling</author> - <price>19.79</price> - <isbn>043935806X</isbn> + <price>8.49</price> + <isbn>0439358078</isbn> </book> <book> <rank>2</rank> @@ -24,9 +24,9 @@ <title2> and the Order of the Phoenix </title2> - <edition>Paperback</edition> + <edition>Hardcover</edition> <author>J. K. Rowling</author> - <price>8.99</price> - <isbn>0439358078</isbn> + <price>17.99</price> + <isbn>043935806X</isbn> </book> </charts> ---- * p.13 Remove exclamation mark, for emphasis, use bold or italic face. * p.13 ISBN is an acronym, should it not be in upper case? The "N" stands for "Number", so "ISBN number" is redundant. * p.14 A link should be given to libxml, and the version required specified. * p.14 The xmldiff output looks considerably easier to read than the XUpdate syntax, as the examples are contrived anyway, it might be better to convert the XUpdate syntax to the logilab syntax for comparative purposes. This would probably serve better than "human readable" descriptions. * p.16 It might be interesting to note that Figure 16 is quite similar to what would be expected from a traditional "diff" using diffutils.