I have many VTD+XML indexes for different versions of the same file. I am hoping to implement a diff-like method that returns the XPaths of nodes that have been modified between versions, as well as the differences in the text within those nodes.

I figure an existing algorithm such as the O(ND) difference algorithm would be best for comparing the text within two nodes. The approach I envisioned is to traverse the two documents simultaneously and store the XPath of any node whose text varies.
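A minimal sketch of that lockstep traversal, in Python rather than against the VTD-XML API, and assuming both versions have the same element structure. The function name `text_changes` is hypothetical, and `difflib.ndiff` from the standard library stands in for the O(ND) algorithm (it is a different diff algorithm, used here only because it ships with Python).

```python
import difflib
import xml.etree.ElementTree as ET


def text_changes(old_xml, new_xml):
    """Walk two same-shaped trees in lockstep; return (xpath, word-level diff) pairs."""
    changes = []

    def walk(old, new, path):
        old_text = (old.text or "").strip()
        new_text = (new.text or "").strip()
        if old_text != new_text:
            # Word-level diff; keep only removed (-) and added (+) tokens.
            delta = [d for d in difflib.ndiff(old_text.split(), new_text.split())
                     if d[0] in "+-"]
            changes.append((path, delta))
        seen = {}
        # zip() pairs children by position and stops at the shorter list,
        # so inserted or removed children are NOT detected by this sketch.
        for old_child, new_child in zip(old, new):
            seen[old_child.tag] = seen.get(old_child.tag, 0) + 1
            step = f"{old_child.tag}[{seen[old_child.tag]}]"
            walk(old_child, new_child, f"{path}/{step}")

    old_root = ET.fromstring(old_xml)
    new_root = ET.fromstring(new_xml)
    walk(old_root, new_root, f"/{old_root.tag}")
    return changes
```

Pairing children by position is exactly where this breaks down: as soon as one version gains or loses a node, every later sibling is compared against the wrong partner.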

The issue is that once I encounter new or removed nodes, how do I determine whether a node is in fact an inserted/removed node or a variation of an existing node?

Or maybe there is another approach I should be taking?

Raptop
  • Knowing whether two files are different is very different from knowing how different they are. The first can be done with one-way hashing such as SHA or one of its later variants. The second can be a lot more CPU-intensive, especially if you want very granular knowledge of where the differences are. Agree with me so far? – vtd-xml-author Jun 17 '17 at 03:14
  • Yes, I agree. I am looking for a very granular analysis of how different two XML files are, given their VTD+XML indexes. – Raptop Jun 17 '17 at 03:33

1 Answer

Maybe my interpretation of your question is not exactly on the mark, but I feel that what you are trying to do may not have an easy answer. Consider the following XML snippet:

<a>
   <b>text1</b>
   <b>text1</b>
</a>

and

<a>
   <b>text2</b>
   <b>text1</b>
</a>

You could say the second XML is simply the first one with text1 in the first b node replaced by text2.

But you could also say the second XML is the first one after removing the first b node, changing the text of the remaining (originally second) b node from text1 to text2, and then inserting a new b node containing text1 after it.

In summary, it seems you don't just want to know what the differences are, but also which changes led to those differences. This is difficult because different sequences of changes can lead to the same output.
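The ambiguity can be checked mechanically. The sketch below models each document as the list of texts of its b children and applies two different edit scripts to the first document; both yield the second. The `(op, index, value)` script format and the `apply_script` helper are made up for illustration and are not part of any library.

```python
first = ["text1", "text1"]   # text of the two <b> children in the first document
second = ["text2", "text1"]  # text of the two <b> children in the second document


def apply_script(texts, script):
    """Apply (op, index, value) edits in order to a copy of a child-text list."""
    result = list(texts)
    for op, index, value in script:
        if op == "replace":
            result[index] = value
        elif op == "delete":
            del result[index]
        elif op == "insert":
            result.insert(index, value)
        else:
            raise ValueError(f"unknown op: {op}")
    return result


# Reading 1: a single in-place text change on the first <b>.
script_one = [("replace", 0, "text2")]

# Reading 2: delete the first <b>, change the remaining one,
# then insert a new <b> after it.
script_two = [("delete", 0, None), ("replace", 0, "text2"), ("insert", 1, "text1")]

assert apply_script(first, script_one) == second
assert apply_script(first, script_two) == second
```

Since both scripts pass, the output alone cannot tell you which one actually happened; a diff tool can only pick one by convention, for example the shortest script.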

vtd-xml-author
  • As I am going to be defining the schema myself, I could use attributes to get around this, correct? For instance text1. When the contents of a tag do not match but the identifier matches, I can say that this is not a new tag, but that the contents inside the tag have changed. I think I just answered my own question... – Raptop Jun 25 '17 at 04:10
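A rough sketch of the identifier idea from the last comment, again in plain Python rather than VTD-XML. It assumes every element of interest carries a unique attribute, here hypothetically named `id`; the attribute name and the `diff_by_id` function are illustrative choices, not something taken from the original schema.

```python
import xml.etree.ElementTree as ET


def diff_by_id(old_xml, new_xml, key="id"):
    """Classify keyed elements as inserted, removed, or changed between two documents."""
    def index(xml_text):
        # Map identifier -> stripped text for every element carrying the key attribute.
        # Assumes identifiers are unique within a document.
        return {el.get(key): (el.text or "").strip()
                for el in ET.fromstring(xml_text).iter()
                if el.get(key) is not None}

    old, new = index(old_xml), index(new_xml)
    return {
        "inserted": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

With stable identifiers the matching step is no longer a guess: a key present on only one side is an insertion or removal, and a key present on both sides with different text is a modification whose text can then be handed to a text diff.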