I have some nearly identical XML that I am trying to compare. Having found this: Compare XML snippets? which pointed to this: https://bitbucket.org/ianb/formencode/src/tip/formencode/doctest_xml_compare.py#cl-70 I now have a way of testing two nodes.

The next step is to take the output from the node based test, and if False, step into all the children, and repeat the test.

I have written a walker the long way, that allows me to step through as many children as I want to write the code for:

if xml.xml_compare(a.root, b.root) == False:
    for i, node in enumerate(a.root):
        if xml.xml_compare(a.root[i], b.root[i]) == False:
            for j, node in enumerate(a.root[i]):
                if xml.xml_compare(a.root[i][j], b.root[i][j]) == False:
                    for k, node in enumerate(a.root[i][j]):
                        ....
                            if xml.xml_compare(a.root[i][j][k][l][m][n], b.root[i][j][k][l][m][n]) == False:

This is clearly not suitable for arbitrarily sized XML, and it's not very elegant. I think I need to write a generator to walk the XML under test. I saw that itertools is a way of doing this:

class XML_Tools(object):
    ....
    def iterparent(self, xml_object):
        """
        returns the parent and children of a node
        """
        for parent in xml_object.getiterator():
            for child in parent:
                yield self.parent, self.child

def main():
    a = ET.parse(open(file_a, "r"))
    b = ET.parse(open(file_b, "r"))
    xml.iterparent(a.root)
    for xml.parent, xml.child in xml.iterparent(a.root):
        print xml.parent, xml.child

But I couldn't find a way of getting a working xml.parent or xml.child object that I can operate on. I suspect I've messed up moving the function into a class, and am not passing in or getting back the right things.

What I want to do is find the source of a False comparison, and print the two offending data elements, and know where they live (or are missing from) in the two pieces of XML.

Jay Gattuso
  • I am not sure if this will be helpful at all, but in Python 3.3 you can use `yield from`, for instance, to implement something like a decision tree (which sounds like what you're doing) as in the recipe "Implementing the Iterator Protocol": http://chimera.labs.oreilly.com/books/1230000000393/ch04.html#_discussion_59 – erewok Aug 29 '13 at 21:21
  • @erewok thanks, I'm on 2.7 unfortunately. – Jay Gattuso Aug 29 '13 at 22:37
  • This seems highly inefficient - it looks like you're testing the full tree, then if they don't match, then you're iteratively checking smaller subsets (i.e. repeating a lot of comparisons, even if the comparisons bail at the first mismatch). It would probably be better to start comparing at the deepest level and work your way up to the full tree... – twalberg Sep 05 '13 at 17:36
  • @twalberg That's not very easy to implement if the structure of the XML documents can vary wildly. If you're looking for where the trees *diverge*, you need to be able to match up the trees, you can't just bail on the first failure. – millimoose Sep 06 '13 at 00:01
  • @twalberg Also for XML with a high degree of branching you're not really repeating that many comparisons, compared to the one you're doing at the top level. If each node has 4 children, you'll end up inspecting double-ish the content of the tree. – millimoose Sep 06 '13 at 00:11
  • @millimoose Understood. Efficiency is not really a primary concern for me, accuracy is. If I understand what you're saying, that's the trade off. – Jay Gattuso Sep 08 '13 at 22:19
  • @JayGattuso Not quite, I was saying that: a) There's a tradeoff between efficiency and how straightforward the implementation is. Starting at the top is obviously easier. And b) that the inefficiency of starting at the top is less dramatic than what twalberg makes it out to be, in the worst case you check the whole document `n` times if `n` is the depth of the tree. – millimoose Sep 08 '13 at 22:33
  • @JayGattuso That said I'm not sure what the result here is that you're going for. If say document one is ``, and document two is ``, what's the point of comparing `womble` to `cabbage` if you already know they differ in `foo` not being `bar`? What happens if doc1 is `` and doc2 is just ``? What if the trees diverge in multiple places? (E.g. `` vs. ``) You might want to actually specify what your algorithm is supposed to accomplish in a more precise way than "compare XML". – millimoose Sep 08 '13 at 22:43
  • @millimoose Ahh, OK. Now I understand. The source XML in both instances is not arbitrary; it's two pieces of XML that are very similar and share the same basic template, but will differ at some (unknown) elements, and by an unknown amount. The idea is to signal where there is a variance so I can process those two nodes further to look at what the difference is. – Jay Gattuso Sep 08 '13 at 23:48

1 Answer

I would suggest a recursive algorithm that takes two arguments: a list of the two items to compare, and the pass number. You would need a dictionary specifying which list to supply on each pass, and you could also write an algorithm to create that dictionary for n levels. Hope this helps. I could try to give example code if that would be more helpful.

EDIT:

n=3 ##Depth of tree

d={'0':['a.root', 'b.root', 0]}

for i in range(n):
    d[str(i+1)]=[d[str(i)][0]+'['+chr(105+i)+']', #since ord('i')=105, start
                 d[str(i)][1]+'['+chr(105+i)+']', # at i, j, k, etc
                 i+1                              #passNo
                ]

print(d)

def compare(points=d['0'], passNo=0):
    if xml.xml_compare(eval(points[0]), eval(points[1])) == False:
        exec('for'+str(chr(points[2]+105))+'in enumerate('+str(points[0])+\
             '): compare('+str(d[str(passNo+1)][0])+', '+str(d[str(passNo+1)][1])+')')

compare()

I apologise profusely for the messiness of the code, but I think this'll do what you want it to. I can't however test it without knowing how you've imported the xml modules/contents or what xml objects you're using. Hope this helps.
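A plainer recursive generator avoids eval and exec altogether and works to any depth. What follows is only a sketch under stated assumptions, not a tested drop-in: it uses xml.etree.ElementTree elements, pairs children purely by position, and uses a hypothetical shallow_equal helper in place of xml_compare (it checks tag, attributes and stripped text only, ignoring tail text and children). The names diff_nodes and shallow_equal and the path format are made up for illustration. It avoids `yield from`, so it should also run on Python 2.7.

```python
import xml.etree.ElementTree as ET


def shallow_equal(x, y):
    """Hypothetical stand-in for xml_compare: compares one node, not its children."""
    return (x.tag == y.tag
            and x.attrib == y.attrib
            and (x.text or '').strip() == (y.text or '').strip())


def diff_nodes(x, y, path=None):
    """Yield (path, node_a, node_b) wherever the two trees differ.

    A node is None when it is missing from that document.
    """
    if path is None:
        path = '/' + x.tag
    if not shallow_equal(x, y):
        yield path, x, y
        if x.tag != y.tag:
            return  # different elements: pairing their children is meaningless
    kids_x, kids_y = list(x), list(y)
    for i in range(max(len(kids_x), len(kids_y))):
        cx = kids_x[i] if i < len(kids_x) else None
        cy = kids_y[i] if i < len(kids_y) else None
        # Use "is not None": childless elements are falsy in ElementTree.
        present = cx if cx is not None else cy
        child_path = '%s/%s[%d]' % (path, present.tag, i)
        if cx is None or cy is None:
            yield child_path, cx, cy  # child exists in only one document
        else:
            for found in diff_nodes(cx, cy, child_path):
                yield found


# Two small made-up documents sharing a template.
a = ET.fromstring('<r><x k="1">foo</x><y>womble</y></r>')
b = ET.fromstring('<r><x k="1">bar</x><y>womble</y><z/></r>')
diffs = list(diff_nodes(a, b))
for path, node_a, node_b in diffs:
    print('%s: %s vs %s' % (path,
                            'missing' if node_a is None else node_a.tag,
                            'missing' if node_b is None else node_b.tag))
```

For the two sample strings this reports /r/x[0] (the text differs) and /r/z[2] (present only in the second document). Because each node is compared once, on its own, it does not repeat whole-subtree comparisons on the way down. Positional pairing does mean that an inserted or deleted sibling will make every later sibling look different.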

John Durrans
  • Thank you for your effort. I will have a play with this when I get into the office. I'm not 100% clear on what's happening in your example, but when I get to throw it at my source XML I'm sure it will make a lot more sense! – Jay Gattuso Sep 08 '13 at 22:17