I have two large XML files (c. 100MB each) containing a number of items. I want to output the difference between them.

Each item has an ID and I need to check if it's in both files. If it is then I need to compare the individual values for that item to make certain it's the same item.

Is a SAX parser the best way to solve this, and how is it used? I used ElementTree and findall, which worked on smaller files, but it loads everything into memory, so I can't use it for the large files.

from xml.etree.ElementTree import ElementTree

# srcFile and dstFile are the paths to the two XML files
# (dstFile is named here by analogy with srcFile)
srcTree = ElementTree()
srcTree.parse(srcFile)
dstTree = ElementTree()
dstTree.parse(dstFile)

# finds all the items in both files
srcComponents = srcTree.find('source').find('items')
srcItems = srcComponents.findall('item')
dstComponents = dstTree.find('source').find('items')
dstItems = dstComponents.findall('item')

# parses the source file to find the values of various fields of each
# item and adds the information to the source set
srcSet = set()
for item in srcItems:
  srcId = item.get('id')
  srcList = [srcId]
  details = item.find('values')
  srcVariables = details.findall('value')
  for var in srcVariables:
    srcList.append((var.get('name'), var.text))
  # one hashable tuple per item: (id, (name, text), (name, text), ...)
  srcSet.add(tuple(srcList))
charlie123
  • show us the failing code you wrote – wroniasty Jul 30 '12 at 10:51
  • It loaded everything into memory so it's not going to work for these files. I used element tree to get a tree of the data in each xml file. I used find on the tree to get all the items into a list. I then looped through these items to get the values of each item and stored the information in a set of tuples: [(id,val,val),(id,val,val)]. I did this for both files. Found the difference of the sets and then stored the result in a file. – charlie123 Jul 30 '12 at 10:59

1 Answer

You can use ElementTree as a pull parser, much like SAX (http://effbot.org/zone/element-pull.htm). ElementTree also has an iterparse function (http://effbot.org/zone/element-iterparse.htm). Both let you process large files without loading the whole tree into memory.
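As a rough sketch of how iterparse could be applied here, assuming the item/values/value layout from the question (the function names iter_items and diff_items are just illustrative, not part of any library):

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source):
    """Yield one (id, (name, text), ...) tuple per <item> in the input."""
    # The "end" event fires once an element and all its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            values = tuple((v.get("name"), v.text) for v in elem.iter("value"))
            yield (elem.get("id"),) + values
            # Drop the item's children and text. The emptied <item> element
            # stays attached to its parent, but it is small.
            elem.clear()


def diff_items(src, dst):
    """Return the item tuples that appear in only one of the two inputs."""
    return set(iter_items(src)) ^ set(iter_items(dst))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two files; real code would pass paths.
    src = io.BytesIO(b'<r><source><items><item id="1"><values>'
                     b'<value name="a">x</value></values></item></items></source></r>')
    dst = io.BytesIO(b'<r><source><items><item id="1"><values>'
                     b'<value name="a">y</value></values></item></items></source></r>')
    for entry in sorted(diff_items(src, dst)):
        print(entry)
```

The two sets of tuples are still held in memory, but they should be much smaller than two full element trees. If even that is too much, you would need something like a sorted merge or an on-disk index instead.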

SAX can also work (I have processed files much larger than 100MB with it), but I would use ElementTree for this job now.
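If you do go with SAX, a minimal sketch using the standard library's xml.sax might look like the following. It assumes the same item/values/value layout as the question; ItemHandler and load_items are hypothetical names for illustration.

```python
import io
import xml.sax


class ItemHandler(xml.sax.ContentHandler):
    """Collect one (id, (name, text), ...) tuple per <item> element."""

    def __init__(self):
        super().__init__()
        self.items = set()
        self._current = None    # list being built for the currently open <item>
        self._in_value = False  # True while inside a <value> element
        self._name = None       # "name" attribute of the open <value>
        self._text = []         # text chunks seen inside the open <value>

    def startElement(self, name, attrs):
        if name == "item":
            self._current = [attrs.get("id")]
        elif name == "value" and self._current is not None:
            self._in_value = True
            self._name = attrs.get("name")
            self._text = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._in_value:
            self._text.append(content)

    def endElement(self, name):
        if name == "value" and self._in_value:
            # Note: an empty <value> gives "" here, not None as in ElementTree.
            self._current.append((self._name, "".join(self._text)))
            self._in_value = False
        elif name == "item" and self._current is not None:
            self.items.add(tuple(self._current))
            self._current = None


def load_items(source):
    """Parse a filename or file-like object and return the set of item tuples."""
    handler = ItemHandler()
    xml.sax.parse(source, handler)
    return handler.items
```

Calling load_items on each file and taking the symmetric difference of the two sets (`^`) gives the items that differ. The handler has to track its own state, which is the main reason iterparse tends to be less work for this kind of task.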

Also have a look at incremental/event based parsing with lxml (etree compatible) http://lxml.de/tutorial.html#event-driven-parsing

And here is a good article on using iterparse with files > 1GB http://www.ibm.com/developerworks/xml/library/x-hiperfparse/

Tim Hoffman