13

I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:

Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "file.py", line 5, in <module>
    code = xml.read()
MemoryError

This is the current code I have, to read the XML file:

from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)

Now, how would I go about eliminating this error so I can continue working on the script? I would try splitting the file into separate files, but since I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do this.

(The XML data is a database dump from a wiki I volunteer on; I'm using it to import data from different time periods, drawing directly on the contents of many pages.)

Hairr
    Do you have 20GB of ram? If not, even if you can get this to work it's going to be unbearably slow as it swaps in and out. There might be a way for you to operate on only chunks at a time with something like lxml, though. – Danica Feb 17 '13 at 18:09

1 Answer

21

Do not use BeautifulSoup to try and parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:

from xml.etree import ElementTree as ET

parser = ET.iterparse('pages_full.xml')

for event, element in parser:
    # element is a complete element once its closing tag has been parsed
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up to release the memory it holds
        element.clear()

By using an event-driven approach, you never need to hold the whole XML document in memory; you extract only what you need and discard the rest.

See the iterparse() tutorial and documentation.
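As a concrete sketch for a MediaWiki-style dump like pages_full.xml, assuming the interesting elements are <page> with <title> and <revision>/<text> children, and assuming the dump declares the export-0.8 namespace (check the root element of your file, as the exact version varies):

from xml.etree import ElementTree as ET

# Assumption: the namespace URI depends on the dump's export version.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

for event, element in ET.iterparse('pages_full.xml'):
    if element.tag == NS + 'page':
        title = element.findtext(NS + 'title')
        text = element.findtext(NS + 'revision/' + NS + 'text')
        # process the page here (e.g. filter by the time period you need)
        print(title, len(text or ''))
        # discard the element so memory use stays bounded
        element.clear()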

Alternatively, you can also use the lxml library; it offers the same API in a faster and more feature-rich package.
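A minimal sketch of the lxml variant, assuming the same file; lxml's iterparse() additionally accepts a tag= keyword to restrict which elements are yielded:

from lxml import etree

# Same streaming pattern as ElementTree; only the import changes.
for event, element in etree.iterparse('pages_full.xml'):
    if element.tag.endswith('page'):  # crude namespace-agnostic check for this sketch
        # handle the element, then free it
        element.clear()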

Martijn Pieters