12

I need to read selected files, matching on the file name, from a remote zip archive using Python. I don't want to save the full zip to a temporary file (it's not that large, so I can handle everything in memory).

I've already written the code and it works, and I'm answering this myself so I can search for it later. But since evidence suggests that I'm one of the dumber participants on Stack Overflow, I'm sure there's room for improvement.

Marcel Levy

4 Answers

9

Here's how I did it (grabbing all files ending in ".ranks"):

import urllib2, cStringIO, zipfile

try:
    remotezip = urllib2.urlopen(url)
    # ZipFile needs a seekable file object, so buffer the response in memory
    zipinmemory = cStringIO.StringIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)  # avoid shadowing the built-in zip()
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            ranks_data = archive.read(fn)
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib2.HTTPError:
    pass  # handle exception
Marcel Levy
  • You want to replace the first line with: import urllib2, zipfile. – Jim Sep 18 '08 at 17:08
  • Why don't you use `ZipFile(urllib2.urlopen(url))`? – jfs Sep 18 '08 at 17:39
  • I tried that, but I couldn't get it to work because even though it was a file-like object, it didn't support a particular function that Zipfile needed. That's why I buffered it with cStringIO. – Marcel Levy Sep 18 '08 at 17:43
  • The directory for a zip file is stored at the end, so the entire file must be downloaded before extraction, whether into memory or on disk. – Ignacio Vazquez-Abrams Jan 10 '09 at 00:56
  • It's not that hard to create your own file-like object to wrap the url so you don't have to download the whole thing: http://stackoverflow.com/questions/7829311/is-there-a-library-for-retrieving-a-file-from-a-remote-zip/7852229#7852229 – retracile Oct 21 '11 at 16:23
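The idea in the last comment can be sketched roughly as follows (Python 3). This is only an illustration, not the code from the linked answer: `RangeReader` and `fetch_range` are hypothetical names, and a real `fetch_range` would have to issue HTTP requests with a `Range` header against a server that supports them. Here an in-memory archive stands in for the remote file so the sketch runs without a network.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by a byte-range fetcher.

    fetch_range(start, end) is a hypothetical callable returning the bytes
    in [start, end); over HTTP it would send "Range: bytes=start-(end-1)".
    """

    def __init__(self, size, fetch_range):
        super().__init__()
        self._size = size          # total size of the remote file in bytes
        self._fetch = fetch_range
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise OSError("negative seek position")
        self._pos = new_pos
        return self._pos

    def read(self, size=-1):
        # A negative or missing size means "read to the end of the file".
        if size is None or size < 0:
            end = self._size
        else:
            end = min(self._pos + size, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end)
        self._pos += len(data)
        return data


# Build an in-memory archive that stands in for the remote zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("big.bin", bytes(200000))            # stored uncompressed
    zf.writestr("scores.ranks", "alice 1\nbob 2\n")
archive = buf.getvalue()

fetched = []  # every (start, end) range that was requested


def fetch_range(start, end):
    fetched.append((start, end))
    return archive[start:end]


reader = RangeReader(len(archive), fetch_range)
with zipfile.ZipFile(reader) as zf:
    ranks = zf.read("scores.ranks")
```

Summing the ranges recorded in `fetched` shows how much of the archive was actually requested: zipfile reads the directory at the end of the file and then seeks to the one member it needs, so `big.bin` is never transferred.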
5

Thanks Marcel for your question and answer (I had the same problem in a different context and encountered the same difficulty with file-like objects not really being file-like)! Just as an update: For Python 3.0, your code needs to be modified slightly:

import io, zipfile
import urllib.error, urllib.request

try:
    remotezip = urllib.request.urlopen(url)
    zipinmemory = io.BytesIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)  # avoid shadowing the built-in zip()
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            ranks_data = archive.read(fn)
            # read() returns bytes in Python 3, so split on a bytes separator
            for line in ranks_data.split(b"\n"):
                pass  # do something with each line
except urllib.error.HTTPError:
    pass  # handle exception
Tim Pietzcker
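For reference, the same in-memory approach could be split into a download step and an extraction step, which makes the extraction testable without a network connection. This is a sketch under assumptions: the function names `read_members` and `fetch_members` are made up for illustration, and decoding as UTF-8 is a guess about the file contents.

```python
import io
import urllib.request
import zipfile


def read_members(zip_bytes, suffix):
    """Return {member name: list of text lines} for names ending in suffix."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                # ZipFile.read() returns bytes; UTF-8 is an assumption here.
                result[name] = archive.read(name).decode("utf-8").splitlines()
    return result


def fetch_members(url, suffix=".ranks"):
    """Download the whole archive into memory, then extract matching members."""
    with urllib.request.urlopen(url) as response:
        return read_members(response.read(), suffix)


# Offline demonstration with an archive built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.ranks", "alice 1\nbob 2\n")
    zf.writestr("notes.txt", "ignored")
members = read_members(buf.getvalue(), ".ranks")
```

`fetch_members` still downloads the full archive, exactly as in the answers above; only the structure changes.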
4

This will do the job without downloading the entire zip file!

http://pypi.python.org/pypi/pyremotezip

1

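Bear in mind that merely decompressing a ZIP file from an untrusted source can be a security risk. For example, a small archive can expand to a very large amount of data (a "zip bomb"), which matters when everything is held in memory.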

Jim
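One possible precaution when reading members into memory is to check the declared uncompressed size before decompressing. The sketch below is an illustration, not a complete defence: `read_member_capped` and `MAX_MEMBER_BYTES` are hypothetical names, and the limit is arbitrary.

```python
import io
import zipfile

MAX_MEMBER_BYTES = 10 * 1024 * 1024  # arbitrary illustrative limit


def read_member_capped(archive, name, limit=MAX_MEMBER_BYTES):
    """Read one member, refusing anything larger than limit bytes."""
    info = archive.getinfo(name)
    # file_size is the uncompressed size declared in the archive's directory.
    if info.file_size > limit:
        raise ValueError("%s declares %d bytes, over the %d byte limit"
                         % (name, info.file_size, limit))
    with archive.open(info) as member:
        # Bound the read as well, as a second line of defence.
        data = member.read(limit + 1)
    if len(data) > limit:
        raise ValueError("%s expands past the %d byte limit" % (name, limit))
    return data


# Demonstration: 100,000 zero bytes compress to a tiny member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("small.ranks", "alice 1\n")
    zf.writestr("big.bin", bytes(100000))

zf = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
small = read_member_capped(zf, "small.ranks", limit=1000)
```

Calling `read_member_capped(zf, "big.bin", limit=1000)` on the same archive raises `ValueError` instead of allocating the expanded data.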