How do I read selected files from a remote Zip archive over HTTP using Python?

Question

I need to read selected files, matching on the file name, from a remote zip archive using Python. I don't want to save the full zip to a temporary file (it's not that large, so I can handle everything in memory).

I've already written the code and it works, and I'm answering this myself so I can search for it later. But since evidence suggests that I'm one of the dumber participants on Stack Overflow, I'm sure there's room for improvement.

score 9 · Accepted Answer · Marcel Levy · 2009-01-10T00:03:25.650

Here's how I did it (grabbing all files ending in ".ranks"):

import urllib2, cStringIO, zipfile

# Python 2 code; `url` is assumed to hold the address of the remote zip
try:
    remotezip = urllib2.urlopen(url)
    # ZipFile needs a seekable file object, so buffer the response in memory
    zipinmemory = cStringIO.StringIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            ranks_data = archive.read(fn)
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib2.HTTPError:
    pass  # handle exception

edited Jan 10 '09 at 00:03

answered Sep 18 '08 at 17:03

Marcel Levy


You want to replace the first line with: import urllib2, zipfile. – Jim Sep 18 '08 at 17:08
Why don't you use `ZipFile(urllib2.urlopen(url))`? – jfs Sep 18 '08 at 17:39
I tried that, but I couldn't get it to work because even though it was a file-like object, it didn't support a particular function that Zipfile needed. That's why I buffered it with cStringIO. – Marcel Levy Sep 18 '08 at 17:43
The directory for a zip file is stored at the end, therefore the entire file must be downloaded before extraction, whether into memory, or on disk. – Ignacio Vazquez-Abrams Jan 10 '09 at 00:56
It's not that hard to create your own file-like object to wrap the url so you don't have to download the whole thing: http://stackoverflow.com/questions/7829311/is-there-a-library-for-retrieving-a-file-from-a-remote-zip/7852229#7852229 – retracile Oct 21 '11 at 16:23
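A rough Python 3 sketch of the idea in the last comment (not the code from the linked answer): a seekable file-like object that fetches only the byte ranges `zipfile` asks for. The names `RangeReader` and `http_range_reader` are made up for illustration, and the HTTP part assumes the server answers HEAD with a `Content-Length` and honours `Range` requests with a 206 response.

```python
import io
import urllib.request
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by a byte-range fetcher.

    `size` is the total length in bytes; `fetch(start, end)` must return the
    bytes from start to end inclusive. Both are illustrative names.
    """

    def __init__(self, size, fetch):
        super().__init__()
        self._size = size
        self._fetch = fetch
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")
        self._pos = base + offset
        return self._pos

    def read(self, n=-1):
        remaining = self._size - self._pos
        if n is None or n < 0 or n > remaining:
            n = remaining
        if n <= 0:
            return b""
        # One fetch per read call; no caching in this sketch
        data = self._fetch(self._pos, self._pos + n - 1)
        self._pos += len(data)
        return data


def http_range_reader(url):
    """Build a RangeReader for `url` (assumes the server supports Range)."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        size = int(resp.headers["Content-Length"])

    def fetch(start, end):
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (start, end)}
        )
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # server ignored the Range header
                raise OSError("server did not return partial content")
            return resp.read()

    return RangeReader(size, fetch)
```

With this, `zipfile.ZipFile(http_range_reader(url))` should read the central directory from the end of the archive and then only the members you ask for. Every `read` becomes a separate HTTP request, so for archives with many small members the in-memory approach in the answer may well be faster.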

score 5 · Answer 2 · answered Jun 04 '09 at 20:13

Thanks Marcel for your question and answer (I had the same problem in a different context and encountered the same difficulty with file-like objects not really being file-like)! Just as an update: For Python 3.0, your code needs to be modified slightly:

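import io, zipfile
import urllib.error, urllib.request

# `url` is assumed to hold the address of the remote zip
try:
    remotezip = urllib.request.urlopen(url)
    zipinmemory = io.BytesIO(remotezip.read())
    archive = zipfile.ZipFile(zipinmemory)
    for fn in archive.namelist():
        if fn.endswith(".ranks"):
            # read() returns bytes in Python 3, so decode before splitting
            ranks_data = archive.read(fn).decode("utf-8")  # assumes UTF-8 text
            for line in ranks_data.split("\n"):
                pass  # do something with each line
except urllib.error.HTTPError:
    pass  # handle exception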
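If it helps, the same logic can be pulled into a small helper that works on the already-downloaded bytes, which makes it testable without a network. This is only a sketch: `iter_matching_lines` is a hypothetical name, the glob pattern is matched against the full member name (so `*` also matches directory separators), and the members are assumed to be UTF-8 text.

```python
import fnmatch
import io
import zipfile


def iter_matching_lines(zip_bytes, pattern="*.ranks", encoding="utf-8"):
    """Yield (member_name, line) for each line of every member matching `pattern`."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # fnmatchcase keeps matching case-sensitive on every platform
            if fnmatch.fnmatchcase(name, pattern):
                with archive.open(name) as member:
                    # Decode lazily instead of loading the whole member as one string
                    for line in io.TextIOWrapper(member, encoding=encoding):
                        yield name, line.rstrip("\n")
```

You would call it as `iter_matching_lines(urllib.request.urlopen(url).read())` and loop over the pairs it yields.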

score 4 · Answer 3 · answered Jan 22 '13 at 14:43

This will do the job without downloading the entire zip file!

http://pypi.python.org/pypi/pyremotezip

Filipe Varela


Nice! Too bad this is py2 only. – mdaoust Jul 28 '22 at 16:05

score 1 · Answer 4 · answered Sep 18 '08 at 17:07

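Bear in mind that merely decompressing a ZIP file from an untrusted source can be a security risk: a small archive can expand to an enormous amount of data (a "zip bomb") and exhaust memory.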
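One way to limit the exposure when reading into memory, sketched under the assumption that you can pick a sensible upper bound for a member's size. `read_member_capped` and `MAX_MEMBER_BYTES` are illustrative names, not part of `zipfile`.

```python
import io
import zipfile

MAX_MEMBER_BYTES = 10 * 1024 * 1024  # illustrative limit, adjust to your data


def read_member_capped(archive, name, limit=MAX_MEMBER_BYTES):
    """Return the contents of `name`, refusing members larger than `limit` bytes."""
    info = archive.getinfo(name)
    # Cheap check first: the size the archive declares for this member
    if info.file_size > limit:
        raise ValueError("%s declares %d bytes, over the limit" % (name, info.file_size))
    with archive.open(info) as member:
        # Read at most limit + 1 bytes so the cap holds even if the declared size is wrong
        data = member.read(limit + 1)
    if len(data) > limit:
        raise ValueError("%s expanded past the limit" % name)
    return data
```

This only addresses memory use while reading. If you ever extract to disk instead, member names containing `..` or absolute paths are a separate concern.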

Jim

