Get the list of files stored in an archive hosted on an HTTP server without downloading it, using Python

Question

How to get the list of files contained in an archive hosted on an HTTP server without downloading the entire archive?

Interested in whether this is possible, specifically for very large rar/zip archives (1000GB :)) hosted remotely.

abarnert · Answer 1 · 2013-07-02T19:30:54.037

Possible? Probably. Easy? No.

If you control both sides, it would be much, much smarter to make the server store, or dynamically generate, a file-list for each archive. Similarly, if you don't control the server but do control intake to it, make the file-list generation part of the upload process.
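For the upload-side option, something along these lines might be enough. This is only a sketch: the function name `write_file_list` and the `.list.txt` sidecar suffix are made up for illustration, and it assumes the archive is a zip that is available locally at upload time.

```python
import zipfile


def write_file_list(archive_path, list_path=None):
    """Write one member name per line to a sidecar file next to the archive.

    Both the function name and the ".list.txt" suffix are hypothetical
    choices for this sketch, not an established convention.
    """
    if list_path is None:
        list_path = archive_path + ".list.txt"
    with zipfile.ZipFile(archive_path) as zf:
        names = zf.namelist()
    with open(list_path, "w", encoding="utf-8") as out:
        out.write("\n".join(names) + "\n")
    return list_path
```

A client can then fetch the small sidecar file instead of touching the archive at all.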

But if that's not feasible, you can do it.

If you look at how zipfiles work, you can see that it's possible to find the entire central directory by searching backward from the end. (The details are a bit different for Zip64 and Zip32, but section 4.3.6 shows the general idea, and you can read the individual sections for more information.)
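To make the backward search concrete, here is a minimal sketch that looks for the end-of-central-directory record in a buffer holding the tail of a zip file. It only handles the classic (non-Zip64) record, and `find_eocd` is a name invented for this example. An archive the size mentioned in the question would almost certainly need the Zip64 records as well, which this does not parse.

```python
import struct

EOCD_SIG = b"PK\x05\x06"               # end-of-central-directory signature
EOCD_FMT = "<4sHHHHIIH"                # fixed part of the classic record
EOCD_SIZE = struct.calcsize(EOCD_FMT)  # 22 bytes


def find_eocd(tail):
    """Search backward through `tail` for a plausible EOCD record.

    Returns a dict with the entry count, central directory size and
    central directory offset, or None if no record is found.
    Zip64 archives are NOT handled in this sketch.
    """
    pos = tail.rfind(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_SIZE <= len(tail):
            fields = struct.unpack_from(EOCD_FMT, tail, pos)
            comment_len = fields[7]
            # A real record is followed by exactly comment_len bytes,
            # which rules out signatures that merely appear in a comment.
            if pos + EOCD_SIZE + comment_len == len(tail):
                return {
                    "entries": fields[4],
                    "cd_size": fields[5],
                    "cd_offset": fields[6],
                }
        pos = tail.rfind(EOCD_SIG, 0, pos)
    return None
```

The offset and size it returns tell you which byte range of the remote file holds the central directory.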

Things are similar for rarfiles. If I remember correctly, RAR can store directory information in file headers anywhere in the archive, but this was only intended to be used for multi-file archives, and isn't actually used there, so you only need to read… I can't remember if it's the end plus a few bytes off the front or vice-versa, but either way, it's the same basic idea as with zip files. Read the spec and figure it out, or test truncated rar files yourself.

So, assuming your server supports Range requests, you can do something like this:

1. Do a HEAD to find the length of the file, and to verify that the server can Accept-Ranges: bytes.
2. Do a GET with a Range: bytes=…-… to read the last, say, 256KB.
3. Skim the resulting buffer to make sure it contains the entire central directory. If not, you have two choices:
   - Just blindly read the next 256KB from the end, and repeat until you're done.
   - Do some smarter parsing to figure out how many bytes you actually need (which isn't guaranteed to be possible, but almost always will be; if not, fall back to blindly reading 256KB until it is) and read that much.
4. Parse the directory.
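The steps above could be sketched roughly as follows with nothing but the stdlib. All three function names are invented for this example. It assumes a classic (non-Zip64) zip with nothing prepended to it, so that the offset stored in the end record is an absolute file offset, and it trusts the last `PK\x05\x06` it sees without validating the comment length. The range reader is passed in as a callable so the loop can be tried against local bytes before pointing it at a server.

```python
import struct
import urllib.request

CHUNK = 256 * 1024  # tail size suggested in the steps above


def remote_size(url):
    """HEAD request: return Content-Length if byte ranges are advertised."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        # Strict on purpose; some servers honor ranges without saying so.
        if resp.headers.get("Accept-Ranges", "").lower() != "bytes":
            raise RuntimeError("server does not advertise Accept-Ranges: bytes")
        return int(resp.headers["Content-Length"])


def http_read_range(url, start, end):
    """GET bytes start..end (inclusive) using a Range header."""
    req = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, end)})
    with urllib.request.urlopen(req) as resp:
        if resp.status != 206:
            raise RuntimeError("server ignored the Range header")
        return resp.read()


def read_central_directory_tail(total, read_range, chunk=CHUNK):
    """Grow a tail buffer until it covers the central directory.

    read_range(start, end) must return bytes start..end inclusive.
    Returns (tail, start), where `start` is the file offset of tail[0].
    """
    start = max(0, total - chunk)
    tail = read_range(start, total - 1)
    while start > 0:
        pos = tail.rfind(b"PK\x05\x06")
        if pos != -1 and pos + 22 <= len(tail):
            # Central directory offset sits 16 bytes into the end record.
            cd_offset = struct.unpack_from("<I", tail, pos + 16)[0]
            if cd_offset >= start:
                break                  # whole directory is in the buffer
            new_start = cd_offset      # "smarter" choice: read what's missing
        else:
            new_start = max(0, start - chunk)  # blind choice: one more chunk
        tail = read_range(new_start, start - 1) + tail
        start = new_start
    return tail, start
```

Against a real server, the callable would be something like `lambda s, e: http_read_range(url, s, e)` with `total = remote_size(url)`.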

But can the stdlib zipfile module handle reading just the end of a zipfile? It isn't actually documented to work, but… as it turns out (at least in CPython 2.7.2, 2.7.5, 3.2.3, and 3.3.2, and PyPy 1.9.0 and 2.0b1), it actually does enough for you.

So, you can just do this:

1. Do a HEAD to find the length of the file, and to verify that the server can Accept-Ranges: bytes.
2. Do a GET with a Range: bytes=…-… to read the last, say, 256KB.
3. Try to create a ZipFile out of the results.
   - If it works, call zf.namelist().
   - If it raises, read another 256KB and try again.
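A rough sketch of that retry loop, with the HTTP part abstracted away: `read_range(start, end)` is a hypothetical callable that returns bytes `start..end` inclusive (for example a thin wrapper around a ranged GET), and `total` is the length reported by the HEAD. The exception tuple is a guess at what "I need more data" looks like, not a verified list, and it seems to vary between Python versions.

```python
import io
import zipfile


def list_remote_zip(total, read_range, chunk=256 * 1024):
    """Return the member names of a zip, reading only its tail.

    total      -- size of the remote file in bytes
    read_range -- hypothetical callable: read_range(start, end) -> bytes,
                  with `end` inclusive
    """
    start = total
    tail = b""
    while start > 0:
        new_start = max(0, start - chunk)
        tail = read_range(new_start, start - 1) + tail
        start = new_start
        try:
            with zipfile.ZipFile(io.BytesIO(tail)) as zf:
                return zf.namelist()
        except (zipfile.BadZipFile, ValueError, OSError):
            # Assumed to mean "central directory not fully in the buffer".
            if start == 0:
                raise  # the whole file was read, so the error is real
    raise zipfile.BadZipFile("empty file")
```

Catching this broadly will also hide genuinely corrupt archives until the whole file has been read, which is part of why this route is less robust than parsing the directory yourself.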

If you want to know exactly which exceptions (and/or which errno values for OSError) to treat as "I need more data" instead of real exceptions, you'll need to read the source and/or do a lot of testing.

Anyway, this obviously won't be as efficient or as robust, but it'll be a lot simpler.

For RAR files, there's no stdlib module that does it, but there are a few alternatives available, like rarfile, so you can probably do something similar.
