How to get the list of files contained in an archive hosted on an HTTP server without downloading the entire archive?
Interested in whether this is possible, specifically for very large rar/zip archives (1000GB :)) hosted remotely.
Possible? Probably. Easy? No.
If you control both sides, it would be much, much smarter to make the server store, or dynamically generate, a file-list for each archive. Similarly, if you don't control the server but do control intake to it, make the file-list generation part of the upload process.
But if that's not feasible, you can do it.
If you look at how zipfiles work, you can see that it's possible to find the entire central directory by searching backward from the end. (The details are a bit different for Zip64 and Zip32, but section 4.3.6 shows the general idea, and you can read the individual sections for more information.)
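To make the "search backward from the end" idea concrete, here is a minimal sketch that looks for the end-of-central-directory record in the tail of a zipfile and unpacks its fixed 22-byte part. The function name and the returned dict keys are made up for illustration. It ignores Zip64 (where the 32-bit fields hold placeholder values) and could be fooled by an archive comment that happens to contain the signature bytes.

```python
import struct

# Signature and fixed-size layout of the end-of-central-directory record:
# signature, disk number, disk with central directory, entries on this disk,
# total entries, central directory size, central directory offset, comment length.
EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_STRUCT = struct.Struct("<4s4H2LH")  # 22 bytes


def find_end_of_central_directory(tail):
    """Locate and parse the EOCD record in the last bytes of a zipfile.

    `tail` is a bytes object holding the end of the archive. Returns a dict
    (hypothetical shape, for illustration) or None if no record is found.
    """
    # Search backward: the record is the last thing in the file, possibly
    # followed only by the archive comment.
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_STRUCT.size > len(tail):
        return None
    fields = EOCD_STRUCT.unpack_from(tail, pos)
    return {
        "entries": fields[4],    # total number of central directory entries
        "cd_size": fields[5],    # size of the central directory in bytes
        "cd_offset": fields[6],  # offset of the central directory in the full file
    }
```

With `cd_offset` and `cd_size` in hand, you know exactly which byte range of the remote file holds the central directory, so a second, precise request can fetch it if the first tail was too short.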
Things are similar for rarfiles. If I remember correctly, RAR can store directory information in file headers anywhere in the archive, but this was only intended to be used for multi-file archives, and isn't actually used there, so you only need to read… I can't remember if it's the end plus a few bytes off the front or vice-versa, but either way, it's the same basic idea as with zip files. Read the spec and figure it out, or test truncated rar files yourself.
So, assuming your server supports Range requests, you can do something like this:
1. Check that the server's response includes Accept-Ranges: bytes.
2. Send a request with Range: bytes=…-… to read the last, say, 256KB.
But can the stdlib zipfile module handle reading just the end of a zipfile? It isn't actually documented to work, but as it turns out (at least using the versions in CPython 2.7.2, 2.7.5, 3.2.3, and 3.3.2, and PyPy 1.9.0 and 2.0b1), it actually does enough for you.
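As a rough illustration of the range request itself, the sketch below uses only urllib from the stdlib. The function names, the 256KB default, and the choice to compute the range from Content-Length are my own assumptions, not something the server or the zip format dictates. Nothing here touches the network until `fetch_tail` is called.

```python
import urllib.request


def tail_request(url, content_length, tail_size=256 * 1024):
    """Build a GET request for the last `tail_size` bytes of `url`.

    `content_length` is the full size of the remote file, e.g. taken from
    the Content-Length header of a HEAD response.
    """
    start = max(content_length - tail_size, 0)
    end = content_length - 1  # byte ranges are inclusive
    return urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})


def fetch_tail(url, tail_size=256 * 1024):
    """Fetch the tail of a remote file, or raise if ranges are unsupported."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        if resp.headers.get("Accept-Ranges", "").lower() != "bytes":
            raise RuntimeError("server does not advertise byte ranges")
        content_length = int(resp.headers["Content-Length"])
    with urllib.request.urlopen(tail_request(url, content_length, tail_size)) as resp:
        # 206 Partial Content means the server honored the Range header;
        # a plain 200 would mean it is sending the whole file.
        if resp.status != 206:
            raise RuntimeError("server ignored the Range header")
        return resp.read()
```

Some servers honor Range without advertising Accept-Ranges, so the HEAD check is stricter than it strictly needs to be; the 206 check on the actual response is the more reliable signal.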
So, you can just do this:
1. Check that the server's response includes Accept-Ranges: bytes.
2. Send a request with Range: bytes=…-… to read the last, say, 256KB.
3. Construct a ZipFile out of the results.
4. Call zf.namelist().
5. If that fails because the tail was too short, read more and try again.
If you want to know exactly which exceptions (and/or which errno values for OSError) to treat as "I need more data" instead of real exceptions, you'll need to read the source and/or do a lot of testing.
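A minimal sketch of that approach follows. `fetch_tail` is a hypothetical callable that returns the last n bytes of the archive (for example, via a Range request). The set of exceptions treated as "I need more data" is a guess, not something the zipfile documentation promises, and this relies on undocumented behavior that may differ between Python versions.

```python
import io
import zipfile


def list_names_from_tail(fetch_tail, initial_size=256 * 1024, max_size=64 * 1024 * 1024):
    """List the member names of a zipfile using only its tail.

    `fetch_tail(n)` must return the last n bytes of the archive, or the
    whole archive if it is shorter than n.
    """
    size = initial_size
    while size <= max_size:
        tail = fetch_tail(size)
        try:
            # zipfile locates the central directory relative to the end of
            # the data, so a tail that contains all of it is enough.
            with zipfile.ZipFile(io.BytesIO(tail)) as zf:
                return zf.namelist()
        except (zipfile.BadZipFile, ValueError, OSError):
            # Assumption: these mean the tail did not reach back far enough.
            if len(tail) < size:
                raise  # we already had the whole file, so it is really broken
            size *= 2  # fetch a bigger tail and retry
    raise RuntimeError("central directory not found within %d bytes" % max_size)
```

Doubling the tail size keeps the number of requests small even when the central directory is much larger than the first guess. Only the names are usable here; trying to read a member's contents will fail, because the file data is not in the tail.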
Anyway, this obviously won't be as efficient or as robust, but it'll be a lot simpler.
For RAR files, there's no stdlib module that does it, but there are a few alternatives available, like rarfile, so you can probably do something similar.