
I am requesting a zip file from an API and trying to retrieve it by byte range (by setting a Range header), then parse each of the parts individually. After reading a bit about gzip and zip compression, I'm having a hard time figuring out:

Can I parse a portion out of a zip file?

I know that a gzip file usually compresses a single file, so you can decompress and parse it in parts, but what about zip files?

I am using Node.js and have tried several libraries, such as adm-zip and zlib, but they don't seem to support this.

Alon Weissfeld

1 Answer


Zip files have a catalog (the central directory) at the end of the file, in addition to the same basic information before each item. The catalog lists the file names and the location in the zip file of each item. Generally each item is compressed using deflate, which is the same algorithm that gzip uses (but gzip puts its own header before the deflate stream and a CRC-32 and length trailer after it).
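As a rough illustration of reading that catalog from the tail, here is a minimal sketch that finds the end-of-central-directory record in the last bytes of an archive. The function name `findEndOfCentralDirectory` and the `tail` buffer are hypothetical, and the sketch assumes a plain (non-Zip64) archive whose final bytes have already been fetched, for example with a suffix Range request.

```javascript
// Sketch only: locate the zip end-of-central-directory (EOCD) record.
// `tail` is assumed to be a Buffer holding the final bytes of the archive.
// Zip64 archives use additional records that are not handled here.
const EOCD_SIGNATURE = 0x06054b50;
const EOCD_MIN_SIZE = 22; // fixed part, excluding the optional comment

function findEndOfCentralDirectory(tail) {
  // Scan backwards, because an archive comment may follow the record.
  for (let i = tail.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (tail.readUInt32LE(i) === EOCD_SIGNATURE) {
      return {
        entryCount: tail.readUInt16LE(i + 10),
        centralDirectorySize: tail.readUInt32LE(i + 12),
        // Offset of the central directory from the start of the archive.
        centralDirectoryOffset: tail.readUInt32LE(i + 16),
      };
    }
  }
  return null; // not found: the tail was too short, or this is not a zip
}
```

With `centralDirectoryOffset` and `centralDirectorySize` in hand, a second Range request can fetch just the catalog, which then gives the offset of each item.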

So yes, it's entirely feasible to extract the compressed byte stream for one item in a zip file and prepend a fabricated gzip header, so that you can decompress just that file by passing it to gunzip. A minimal gzip header is 10 bytes, and RFC 1952 also requires an 8-byte trailer after the data holding the CRC-32 and the uncompressed length.
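A minimal sketch of that wrapping step in Node.js follows. The helper name `wrapRawDeflateAsGzip` is hypothetical, and it assumes you already have the item's raw deflate bytes plus the CRC-32 and uncompressed size that the zip headers record for it. A minimal gzip header per RFC 1952 is 10 bytes, and the format also expects an 8-byte trailer after the data.

```javascript
const zlib = require('node:zlib');

// Sketch only: wrap a raw deflate stream taken from a zip entry in a
// minimal gzip container. `crc32` and `uncompressedSize` are assumed to
// come from the entry's zip header.
function wrapRawDeflateAsGzip(rawDeflate, crc32, uncompressedSize) {
  const header = Buffer.from([
    0x1f, 0x8b, // gzip magic number
    0x08,       // compression method: deflate
    0x00,       // flags: no optional fields
    0, 0, 0, 0, // modification time: not set
    0x00,       // extra flags
    0xff,       // operating system: unknown
  ]);
  const trailer = Buffer.alloc(8);
  trailer.writeUInt32LE(crc32 >>> 0, 0);            // CRC-32 of the original data
  trailer.writeUInt32LE(uncompressedSize >>> 0, 4); // size modulo 2^32
  return Buffer.concat([header, rawDeflate, trailer]);
}
```

The result can be passed to `zlib.gunzipSync` or piped through `zlib.createGunzip()`. Node's zlib module also exposes `zlib.inflateRawSync` and `zlib.createInflateRaw()`, which accept the raw deflate stream directly, so the wrapper may not be needed at all if the integrity check is done separately.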

If you want to write code to inflate the deflated stream yourself, I recommend you make a different plan. I've done it, and it's really not fun. If you must do it, use zlib; don't try to reimplement the decompression.

cliffordheath
  • So what I understand from this is that I need to get the zip catalog at the end of the file (does that mean that I first need to get the complete zip file? I'm trying not to request more than 5 MB at a time from the service) and then generate a gzip header so that I can decompress it by passing it to gunzip? The purpose here is to request no more than 5 MB at a time from a service and parse the returned bytes as they arrive – Alon Weissfeld Dec 10 '15 at 12:47
  • Yes. If you know the completed file size, then you can read the catalog in the tail, which tells you everywhere else you might want to look for the content files. This is good if you want a specific file. It's also possible to start at the beginning and decode one file at a time, because after the zip header each compressed file is preceded by its own header. – cliffordheath Dec 10 '15 at 21:43
  • So if it's possible to start at the beginning and decode one file at a time (or part of a file), it should work if I pass the returned data (part of a zip file) directly to gunzip, right? – Alon Weissfeld Dec 11 '15 at 18:23
  • You can't pass it directly. You must construct a gzip header (see the RFC here: https://www.ietf.org/rfc/rfc1952.txt), code here: https://github.com/madler/zlib/blob/master/deflate.c#L693-L707 – cliffordheath Dec 12 '15 at 21:42
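
The start-at-the-beginning approach mentioned in the comments depends on reading the header that precedes each compressed item. Below is a minimal sketch of parsing one such local file header. The function name `parseLocalFileHeader` is hypothetical, and it assumes `buf` already contains at least the 30-byte fixed part of the header plus the file name at `offset`.

```javascript
// Sketch only: parse a zip local file header starting at `offset` in `buf`.
const LOCAL_HEADER_SIGNATURE = 0x04034b50;
const LOCAL_HEADER_FIXED_SIZE = 30;

function parseLocalFileHeader(buf, offset = 0) {
  if (buf.length < offset + LOCAL_HEADER_FIXED_SIZE ||
      buf.readUInt32LE(offset) !== LOCAL_HEADER_SIGNATURE) {
    return null; // not enough bytes, or no local header at this offset
  }
  const flags = buf.readUInt16LE(offset + 6);
  const nameLength = buf.readUInt16LE(offset + 26);
  const extraLength = buf.readUInt16LE(offset + 28);
  const nameStart = offset + LOCAL_HEADER_FIXED_SIZE;
  return {
    method: buf.readUInt16LE(offset + 8), // 0 = stored, 8 = deflate
    // When bit 3 is set, the sizes and CRC-32 here are zero and the real
    // values follow the compressed data in a data descriptor.
    sizesInDataDescriptor: (flags & 0x08) !== 0,
    crc32: buf.readUInt32LE(offset + 14),
    compressedSize: buf.readUInt32LE(offset + 18),
    uncompressedSize: buf.readUInt32LE(offset + 22),
    fileName: buf.toString('utf8', nameStart, nameStart + nameLength),
    // The compressed bytes for this item begin here.
    dataOffset: nameStart + nameLength + extraLength,
  };
}
```

When `sizesInDataDescriptor` is false, the next item's header starts at `dataOffset + compressedSize`, which makes it possible to plan the next Range request. When it is true, the sizes are not known up front, and reading the catalog at the end of the file is the more dependable route.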