I've located the end of a local file header in the download stream of a large zip file that

  • specifies deflate compression with
  • bit 3 set indicating the length of the compressed data follows the compressed data

and would now like to inflate that data using Node's zlib, but I cannot figure out how to feed data into zlib and receive feedback telling me when the deflate stream has self-terminated.
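For reference, the two conditions above can be checked from the fixed 30-byte part of the local file header. This is a minimal sketch, assuming the field offsets from the ZIP APPNOTE; `parseLocalFileHeader` is a hypothetical helper name, not part of Node or any package mentioned here, and ZIP64 is not handled.

```javascript
// Hypothetical helper: reads the fixed part of a ZIP local file header.
// Offsets follow the APPNOTE layout: signature(4) version(2) flags(2)
// method(2) time(2) date(2) crc32(4) csize(4) usize(4) nameLen(2) extraLen(2).
const LOCAL_HEADER_SIG = 0x04034b50;

function parseLocalFileHeader(buf) {
  if (buf.length < 30) return null; // not enough bytes buffered yet
  if (buf.readUInt32LE(0) !== LOCAL_HEADER_SIG) {
    throw new Error('not a local file header');
  }
  const flags = buf.readUInt16LE(6);
  const method = buf.readUInt16LE(8);
  const nameLength = buf.readUInt16LE(26);
  const extraLength = buf.readUInt16LE(28);
  return {
    isDeflate: method === 8,                   // compression method 8 = deflate
    hasDataDescriptor: (flags & 0x0008) !== 0, // bit 3: sizes follow the data
    compressedSize: buf.readUInt32LE(18),      // zero when bit 3 is set
    dataOffset: 30 + nameLength + extraLength, // where the compressed data starts
  };
}

// Hand-built example header for a file named "a.txt" with bit 3 set.
const header = Buffer.alloc(30 + 5);
header.writeUInt32LE(LOCAL_HEADER_SIG, 0);
header.writeUInt16LE(20, 4);     // version needed to extract
header.writeUInt16LE(0x0008, 6); // general purpose bit flag
header.writeUInt16LE(8, 8);      // compression method
header.writeUInt16LE(5, 26);     // file name length
header.write('a.txt', 30, 'ascii');

const parsed = parseLocalFileHeader(header);
console.log(parsed);
```

With bit 3 set the compressed size in the header is zero, so `dataOffset` is the only thing the header tells you about where the deflate data lives; its end has to come from the inflater.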

Does Node's zlib library support consuming chunks of deflate data and returning a result letting the caller know when the deflate stream has ended?

Or is this an insane thing to do, because it would imply I'm inflating on the UI thread, and what I should really do is save the downloaded file and, once downloaded, use an NPM package? Hm.. well.. either the network is faster than inflation, in which case streaming inflation would slow the network (bummer), or the network is slower than streaming inflation, so why inflate while streaming (which I can't figure out how to do anyway) when I could simply save to disk, then reload and inflate, while I'm sitting around waiting for the network..

Still, for my edification, I'd like to know if Node supports streaming inflation.

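var fs = require('fs');
var zlib = require('zlib');
var data = bufferOfChunkOfDeflatedData; // one chunk of compressed bytes from the download
var inflate = zlib.createInflateRaw(); // zip entries hold raw deflate data (no zlib header)
inflate.pipe(fs.createWriteStream(path)); // inflated output goes to the file
var result = inflate.write(data); // feed compressed data to the inflater, not to the file
// but result only signals backpressure; it doesn't indicate if the deflate stream has terminated...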

Describes deflate headers and how they encode the length of the stream: https://www.bolet.org/~pornin/deflate-flush-fr.html


In memory stream: https://www.npmjs.com/package/memory-streams


Well, this guy just pulls till he hits the magic signature! :) https://github.com/EvanOxfeld/node-unzip/blob/5a62ecbcef6523708bb8b37decaf6e41728ac7fc/lib/parse.js#L152


Node code for configuring convenience method: https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L83 Specifically: https://nodejs.org/api/zlib.html#zlib_zlib_inflateraw_buffer_options_callback


Eh, looks like Node is set up to return the decompressed buffer as one block to the callback; it doesn't look like Node is set up to figure out the end of the deflate stream.
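One possible way to learn where the deflate stream ended, at least when the data is already in memory, is the `info` option of the synchronous convenience methods. This is a sketch under two assumptions that are not confirmed anywhere in this thread: that the running Node version accepts `{ info: true }`, and that `engine.bytesWritten` then reports the number of input bytes the inflater actually consumed. The payload and trailing bytes are made up for the demonstration.

```javascript
const zlib = require('zlib');

// Made-up test data: raw deflate output followed by bytes that belong to
// whatever comes next in the file (here, something shaped like a descriptor).
const payload = Buffer.from('hello hello hello hello');
const compressed = zlib.deflateRawSync(payload);
const trailer = Buffer.from([0x50, 0x4b, 0x07, 0x08, 1, 2, 3, 4]);
const input = Buffer.concat([compressed, trailer]);

// Assumption: with info: true the call returns { buffer, engine } and
// trailing bytes after the end of the deflate stream are left unconsumed.
const result = zlib.inflateRawSync(input, { info: true });
const inflated = result.buffer;
const consumed = result.engine.bytesWritten; // input bytes used by inflate
const rest = input.subarray(consumed);       // data after the deflate stream

console.log(inflated.toString(), consumed, rest);
```

This does not answer the streaming question, since the whole input has to be buffered first, but it does avoid scanning for a signature.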

https://nodejs.org/api/stream.html#stream_transform_transform_chunk_encoding_callback says "The callback function must be called only when the current chunk is completely consumed." And here's the spot where it passes the chunk to zlib: https://github.com/nodejs/node/blob/6e56771f2a9707ddf769358a4338224296a6b5fe/lib/zlib.js#L358. So there's no opportunity to say the chunk was only partially consumed..


But then again... https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L161 but not really. Also just checks for the magic sig: https://github.com/ZJONSSON/node-unzipper/blob/affbf89b54b121e85dcd31adf7b1dfde58afebb7/lib/parse.js#L153


And from the zip spec:

4.3.9.3 Although not originally assigned a signature, the value 0x08074b50 has commonly been adopted as a signature value for the data descriptor record. Implementers SHOULD be aware that ZIP files MAY be encountered with or without this signature marking data descriptors and SHOULD account for either case when reading ZIP files to ensure compatibility.

So looks like everyone just looks for the sig.
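If the end of the compressed data is known by some other means, the optional signature can be handled without scanning: try both layouts at that exact offset and accept the one whose fields match what was observed while inflating. This is a rough sketch of that idea, not taken from any of the linked packages; `readDataDescriptor` and the `expected` object are hypothetical, the CRC value is arbitrary, and ZIP64 (8-byte sizes) is ignored.

```javascript
const DESCRIPTOR_SIG = 0x08074b50;

// expected: { crc32, compressedSize, uncompressedSize } gathered while
// inflating the entry (all unsigned 32-bit values in this sketch).
function readDataDescriptor(buf, offset, expected) {
  // True if a 12-byte crc/csize/usize record matching `expected` starts at `at`.
  const matches = (at) =>
    buf.length >= at + 12 &&
    buf.readUInt32LE(at) === expected.crc32 &&
    buf.readUInt32LE(at + 4) === expected.compressedSize &&
    buf.readUInt32LE(at + 8) === expected.uncompressedSize;

  // Layout with the commonly used signature in front of the record.
  if (matches(offset + 4) && buf.readUInt32LE(offset) === DESCRIPTOR_SIG) {
    return { hasSignature: true, length: 16 };
  }
  // Layout without a signature.
  if (matches(offset)) {
    return { hasSignature: false, length: 12 };
  }
  throw new Error('data descriptor does not match the inflated data');
}

// Hand-built descriptors in both layouts.
const expected = { crc32: 0xdeadbeef, compressedSize: 10, uncompressedSize: 25 };
const signed = Buffer.alloc(16);
signed.writeUInt32LE(DESCRIPTOR_SIG, 0);
signed.writeUInt32LE(expected.crc32, 4);
signed.writeUInt32LE(expected.compressedSize, 8);
signed.writeUInt32LE(expected.uncompressedSize, 12);
const unsigned = signed.subarray(4);

const withSig = readDataDescriptor(signed, 0, expected);
const withoutSig = readDataDescriptor(unsigned, 0, expected);
console.log(withSig, withoutSig);
```

Verifying the fields rather than trusting the signature is what keeps this from being fooled by a CRC or a run of compressed bytes that happens to equal 0x08074b50.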


Mark says that's a no-no... So don't do that. And know that if you're using an NPM lib to unzip, then there's a good chance the lib is doing that. To do it right would require, I think, grokking this from the zlib API docs: https://zlib.net/manual.html

The Z_BLOCK option assists in appending to or combining deflate streams. To assist in this, on return inflate() always sets strm->data_type to the number of unused bits in the last byte taken from strm->next_in, plus 64 if inflate() is currently decoding the last block in the deflate stream, plus 128 if inflate() returned immediately after decoding an end-of-block code or decoding the complete header up to just before the first byte of the deflate stream. The end-of-block will not be indicated until all of the uncompressed data from that block has been written to strm->next_out. The number of unused bits may in general be greater than seven, except when bit 7 of data_type is set, in which case the number of unused bits will be less than eight. data_type is set as noted here every time inflate() returns for all flush options, and so can be used to determine the amount of currently consumed input in bits.

This seems to indicate the final compressed bit will not be byte aligned. Yet the ZIP spec seems to indicate that the header that starts with the magic sig, the one everyone is using but shouldn't, is byte aligned: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

4.3.9.1 This descriptor MUST exist if bit 3 of the general purpose bit flag is set (see below). It is byte aligned and immediately follows the last byte of compressed data. This descriptor SHOULD be used only when it was not possible to seek in the output .ZIP file, e.g., when the output .ZIP file was standard output or a non-seekable device. For ZIP64(tm) format archives, the compressed and uncompressed sizes are 8 bytes each.

How can the end of the deflate stream not be byte aligned but the following Data descriptor be byte aligned?

Is there a nice reference implementation?


Reference impl using Inflate with Z_BLOCK: https://github.com/madler/zlib/blob/master/examples/gzappend.c


This guy reads backwards to pull out the directory: https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 Is this necessary?

This guy seems to think that zips cannot be inflated without reading the whole file to get to the directory: https://www.npmjs.com/package/yauzl#no-streaming-unzip-api

I don't see why that would be the case. The streams describe their length... and Mark verifies they can be streamed.


And here is where Node.js checks for Z_STREAM_END!

Christopher King
  • Looking for the signature is a bad idea not just because it isn't always there (which kills it right there), but even if you try to look for the next signature and backtrack, there is nothing that keeps any four-byte sequence from appearing in the compressed data. The only way to do this right, and the intent of that format option, is for inflate to tell you when the deflate stream ends. – Mark Adler Feb 26 '19 at 16:30
  • Agreed! Yet, both the popular NPM unzip implementations search for the sig... Mom said just cuz everyone else is doing it doesn’t mean I should! So, does there exist an NPM unzip package that does it right? Looking through the Node source, as best I can tell, the goal was to expose the inflator as a stream transform. And the stream protocol, as best I can tell, has no mechanism for partially consuming a pushed chunk... – Christopher King Feb 26 '19 at 17:37

1 Answer

It looks like it does, since the documentation lists zlib.constants.Z_STREAM_END as a possible return value.

Mark Adler
  • Thanks! So `Return codes for the compression/decompression functions` includes `zlib.constants.Z_STREAM_END` and the referenced zlib manual says `inflate() should normally be called until it returns Z_STREAM_END or an error` so now the question is which of the Node functions can return `Z_STREAM_END`.. – Christopher King Feb 26 '19 at 07:21
  • As far as I can tell, that constant is just copied over from the underlying C code. I don't find it used in the Node code nor any of the NPM consumer packages I surveyed. Maybe it's getting returned by some Node interop call... just hard to tell... – Christopher King Feb 26 '19 at 18:03
  • Sounds like a flawed interface then. I don't know node.js, but perhaps you can have an underlying stream that is feeding the inflator. Then when the inflator is done, if it was implemented properly, the data after the deflate stream should remain to be read in the underlying stream. – Mark Adler Feb 26 '19 at 22:56
  • Yeah. I'm a node.js + zlib, + zip noob so I can't say for sure. Now that we know definitively that scanning for the magic sig is a no-no then maybe a node.js expert will chime in and let us know if it's possible to correctly implement deflate using the existing node.js implementation. Or, maybe a zip expert will chime in and say that, while incorrect, in practice everyone just scans for the magic sig and that's why the value made its way into the official spec. – Christopher King Feb 26 '19 at 23:01
  • But while I have you, is it true the deflate stream may end on a non-aligned boundary? If so then how can the zip data header, the one following the deflate stream if bit 3 is set, be byte aligned? Or am I reading the specs wrong? Or are those extra bits simply ignored by zip implementations? – Christopher King Feb 26 '19 at 23:06
  • By "non-aligned", do you mean end in the middle of a byte? If so, then no. The deflate format specifies that the last byte is filled with zero bits to end the stream at a byte boundary. – Mark Adler Feb 27 '19 at 01:05
  • Ah! Thanks. Well, I opened this issue with node to see if they care to update their zlib library to allow for correct implementations of unzip. https://github.com/nodejs/node/issues/26332. Is this the reference implementation you would point them to for how to inflate when the length of the stream is unknown? https://github.com/madler/zlib/blob/master/examples/gzappend.c – Christopher King Feb 27 '19 at 09:00
  • No, that's more advanced than what you need. This [well-annotated zlib usage example](http://zlib.net/zlib_how.html) shows the normal use of inflate, where it stops when it gets to the end. I expect that they're doing something like that, but they do not seem to be providing a means for you to find out where they stopped. – Mark Adler Feb 27 '19 at 15:33
  • Still searching... Another thing, do you know if the Zip spec supports inflating a zip file without first reading the directory at the end of the file? So that to inflate, one simply parses the first header, inflates the stream, and repeats until the directory is encountered. I ask because I found two implementations that read the directory at the end before they inflate from the start. https://github.com/antelle/node-stream-zip/blob/907c8876e8aeed6c33a668bbd06a0f79e7a022ef/node_stream_zip.js#L180 and https://www.npmjs.com/package/yauzl#no-streaming-unzip-api. – Christopher King Feb 27 '19 at 21:01
  • Yes, you can process a zip file sequentially. See [sunzip](https://github.com/madler/sunzip). – Mark Adler Feb 27 '19 at 21:30
  • Mark, are you still there? If so, would you mind doing a code review for us over in Node world? Or actually, just drop a note at GitHub here https://github.com/nodejs/node/issues/26603#issuecomment-472236941 The Node team is inferring that inflate() returned Z_STREAM_END if there is input left to consume yet no output is generated. So they're assuming that for every input byte an output byte will be generated and flushed. Is that true? – Christopher King Mar 13 '19 at 00:57