
Question regarding zip and io.Reader/io.Writer. As far as I understand, one of the purposes of io.Reader/io.Writer is streaming. But should I implement one of these if my type does not really make sense "as chunks"?

For more details:

Let's say I have this struct:

type MyZip struct {
    file1, file2 []byte
}

MyZip represents a particular structured zip. Let's say, for example, it represents a zip file containing exactly a file named file1 and a file named file2. MyZip has the responsibility of parsing a zip file to extract these two files into the two []byte fields. It should also handle the other way around: turning the two []byte fields into two files named file1 and file2 archived in a zip file.

As far as I understand, the archive/zip package does not allow decompressing a zip file as a stream. We have to fully load the zip into memory or into a file and decompress afterwards.

So to refine my question, does it make sense for MyZip to implement io.Reader/io.Writer for reading/writing from/to the final zip file?

As said above, since I cannot extract the two files on the fly, I would have to add some sort of buffer to MyZip and just read/write from/to this buffer. So the zip would be fully stored in memory anyway before being streamed. Is that an argument against implementing io.Reader/io.Writer?

Thanks a lot for shedding light!

Loric

1 Answer


As far as I understand, the archive/zip package does not allow decompressing a zip file as a stream. We have to fully load the zip into memory or into a file and decompress afterwards.

Wrong. Some metadata needs to be loaded into memory, yes, but you do not need to load the whole archive. You can extract individual files from a zip archive. See How to unzip a single file?

Yes, zip.Reader and zip.Writer don't implement io.Reader and io.Writer, because they are not a single source or target of bytes. But the files in them are, so the files implement io.Reader and io.Writer. More specifically, a file in an archive is represented by a zip.File, which may be used to obtain an io.Reader for its (uncompressed) content using File.Open(). And when you add a new entry to a zip archive using e.g. Writer.Create(), it returns an io.Writer, because a new entry is a target of bytes: you can write the file's content into it.

Back to your example: MyZip also does not represent a single source or destination of bytes, so it doesn't make sense for MyZip itself to implement io.Reader or io.Writer; don't do it. Similarly to archive/zip, the individual files in it may do so.

icza
  • Why couldn't a zip file as a whole be a single source of bytes? We could still transfer it byte after byte without caring about the inner files. Do you mean that to be able to extract some files in a zip, we still need to iterate over the entire file to read that metadata? If I understand, that would effectively mean we cannot extract an inner file (or part of an inner file) from a chunk of a zip file. Do you confirm? And is it in that regard that `zip.Reader`/`zip.Writer` don't implement `io.Reader`/`io.Writer`? – Loric Sep 17 '19 at 08:27
  • Yes, an archive file can still be looked at as a single source of bytes, this is what the `archive/zip` package does, because it needs to parse it / extract data from it. But from a user's point of view, you're not interested in the compressed data and zip internals, you're interested in the decoded, decompressed data, which is a set of files, which is not a single source of bytes. – icza Sep 17 '19 at 08:46
  • And you do not need to read the whole file to extract a single file from it. Yes, it's not enough to just read the beginning of the file, as metadata may be "scattered" all over the zip file, but the implementation jumps over the data it doesn't need, and thus doesn't read the whole file. – icza Sep 17 '19 at 08:47
  • For example, the file headers in the zip archive contain the compressed length. The `archive/zip` package just "jumps" over (seeks) the uncompressed data until you explicitly ask for it e.g. with `File.Open()`. – icza Sep 17 '19 at 08:48
  • Thanks for the clarification. But there is still a little point I don't really grasp. Does it mean the current implementation of `archive/zip` does not allow extracting all files from a zip while it is being streamed (over a network connection for instance)? Say it receives the first 30kb, extracts what it can, then receives the next 30kb, extracts, etc... It seems to be up to the end user to decide which file to extract (and not at the same time as being streamed). – Loric Sep 17 '19 at 09:07
  • @Loric- Theoretically if you just have an input stream of the zip archive, you could iterate over it and just extract (save) what you need. Typical implementations however first _parse_ the whole file (it doesn't mean they _read_ the whole file) to gather what files are included in the archive. Same goes for `archive/zip` package: it first scans the archive to gather the file list, which it exposes via the `Reader.File` field which then you can iterate over and acquire `io.Reader` for those files (from the archive) you need. – icza Sep 17 '19 at 09:15
  • @Loric- That's why the `archive/zip` package provides you "only" 2 ways to process an archive: either a file (`zip.OpenReader()`) because a file is seekable, or the `zip.NewReader()` function which requires an `io.ReaderAt`, not just a simple `io.Reader`, the former allows the implementation to "jump" in the input to only read parts it needs. – icza Sep 17 '19 at 09:17
  • Thanks for globally explaining the guts of the `archive/zip` package! – Loric Sep 18 '19 at 11:53