1

Say I'm trying to read from a Java InputStream like this:

ZipInputStream zis = new ZipInputStream(new FileInputStream("C:\\temp\\sample3.zip"));
zis.getNextEntry();
byte[] buffer2 = new byte[2];
int count = zis.read(buffer2));
if(count != -1) //process...
else...//something wrong, abort

I'm parsing a binary file and I set my buffer to 2 in this case because I want to read the next short. I would set my buffer to size 4 if I want to read the next int and so on for other types. The problem is that sometimes zis.read(buffer) won't fill the buffer even when I know that there is enough unread data to fill the buffer. I could simply dump the entire file contents into an array and parse that, but then I end up implementing my own stream reader to do that which seems like re-inventing the wheel. I could also implement a read() function that checks the read count and if less than buffersize, request more data to fill the buffer, but that's inefficient and ugly. Is there a better way to do this?

This is a follow-up question to a question posted here:

Java ZipInputStream extraction errors

Community
  • 1
  • 1
PentiumPro200
  • 641
  • 7
  • 23
  • "The problem is that sometimes zis.read(buffer) won't fill the buffer even when I know that there is enough unread data to fill the buffer" - are you sure about this? That should not be happening. – f1sh Apr 20 '17 at 13:57
  • _"I could also implement a read() function that checks the read count and if less than buffersize, request more data to fill the buffer, but that's inefficient and ugly."_ - Ugly or not, that is how I/O works. – Sean Bright Apr 20 '17 at 14:02
  • Off-topic, but maybe useful for you (I just learned that myself recently): Do not use "new FileInputStream(...)", use "Files.newInputStream(java.nio.file.Path, java.nio.file.OpenOption...)" instead to avoid problems as described e.g. here: https://dzone.com/articles/fileinputstream-fileoutputstream-considered-harmful – AntiTiming Apr 20 '17 at 14:54
  • @f1sh, see my linked previous post. – PentiumPro200 Apr 20 '17 at 17:50
  • @l00tr, thanks for the reference. I'll start using the Files.* methods for opening streams. – PentiumPro200 Apr 20 '17 at 17:51
  • 1
    @PentiumPro200, there already is a `read` method that blocks until the requested amount of data is read, just not on `ZipInputStream`. `InputStream`s are designed to be wrapped by other `InputStream`s. [Stephen C's answer](http://stackoverflow.com/a/43522959/21926) below appears to be the best solution for your use case. If you don't want to read primitive by primitive, there is always [`DataInputStream.readFully()`](https://docs.oracle.com/javase/7/docs/api/java/io/DataInputStream.html#readFully(byte[])). – Sean Bright Apr 20 '17 at 18:01

3 Answers3

1

Is there a better way to do this?

Well ... a ZipInputStream ultimately inherits from InputStream so you should be able to wrap it with a BufferedInputStream and then a DataInputStream and read data using readShort, readInt and so on.

Something like this:

while (zis.getNextEntry() != null) {
  DataInputStream dis = new DataInputStream(new BufferedInputStream(zis));
  boolean done = false;
  do {
    short s = dis.readShort();
    int i = dis.readInt();
    ...
  } while (!done);
}

NB: you shouldn't close the dis stream as that would cause the zis to be closed. (Obviously, the zis needs to be closed at an outer level to avoid a resource leak.)

The BufferedInputStream in the stack ensures that you don't do lots of small reads on the underlying stream ... which would be bad.

The only possible gotcha is that its methods have particular ideas about how the binary data is represented; e.g. numbers are bigendian. If that is an issue, consider reading the entire zip entry into a byte array, and wrapping it in a ByteBuffer.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • Thanks Stephen C, wrapping the ZipInputStream in a BufferedInputStream and a DataInputStream seems to be the best solution. Notably, the DataInputStream has a readFully function which does exactly what I was looking for. It blocks until the request amount of data has been accumulated. Very nice! – PentiumPro200 Apr 20 '17 at 18:55
0

You need to check the byte count and keep reading until you have all the information you need

zis.getNextEntry();
byte[] buffer2 = new byte[2];
int count = 0;
while (count < 2) {
  int bytesRead = zis.read(buffer2, count, 2 - count));
  if(bytesRead != -1) {
    count += bytesRead;
  }
  else...//something wrong, abort
}
//process...
ControlAltDel
  • 33,923
  • 10
  • 53
  • 80
0

ZipInputStream conforms to the contract defined by InputStream. The read(byte[] ...) methods are allowed and documented to return either -1 for end of stream, or any value between (1...requested length).

And there is good reason the API is defined that way, it gives the implementation the freedom to return partial data as soon as it is available without blocking for extended periods of time while waiting for data to become available (think of SocketInputStream).

If you require a minimum amount of data you need to call read repeatedly until you have read as much data as needed to continue processing.

As for "thats inefficient and ugly", reading tiny amounts of data through the bulk-read methods incurs its own overhead, and possibly in the code you show also creation of a garbage byte[] for each data entity you read. For reading a handful of bytes, you could simpy use the read() method that returns a single byte, implemented in a simple utility method e.g.:

 static short readShort(InputStream in) throws IOException {
      short s = 0;
      for (int i=0; i<2; ++i) {
          int read = in.read();
          if (read < 0)
              throw new IOException("unexpected end of stream");
          s = (short) ((s << 8) | read);
      }
      return s;
 }

(this can be easily adapted to other primitive types)

Single byte I/O is in most cases totally acceptable, as long as you take care to ensure the InputStream is wrapped into a BufferedInputStream. The average overhead then reduces to a few array index bounds checks inside the BufferedInputStream. It won't cause an excessive number of calls down to the native data source.

Durandal
  • 19,919
  • 4
  • 36
  • 70