1

Below is some code which extracts a file from a zip file containing only a single file. However, the extracted file does not match the same file extracted via WinZip or other zip utility. I expect that it might be off by a byte if the file contains an odd number of bytes (because my buffer is size 2 and I just abort once the read fails). However, when analyzing (using WinMerge or Diff) the file extracted with code below vs. file extracted via Winzip, there are several areas where bytes are missing from the Java extraction. Does anyone know why or how I can fix this?

package zipinputtest;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipInputStream;

public class test2 {
    public static void main(String[] args) {
        try {
            ZipInputStream zis = new ZipInputStream(new FileInputStream("C:\\temp\\sample3.zip"));
            File outputfile = new File("C:\\temp\\sample3.bin");
            OutputStream os = new BufferedOutputStream(new FileOutputStream(outputfile));
            byte[] buffer2 = new byte[2];
            zis.getNextEntry();
            while(true) {
                if(zis.read(buffer2) != -1) {
                    os.write(buffer2);
                }
                else break;
            }
            os.flush();
            os.close();
            zis.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I was able to produce the error using this image (save it and zip as sample3.zip and run the code on it), but any binary file of sufficient size should show the discrepancies.

enter image description here

Community
  • 1
  • 1
PentiumPro200
  • 641
  • 7
  • 23

2 Answers2

2

You can use a more verbatim way to check whether all bytes are read and written, e.g. a method like

  public int extract(ZipInputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[BUFFER_SIZE];
    int total = 0;
    int read;
    while ((read = in.read(buffer)) != -1) {
      total += read;
      out.write(buffer, 0, read);
    }
    return total;
  }

If the read parameter is not used in write(), the method assumes that the entire contents of the buffer will be written out, which may not be correct, if the buffer is not fully filled.

The OutputStream can be flushed and closed inside or outside the extract() method. Calling close() should be enough, since it also calls flush().

In any case, the "standard" I/O code of Java, like the java.util.zip package, have been tested and used extensively, so it is highly unlikely it could have a bug so fundamental as to cause bytes to be missed so easily.

PNS
  • 19,295
  • 32
  • 96
  • 143
  • 1
    OP, the `read` actually tells you how many bytes have been read – Scary Wombat Apr 20 '17 at 01:40
  • Yes, so he can use it to check against the file size. – PNS Apr 20 '17 at 01:41
  • Do you know if close() in C++ also invokes a call to flush()? If so, I'll just stop using flush as C++ and Java are the only languages that I use. I would expect any modern implementation of close() in any language to also call flush. – PentiumPro200 Apr 20 '17 at 04:15
  • @PentiumPro200 *Which* `close()` in C++? Java's `FilterInputStream.close()` calls `flush()` because there might be an application-side buffer. It would be pretty criminal of any I/O library not to flush any application cache before closing, but my patience with and interest in C++ I/O libraries is strictly limited. – user207421 Apr 20 '17 at 04:37
  • I was thinking of std::ofstream, I checked and it turns out that close() invokes a flush(). So I guess I don't need to call flush() anymore. It's basically the public restroom protocol...when you're done, don't flush the toilet, just close the door...easy to remember. kidding... http://www.cplusplus.com/reference/fstream/ofstream/close/ – PentiumPro200 Apr 20 '17 at 06:25
2
while (true) {
    if(zis.read(buffer2) != -1) {
        os.write(buffer2);
    }
    else break;
}

Usual problem. You're ignoring the count. Should be:

int count;
while ((count = zis.read(buffer2)) != -1)
{
    os.write(buffer2, 0, count);
}

NB:

  1. A buffer size of 2 is ridiculous. Use 8192 or more.
  2. flush() before close() is redundant.
user207421
  • 305,947
  • 44
  • 307
  • 483
  • I was using buffer size of 2 to demonstrate the problem. Could not produce the discrepancy with buffer size of 1. The full version of the code would also be setup to extract all the files in the .zip instead of just 1. Tried to keep the problem as small and concise as possible. – PentiumPro200 Apr 20 '17 at 04:02
  • Thanks EJP and PNS. I spent most of today trying to debug that issue. You both saved me a lot more debug time. Only question I have is why would the buffer not be completely filled unless it's the last read of the file? I'll use a counter anyhow, just curious. – PentiumPro200 Apr 20 '17 at 04:12
  • 1
    Why *would* it be filled? The contract for `InputStream.read()` doesn't require it. That's why it returns a count. Consider a socket: you get whatever has arrived in the socket receive buffer. Consider ZIP: you get whatever can be unzipped that will fit: however the protocol works, the next unzip piece might not fit into the current buffer. – user207421 Apr 20 '17 at 04:35
  • That makes sense. ZipInputStream receives data in chunks from FileInputStream. Even though in all cases except for the last, FileInputStream has more data to send, for some reason it doesn't fill the buffer every time. I was thinking that for efficiency, it would make sense to always fill the buffer when possible, but turns out that's not the case. – PentiumPro200 Apr 20 '17 at 06:19
  • It *would* make sense, but it may not always be possible, for the reason Instated. – user207421 Apr 21 '17 at 01:43
  • I posted a follow-up thread here: http://stackoverflow.com/questions/43521536/java-inputstream-read-buffer?noredirect=1#comment74106678_43521536 where Durandal stated: "And there is good reason the API is defined that way, it gives the implementation the freedom to return partial data as soon as it is available without blocking for extended periods of time while waiting for data to become available (think of SocketInputStream)." That makes sense and I had not considered that case earlier. – PentiumPro200 Apr 21 '17 at 18:11